Tuesday, 27 August 2019

Pentaho Architecture

What is the Pentaho Client-Server Architecture?

       Let's take an example: we have many jobs in our full data pipeline. Saving everything on a local system is not a great idea. For that, Pentaho has a client-server architecture: a client is installed on your machine and connects to a server, say one running in the cloud. Many clients can connect at the same time and create and run their jobs using the same server.


What do you have to do for that?

    Create one cloud instance, for example an EC2 instance in the Amazon cloud. On that machine you have to install Pentaho the same way we did on our local machine: extract the package inside the machine. After that, install a database (you can use any RDBMS, such as MySQL or PostgreSQL), because we want to create a repository in the database so that many users can connect to that repository at a time.
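For example, on an Ubuntu-based EC2 instance the server-side setup could look roughly like this (a sketch: the zip file name below is just an example version, and package names differ on other distributions):

# Install PostgreSQL and unzip (assuming an Ubuntu instance).
sudo apt-get update && sudo apt-get install -y postgresql unzip
# Extract the Pentaho package; the file name is just an example version.
sudo unzip pdi-ce-8.3.0.0-371.zip -d /usr/local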

What do we need to do for the Pentaho repository?

    After database installation, we have to create one database and a username and password for it. On the Pentaho side, we have to do some configuration in the ~/.kettle directory of your system, where Kettle keeps its configuration files (such as kettle.properties and repositories.xml).
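For example, with PostgreSQL the database and user can be created like this (a sketch: the names pentaho_repo, pentaho, and secret are placeholders, so substitute your own):

sudo -u postgres psql <<'SQL'
-- placeholder names: replace pentaho_repo / pentaho / secret with your own
CREATE USER pentaho WITH PASSWORD 'secret';
CREATE DATABASE pentaho_repo OWNER pentaho;
SQL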

If repositories.xml already exists inside that folder, you can edit it; otherwise, create a new file
with the name repositories.xml:


<?xml version="1.0" encoding="UTF-8"?>
<repositories>
  <connection>
    <name>Your Database Repository Name</name>
    <server>Your machine IP address</server>
    <type>Your database type (e.g. POSTGRESQL)</type>
    <access>Native</access>
    <database>Database Name</database>
    <port>Database port</port>
    <username>Database username</username>
    <password>Database User Password</password>
    <servername/>
    <data_tablespace/>
    <index_tablespace/>
    <attributes>
      <attribute><code>FORCE_IDENTIFIERS_TO_LOWERCASE</code><attribute>N</attribute></attribute>
      <attribute><code>FORCE_IDENTIFIERS_TO_UPPERCASE</code><attribute>N</attribute></attribute>
      <attribute><code>IS_CLUSTERED</code><attribute>N</attribute></attribute>
      <attribute><code>PORT_NUMBER</code><attribute>5432</attribute></attribute>
      <attribute><code>PRESERVE_RESERVED_WORD_CASE</code><attribute>Y</attribute></attribute>
      <attribute><code>QUOTE_ALL_FIELDS</code><attribute>N</attribute></attribute>
      <attribute><code>SUPPORTS_BOOLEAN_DATA_TYPE</code><attribute>Y</attribute></attribute>
      <attribute><code>SUPPORTS_TIMESTAMP_DATA_TYPE</code><attribute>Y</attribute></attribute>
      <attribute><code>USE_POOLING</code><attribute>N</attribute></attribute>
    </attributes>
  </connection>
  <repository>
    <id>KettleDatabaseRepository</id>
    <name>Repository Name</name>
    <description>Database repository</description>
    <is_default>true</is_default>
    <connection>Your Database Repository Name</connection>
  </repository>
</repositories>


 After that, you have to restart the Carte server on your server machine using the following commands:

cd /usr/local/data-integration
mkdir -p carte_logs   # make sure the log directory exists
nohup sh carte.sh 0.0.0.0 8181 > carte_logs/carte.err.log 2>&1 &


Now your server is up and running. You can run any job on the server, or you can connect
the client machine to the database repository and run the job on the client as well.
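A quick way to confirm the server is reachable is Carte's status page. This sketch assumes the default community credentials (cluster/cluster, defined in the pwd/kettle.pwd file) and the host and port used above:

# Replace <server-ip> with your machine's IP; cluster/cluster is the default Carte login.
curl -u cluster:cluster http://<server-ip>:8181/kettle/status/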

How to connect from the client to the server machine?

Click on the Connect button.

Click on Repository Manager.


Click Add to create a new repository.


Click on Get Started.

Create the repository here: give a name to your repository, mention your server URL, and save.


After that, you will see the name of your created repository in the Connect option.

    When you connect to the repository, it will ask for the username and password of your repository. Fill in the username and password and you will be able to connect to the repository.

Now everything is done and your Pentaho client-server setup is ready to use. You can create a job or transformation and save it inside the root directory, or you can create a new directory; it is up to you.

Monday, 26 August 2019

Pentaho Data Integration


What is Pentaho?

Pentaho is an ETL tool for data engineering. Pentaho has both a community and an enterprise edition; which one you use depends on your company and project requirements and how you want to implement it in your system. It's a very easy and user-friendly tool for the ETL process. You can create transformations and jobs to handle your ETL tasks. Newer versions of Pentaho ship Big Data plugins as well, so if you want to work with Hadoop you can still create your jobs in Pentaho.


Here I am describing the community edition of Pentaho!

Where to Download?
https://community.hitachivantara.com/s/article/downloads

How to install?
You just have to unzip the package on your machine, in whatever location you prefer.
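For example, on Linux (a sketch: the zip file name is just an example version, and the target directory is up to you):

unzip pdi-ce-8.3.0.0-371.zip -d /usr/local   # use the file you downloaded
cd /usr/local/data-integration
sh spoon.sh   # launches the Spoon designer (use Spoon.bat on Windows)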


How does Pentaho work?

We can use Pentaho on both the client and the server end. On the client machine we have to install Data Integration, and after that it is just one click to launch and easy to use.



After that, you can just create a new job or a new transformation.


Here you have all the options; you have to choose the steps you need and just drag and drop them inside the transformation.


For example, if you have an Excel file and you want to put that data into your database table, you can get both the input and output steps from the Design tab and create the transformation.

In both steps you will get options for the details of the input and output; you have to fill in that information, and your transformation is ready to go.

After that, save your transformation locally and run it; you will get the logs on your screen.

If it fails, you can check the logs to see why it is failing; if the run succeeds, you can check your database table.

We can do the same with a job.
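You can also run a saved transformation or job from the command line with the bundled Pan and Kitchen scripts. The file paths below are just examples:

cd /usr/local/data-integration
sh pan.sh -file=/home/user/excel_to_table.ktr -level=Basic       # run a transformation
sh kitchen.sh -file=/home/user/my_etl_job.kjb -level=Basic       # run a job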

So we have many options here for ETL; we can design as per our requirements.

If any driver is not available inside Pentaho, we can download the JDBC driver and put it inside the Pentaho directory.

Let's take an example: if you are trying to connect to a PostgreSQL database and you are getting an error message that the driver is missing, you have to download the PostgreSQL JDBC driver and put it inside the lib directory.
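For example (a sketch, assuming PDI is extracted to /usr/local/data-integration; the driver version below is just an example):

wget https://jdbc.postgresql.org/download/postgresql-42.2.8.jar   # example version
cp postgresql-42.2.8.jar /usr/local/data-integration/lib/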


Restart your Pentaho and it is ready to go; now you can connect to your database. You have to follow the same steps for other databases.

If you create a full pipeline for your ETL process, your full Pentaho job will look like the image below.




Many jobs and transformations are included in the above example.

You can run your job locally or on the Pentaho Carte server.

For the client-server architecture of Pentaho, you can read my other blog post.