Pentaho Data Integration (Kettle): Sourcing data from Hadoop Hive
Tutorial Details
- Knowledge: Beginner
Has your company recently started using Hadoop to cope with enormous amounts of data? Have you been using Kettle for your ETL so far? As you are probably aware, the Kettle Enterprise Edition now lets you create MapReduce jobs for Hadoop. If you want to stay with the open source version, the good news is that it is very simple to connect to Hive - a data warehouse system that sits on top of Hadoop.
If you have one of the latest versions of Kettle installed, you will see that it already ships with the required Hive driver. Hence, setting up a connection to Hive is straightforward.
Create a new database connection
- Create a new transformation
- Click the View tab on the left-hand side
- Right click on Database connections and choose New
Alternatively, you can activate the Design tab, drag a Table input step onto the canvas, open it and click New to create a new database connection.
In the database settings window choose Generic database from the available databases.
For the connection URL insert the following:
jdbc:hive://youramazonec2url:10000
For the driver class name, specify:
org.apache.hadoop.hive.jdbc.HiveDriver
Depending on your setup, you might have to provide other details as well.
Click Test to verify that the connection works.
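If you want to sanity-check the driver and URL outside of Kettle first, a minimal standalone Java test could look like the following. This is only a sketch: it assumes the Hive JDBC jar and its Hadoop dependencies are on the classpath, and the host is the same placeholder as above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveConnectionTest {
    public static void main(String[] args) throws Exception {
        // Register the same driver class that Kettle uses
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        // Connect to the Hive server (replace the host with your own)
        Connection con = DriverManager.getConnection(
                "jdbc:hive://youramazonec2url:10000/default", "", "");
        Statement stmt = con.createStatement();
        // List the tables as a simple smoke test
        ResultSet rs = stmt.executeQuery("show tables");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}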
If you want to apply specific settings or use user-defined functions (UDFs), you can register them as follows:
- In the database settings window, click on Advanced in the left hand pane.
- Insert your statements into the field entitled Enter the SQL statements (separated by ;) to execute right after connecting (see the example below this list)
- Click Test again to check if a connection can be created with these specific settings
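For example, to tune a Hive setting and register a UDF on every new connection, the field could contain statements like these (the jar path, function name and class are placeholders for your own UDF):

set hive.exec.parallel=true;
add jar /path/to/your-udfs.jar;
create temporary function my_udf as 'com.example.hive.udf.MyUdf';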
Everything is now set up for querying Hive from Kettle. Enjoy!
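To run a first query, drop a Table input step onto the canvas, pick the new Hive connection and enter a HiveQL statement, for example (the table name is just a placeholder):

select * from your_table limit 10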