Saturday, 19 March 2011


Pentaho Data Integration (Kettle): Command line arguments and scheduling

Tutorial Details

  • Software: PDI/Kettle 4.1 (download here), MySQL Server (download here)
  • Knowledge: Intermediate (To follow this tutorial you should have good knowledge of the software and hence not every single step will be described)
  • OS: Linux or Mac OS X
  • Tutorial files can be downloaded here


Approach to provide arguments to one step in a transformation
Approach to provide arguments to more transformations and jobs
Using named Parameters
Scheduling a job on Linux

Once you have tested your transformations and jobs, there comes the time when you have to schedule them. You want to have a certain amount of flexibility when executing your Pentaho Data Integration/Kettle jobs and transformations. This is where command line arguments come in quite handy.

A quite common example is to provide the start and end date for a SQL query that imports the raw data. Kettle actually makes it very easy to set this up.

Approach to provide arguments to one step in a transformation


If you just need the arguments for one step only, then you can use the Get System Info step and create a hop to your Table input step.

We will be working with the following data set:

Open your favourite SQL client (and start your MySQL server if it is not running yet) and issue the following SQL statements:


USE test;

DROP TABLE IF EXISTS `sales`;

CREATE TABLE `sales` (
`date` DATETIME,
`product_type` VARCHAR(45),
`sales` INT(255)
);

INSERT INTO `sales` VALUES
('2010-01-20 00:00:00','Shoes',234),
('2010-01-20 00:00:00','Cheese',456),
('2010-01-21 00:00:00','Shoes',256),
('2010-01-21 00:00:00','Cheese',156),
('2010-01-22 00:00:00','Shoes',535),
('2010-01-23 00:00:00','Cheese',433);

SELECT * FROM `sales`;

CREATE TABLE `sales_staging` (
`date` DATETIME,
`product_type` VARCHAR(45),
`sales` INT(255)
);

Our goal is to provide the start and end date arguments to our SQL query.

Now let's create our transformation:

  1. Open Kettle and create a new transformation
  2. Drag and drop a Get System Info step on the canvas. You can find it in the Input folder on the left hand side.
  3. Double click on it and populate the names column in the grid with start_date and end_date.
  4. For the type choose command line argument 1 and command line argument 2 respectively


Now add a Table input step and a Table output step (we keep it very simple). Create hops between all these steps in the order they were mentioned.
Double-click on the Table input step and populate the SQL field with the query shown below:

SELECT
date
, product_type
, sales
FROM sales
WHERE
date>=? AND
date<?
;

You can feed the start and end date from the Get System Info step into the Table input step and use them in the WHERE clause of your SQL query. The question marks will be replaced on execution by the start and end date (but make sure they are defined in this order in the Get System Info step).

Make sure that you enable Replace variables in script? and choose the Get System Info step for Insert data from step. 

Define a New ... connection (Connection Name: Localhost, Connection Type: MySQL, Host Name: localhost, Database Name: test, Port Number: 3306, your user name and password).

Click OK. The hop between the Get System Info step and the Table Input step now also displays an info icon.


And this is all that you have to do: Your transformation now accepts command line arguments!

So now let's try to execute the transformation from the command line. Close all the files that we just created, then open your Terminal window.




My transformation is located in:
/Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/command_line_arguments/tr_get_command_line_arguments.ktr

To execute a transformation from the command line, we have to call pan.sh, which is located in my case in:

/Applications/Dev/PDI\ 4.1\ RC1/pan.sh

Use the following approach to execute the transformation (replace the file paths with your own):

Change to the PDI directory:

cd /Applications/Dev/PDI\ 4.1\ RC1/

Use the super user and provide the password:

sudo su

Issue the following command (replace the paths with your own):

./pan.sh -file='/Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/command_line_arguments/tr_get_command_line_arguments.ktr' '2010-01-20 00:00:00' '2010-01-22 00:00:00' -Level=Basic > /Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/command_line_arguments/tr_get_command_line_arguments.err.log

Command line parameters have to be specified after the file argument, in the order that you expect them to be received in Kettle. The Level argument specifies the logging level. The following levels are available (from the most detailed to the least detailed one): Rowlevel, Debug, Detailed, Basic, Minimal, Error, Nothing.

Pan accepts many more arguments, e.g. for connecting to a repository. Please have a look at the Pan User Documentation for all the details.
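For example, if your transformation is stored in a repository instead of a file, the call looks roughly like this (just a sketch; the repository name, credentials and directory are placeholders you have to replace with your own):

./pan.sh -rep=my_repository -user=admin -pass=admin -dir=/ -trans=tr_get_command_line_arguments '2010-01-20 00:00:00' '2010-01-22 00:00:00' -Level=Basic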

Once the command has been issued and no error message is returned (check the log file), let's check the data that got exported to our output table:


As you can see from the screenshot above, only the data covering our specified timeframe got processed.
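If you prefer to check the result from the Terminal as well, you can query the output table with the MySQL command line client (a minimal sketch; replace root with your MySQL user):

# query the output table populated by the Table output step
mysql -u root -p test -e "SELECT * FROM sales_staging;"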

Approach to provide arguments to more transformations and jobs

If you plan to use the command line arguments in more than one step and/or in more than one transformation, the important point is that you have to set them as variables in a separate transformation, which has to be executed before the transformation(s) that require these variables. Let's call this transformation Set Variables.

The Set Variables transformation has two steps:

  • Get System Info: It allows you to define variables that are expected to come from the command line
  • Set Variables: This one will then set the variables for the execution within Kettle, so that you can use them in the next transformation that is specified in your job. 

Note: The command line arguments enter Kettle as a String. In some cases the variable is expected to be of a certain data type. Then you will have to use the Get Variable step in a succeeding transformation to define the specific data type for each variable.

There is no additional adjustment needed. Do not fill out the Parameters tab in the Transformation properties or Job properties with these variables!

We can now change our main transformation to make use of these variables.

A typical scenario would be the following: Our ETL process populates a data warehouse (DWH). Before we insert the compiled data into the DWH, we want to make sure that the same data doesn't already exist in it. Hence we decide that we just want to execute a DELETE statement that clears the way before we add the newly compiled data.

Our ETL job will do this:

  1. Initialise the variables that are used throughout the job (done in a dedicated transformation)
  2. Delete any existing DWH entries for the same time period (done in a SQL job entry)
  3. Main ETL transformation

Let's start:


  1. Create a new transformation and call it tr_set_variables.ktr
  2. Drag and drop a Get System Info step on the canvas. You can find it in the Input folder on the left hand side.
  3. Double click on it and populate the names column in the grid with start_date and end_date.
  4. For the type choose command line argument 1 and command line argument 2 respectively
  5. Next drag and drop the Set Variables step from the Job Folder onto the canvas and create a hop from the Get System Info step to this one.
  6. Double click the Set Variables step and click on Get Fields:
    Clicking Get Fields will automatically define all input fields as variables. If you don't need all, just delete the relevant rows. In our case we want to keep all of them. Kettle will also automatically capitalize the variable names. As I want to avoid any confusion later on, I explicitly prefix my variables in Kettle with VAR_. You can also define scope type and set a default value.

We have now created a transformation that accepts command line arguments and sets them as variables for the whole job.

Next, let's create the main ETL transformation:

  1. Open tr_get_command_line_arguments (which we created earlier on) and save it as tr_populate_staging_tables.  
  2. Delete the Get System Info step. We don't need this step any more as we define the variables already in tr_set_variables.
  3. Double click the Table input step. As our variables can now be referenced by name, we have to replace the question marks (?) with our variable names like this:

    SELECT
      date
    , product_type
    , sales
    FROM sales
    WHERE
    date>="${VAR_START_DATE}" AND
    date<"${VAR_END_DATE}"
    ;
    The variables are now enclosed by quotation marks as we want the date to be treated as a string.
  4. Click Ok and save the transformation.
As our transformations are finished now, we can start creating a job that executes them in a defined order (we will keep this job rather simple; I suggest adding error handling):
  1. Create a new job and name it jb_populate_staging_tables.
  2. Insert the following job entries in the order specified and connect them with hops:

  1. Start entry
  2. Transformation entry: Double click on it and choose tr_set_variables.ktr as the Transformation filename.
  3. From the Script Folder choose the Execute SQL script ... job entry: Define a New ... connection (Connection Name: Localhost, Connection Type: MySQL, Host Name: localhost, Database Name: test, Port Number: 3306, your user name and password). Tick Use variable substitution?. Insert following query:

    DELETE FROM
    sales_staging
    WHERE
    date>="${VAR_START_DATE}" AND
    date<"${VAR_END_DATE}"
    ;
    Pay attention to the WHERE clause: The variables are again enclosed by quotation marks as we want the date to be treated as a string. Also note that the date restriction is exactly the same as the one we use for the raw data import.
  4. Transformation entry: Double click on it and choose tr_populate_staging_tables.ktr as the Transformation filename.
So now let's try to execute the job from the command line. Close all the files that we just created, then open your Terminal window.



My job is located in:
/Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/command_line_arguments/jb_populate_staging_tables.kjb

To execute a job from the command line, we have to call kitchen.sh, which is located in my case in:

/Applications/Dev/PDI\ 4.1\ RC1/kitchen.sh

Use the following approach to execute the job (replace the file paths with your own):

Change to the PDI directory:

cd /Applications/Dev/PDI\ 4.1\ RC1/

Use the super user and provide the password:

sudo su

Issue the following command (replace the paths with your own):

./kitchen.sh -file='/Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/command_line_arguments/jb_populate_staging_tables.kjb' '2010-01-20 00:00:00' '2010-01-22 00:00:00' -Level=Basic > /Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/command_line_arguments/jb_populate_staging_tables.err.log

Command line parameters have to be specified after the file argument, in the order that you expect them to be received in Kettle. The Level argument specifies the logging level. The following levels are available (from the most detailed to the least detailed one): Rowlevel, Debug, Detailed, Basic, Minimal, Error, Nothing.

Kitchen accepts many more arguments, e.g. for connecting to a repository. Please have a look at the Kitchen User Documentation for all the details.

Inspect the error log to see if the job ran successfully. Then have a look at the staging table to see if the data got imported.
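Both checks can be done straight from the Terminal (a small sketch, assuming the MySQL command line client is installed; adjust the user name and the log file path if yours differ):

# show the last lines of the log written by kitchen.sh
tail -n 20 /Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/command_line_arguments/jb_populate_staging_tables.err.log

# check the records that were loaded into the staging table
mysql -u root -p test -e "SELECT * FROM sales_staging;"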

Using named Parameters

Named parameters are special in the sense that they are explicitly named command line arguments. If you pass on a lot of arguments to your Kettle job or transformation, it might help to assign those values to an explicitly named parameter. 

Named parameters have the following advantages:

  • On the command line you assign the value directly to a parameter, hence there is zero chance of a mix-up.
  • A default value can be defined for a named parameter
  • A description can be provided for a named parameter
  • No need for an additional transformation that sets the variables for the job

Let's reuse the job that we created in the previous example:


  1. Open jb_populate_staging_tables.kjb and save it as jb_populate_staging_tables_using_named_params.kjb.
  2. Delete the Set Variables job entry and create a hop from the Start entry to the Execute SQL script entry.
  3. Press CTRL+J to call the Job properties dialog.
  4. Click on the Parameters tab and specify the parameters like this:
    In our case we don't define a default value. The reason for this is that we don't want to import any raw data in case there is no start and end date defined.
  5. Click Ok and save the job.


Our job is completely set up. Let's execute it on the command line:

./kitchen.sh -file='/Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/command_line_arguments/jb_populate_staging_tables_using_named_params.kjb' -param:VAR_START_DATE='2010-01-20 00:00:00' -param:VAR_END_DATE='2010-01-22 00:00:00' -Level=Basic > /Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/command_line_arguments/jb_populate_staging_tables_using_named_params.err.log

I described the various Kitchen arguments in the previous section, so I won't repeat them here. The only difference here are the named parameters.

Inspect the error log to see if the job ran successfully. Then have a look at the staging table to see if the data got imported.

As you can see, named parameters are the crème de la crème!

Scheduling a job on Linux

Now that we have quite intensively explored the possibilities of passing command line arguments to Kettle, it's time to have a look at scheduling:

On Linux, crontab is a popular utility that allows you to schedule processes. I will not explain crontab here; if you are new to it and want to find out more about it, have a look here.

Our plan is to schedule a job to run every day at 23:00. We pass two command line arguments to this job: the start and the end datetime. Each run of this job has to import the raw data from two days ago, 23:00, until yesterday, 23:00. To calculate the start and end date for the raw data processing we will write a shell script. The plan is to schedule this shell script using crontab.

You can edit the crontab by issuing the following command:

crontab -e 

This will display any scheduled processes. If you are familiar with vi, you can use the same commands here to edit and save. Press i to insert the following:

00 23 * * * /jb_populate_staging_tables_daily.sh

Press ESC followed by :wq to save and exit crontab.
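To verify that the entry was saved, you can list your current crontab:

crontab -l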

Navigate to the folder where you saved all the jobs and transformations. Create this shell script with vi and name it jb_populate_staging_tables_daily.sh:

cd /Applications/Dev/PDI\ 4.1\ RC1;./kitchen.sh -file='/Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/command_line_arguments/jb_populate_staging_tables.kjb' "`date --date='2 days ago' '+%Y-%m-%d 23:00:00'`" "`date --date='1 day ago' '+%Y-%m-%d 23:00:00'`" -Level=Basic > populate_staging_tables_daily.err.log

Note: We enclosed each argument in double quotes because the datetime values contain a blank (otherwise the shell would expect another argument). The enclosing back ticks indicate that a shell command (here: date) has to be executed and its output substituted. The format string passed to date also contains a blank, which is why it is enclosed in single quotes.
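If you prefer a more readable version, the one-liner above can also be written out as follows (the same paths and logic, just split up; the variable names are only there for readability, and date --date requires GNU date, so on Mac OS X you would use date -v instead):

#!/bin/sh
# jb_populate_staging_tables_daily.sh: calculate the import window and run the Kettle job

PDI_DIR="/Applications/Dev/PDI 4.1 RC1"
JOB_FILE="/Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/command_line_arguments/jb_populate_staging_tables.kjb"

# import window: from two days ago 23:00 until yesterday 23:00 (GNU date syntax)
START_DATE="`date --date='2 days ago' '+%Y-%m-%d 23:00:00'`"
END_DATE="`date --date='1 day ago' '+%Y-%m-%d 23:00:00'`"

cd "$PDI_DIR"
./kitchen.sh -file="$JOB_FILE" "$START_DATE" "$END_DATE" -Level=Basic > populate_staging_tables_daily.err.log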

Our job is now scheduled. Make sure that you check after the first run for any errors.

In this article you learnt about creating flexible jobs and transformations by using command line arguments. We also had a quick look at scheduling your jobs. I hope this article demonstrated that it is quite easy to set this up.

Wednesday, 23 February 2011

Pentaho Data Integration: Designing a highly available scalable solution for processing files


Tutorial Details

  • Software: PDI/Kettle 4.1 (download here) and MySQL server, both installed on your PC.
  • Knowledge: Intermediate (To follow this tutorial you should have good knowledge of the software and hence not every single step will be described)
  • Files: Download from here




Preparation
Set up slave servers
On Windows
On Linux
Define Slave Servers
Monitor Slave Servers
Distribution of the workload
Test the whole process

The raw data generated each year is increasing significantly. Nowadays we are dealing with huge amounts of data that have to be processed by our ETL jobs. Pentaho Data Integration/Kettle offers quite a few interesting features that allow clustered processing of data. The goal of this session is to explain how to spread the ETL workload across multiple slave servers. Matt Casters originally provided the ETL files and background knowledge. My way to thank him is to provide this documentation.
Note: This solution runs ETL jobs on a cluster in parallel and independently from each other. It's like saying: Do exactly the same thing on another server without bothering what the jobs on the other servers do. Kettle allows a different setup as well, where ETL processes running on different servers can send, for example, their result sets to a master (so that all data is combined) for sorting and further processing (we will not cover this in this session).

Preparation
Download the accompanying files from here and extract them in a directory of your choice. Keep the folder structure as is, with the root folder named ETL. If you inspect the folders, you will see the following:


  • an input directory, which stores all the files that our ETL process has to process
  • an archive directory, which will store the processed files
  • a PDI directory, which holds all the Kettle transformations and jobs



Create the ETL_HOME variable in the kettle.properties file (which can be found in C:\Users\dsteiner\.kettle\kettle.properties)
ETL_HOME=C\:\\ETL

Adjust the directory path accordingly.

Navigate to the simple-jndi folder in the data-integration directory. Open jdbc.properties and add the following lines at the end of the file (change them if necessary):

etl/type=javax.sql.DataSource
etl/driver=com.mysql.jdbc.Driver
etl/url=jdbc:mysql://localhost/etl
etl/user=root
etl/password=root

Next, start your local MySQL server and open your favourite SQL client. Run the following statements:

CREATE SCHEMA etl;
USE etl;
CREATE TABLE FILE_QUEUE
(
filename VARCHAR(256)
, assigned_slave VARCHAR(256)
, finished_status VARCHAR(20)
, queued_date DATETIME
, started_date DATETIME
, finished_date DATETIME
)
;

This table will allow us to keep track of the files that have to be processed.

Set up slave servers


For the purpose of this exercise, we keep it simple and have all slaves running on localhost. Setting up a proper cluster is out of the scope of this tutorial.
Kettle comes with a lightweight web server called Carte. All that Carte does is listen to incoming job or transformation calls and process them. For more info about Carte have a look at my Carte tutorial.
So all we have to do now is to start our Carte instances:

On Windows

On the command line, navigate to the data-integration directory within your PDI folder. Run the following command:
carte.bat localhost 8081


Open a new command line window and do exactly the same, but now for port 8082.
Do the same for ports 8083 and 8084.

On Linux

Open Terminal and navigate to the data-integration directory within your PDI folder. Run the following command:

sh carte.sh localhost 8081

Proceed by doing exactly the same, but now for ports 8082, 8083 and 8084.
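If you don't want to keep four Terminal windows open, you can also start all four instances from a single one (a minimal sketch; nohup keeps the Carte instances alive after you close the Terminal, and each instance logs to its own file):

# start four Carte slave servers on ports 8081 to 8084
for PORT in 8081 8082 8083 8084
do
  nohup sh carte.sh localhost $PORT > carte_$PORT.log 2>&1 &
done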

Now that all our slave servers are running, we can have a look at the ETL process ...

Populate File Queue

Start Kettle (spoon.bat or spoon.sh) and open Populate file queue.ktr (You can find it in the /ETL/PDI folder). If you have Kettle (Spoon) already running, restart it so that the changes can take effect and then open the file.

This transformation will get all filenames from the input directory (i.e. the files that haven't been processed yet) and store them in the FILE_QUEUE table.

Double click on the Get file names step and click Show filename(s) … . This should show all your file names properly:


Click Close and then OK.
Next, double click on the Lookup entry in FILE_QUEUE step. Click on Edit next to Connection. Then click Test to see if a working connection can be established (the connection details that you defined in the jdbc.properties file will be used):


Click OK three times.
Everything should now be configured for this transformation, so hit the Play/Execute button.

In your favourite SQL client run the following query to see the records that were inserted by the transformation:

SELECT
*
FROM
FILE_QUEUE
;




Once you have inspected the results, run:

DELETE FROM FILE_QUEUE ; 

The idea is that this process is scheduled to run continuously (every 10 seconds or so).
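Since crontab cannot schedule anything more often than once per minute, a simple endless loop in a shell script is one way to get an execution roughly every 10 seconds (just a sketch; the /ETL/PDI path and the log file name are assumptions, adjust them and the PDI directory to your own setup):

#!/bin/sh
# run the Populate file queue transformation roughly every 10 seconds
cd "/Applications/Dev/PDI 4.1 RC1"
while true
do
  ./pan.sh -file='/ETL/PDI/Populate file queue.ktr' -Level=Minimal >> /tmp/populate_file_queue.log 2>&1
  sleep 10
done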

Define Slave Servers

First off, we will create a table that stores all the details about our slave servers. In your favourite SQL client run the following statement:

CREATE TABLE SLAVE_LIST
(
id INT
, name VARCHAR(100)
, hostname VARCHAR(100)
, username VARCHAR(100)
, password VARCHAR(100)
, port VARCHAR(10)
, status_url VARCHAR(255)
, last_check_date DATETIME
, last_active_date DATETIME
, max_load INT
, last_load INT
, active CHAR(1)
, response_time INT
)
;

Open the Initialize SLAVE_LIST transformation in Kettle. The transformation allows you to easily configure our slave server definitions. We use the Data Grid step to define all the slave details:


  • id
  • name
  • hostname
  • username
  • password
  • port
  • status_url
  • last_check_date: keep empty, will be populated later on
  • last_active_date: keep empty, will be populated later on
  • max_load
  • last_load: keep empty, will be populated later on
  • active
  • response_time: keep empty, will be populated later on


As you can see, we use an Update step: This allows us to basically change the configuration at any time in the Data Grid step and then rerun the transformation.

Execute the transformation.

Run the following statement in your favourite SQL client:

SELECT * FROM SLAVE_LIST ;

Monitor Slave Servers

Features:

  • Checks the status of the active slave servers defined in the SLAVE_LIST
  • Checks whether they are still active
  • Calculates the load (number of active jobs) per slave server
  • Calculates the response time
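
Behind the scenes these checks are based on Carte's status page, which you can also call manually, e.g. for debugging (a sketch, assuming Carte's default cluster/cluster credentials; the returned XML contains, among other things, the list of running jobs):

# ask the first slave server for its status as XML
curl -u cluster:cluster "http://localhost:8081/kettle/status/?xml=Y"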

Open the following files in Kettle:

  • Slave server status check.kjb
    • Get SLAVE_LIST rows.ktr: Retrieves the list of defined slave servers (including all details) and copies the rows to the result set
    • Update slave server status.kjb: This job gets executed for each input row (= loop)
      • Set slave server variables.ktr: receives one input row at a time from the result set of Get SLAVE_LIST rows.ktr and defines a variable for each field, which can be used in the next transformation
      • Check slave server.ktr:
        • Checks whether the server is available (i.e. returns an OK status); if it is not available, it updates the SLAVE_LIST table with an inactive flag and the last_check_date.
        • If the server is available, it:
          • extracts the number of running jobs from the XML returned by the server
          • keeps the response time
          • sets an active flag
          • sets the last_check_date and last_active_date (both derived from the system date)
          • updates the SLAVE_LIST table with this info

Please find below the combined screenshots of the jobs and transformations:

So let's check if all our slave servers are available: Run Slave server status check.kjb, then run the following statement in your favourite SQL client:

SELECT * FROM SLAVE_LIST;

The result set should now look like the screenshot below, indicating that all our slave servers are active:

Distribution of the workload

So far we know the names of the files to be processed and the available servers, so the next step is to create a job that processes one file from the queue on each available slave server.

Open Process one queued file.kjb in Kettle.

This job does the following:


  • It checks if there are any slave servers available that can handle more work. This takes the current and maximum specified load (the number of active jobs) into account.
  • It loops over the available slave servers and processes one file on each server.
  • In the case that no slave servers are available, it waits 20 seconds.




Some more details about the job steps:

  • Any slave servers available?: This fires a query against our SLAVE_LIST table which will return a count of the available slave servers. If the count is bigger than 0, Select slave servers is the next step in the process.
  • Select slave servers: This transformation retrieves a list of available slave servers from the SLAVE_LIST table and copies it to the result set.
  • Process a file: The important point here is that this job will be executed for each result set row (double click on the step, click on the Advanced tab and you will see that Execute for each input row is checked).


Open Process a file.kjb. 

This job does the following:

  • For each input row from the previous result set, it sets the slave server variables.
  • It checks if there are any files to be processed by firing a query against our FILE_QUEUE table.
  • It retrieves the name of the next file that has to be processed from the FILE_QUEUE table (a SQL query with the limit set to 1) and sets the file name as a variable.
  • It updates the record for this file name in the FILE_QUEUE table with the start date and the assigned slave name.
  • The Load a file job is started on the selected slave server: Replace the dummy JavaScript step in this job with your normal ETL process that populates your data warehouse. Some additional hints:
    • The file name is available in the ${FILENAME} variable. It might be quite useful to store the file name in the target staging table.
    • In case the ETL process failed once before for this file, you can use this ${FILENAME} variable as well to automatically delete the corresponding record(s) from the FILE_QUEUE table and even from the staging table prior to execution.
    • Note that the slave server is set in the Job Settings > Advanced tab.
  • If the process is successful, the record in the FILE_QUEUE table gets flagged accordingly. The same happens in case the process fails.

Test the whole process

Let's verify that everything is working: Open Main Job.kjb in Kettle and click the Execute icon.

Now let's see if 4 files were moved to the archive folder:
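If you prefer the command line over a file browser, a quick directory listing does the job (assuming the ETL folder was extracted to /ETL; on Windows simply open the archive folder in Explorer):

# list the files that have been moved to the archive directory
ls -l /ETL/archive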

This looks good ... so let's also check if our FILE_QUEUE table has been updated correctly. Run the following statement in your favourite SQL client:

SELECT * FROM FILE_QUEUE;

Note that the last file (file5.csv) hasn't been processed. This is because we set up 4 slave servers and our job passes one file to each of them, i.e. 4 files per run. The last file will be processed with the next execution.

