Tuesday, 15 February 2011

Pentaho Data Integration: Best practice solutions for working with huge data sets

Posted on 07:39 by Unknown

Assign enough memory

Open pan.sh and kitchen.sh (and spoon.sh if you use the GUI) in a text editor and increase the Java heap size to as much memory as your server can spare.
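For example, in the PDI 4.x shell scripts the maximum heap is set via the -Xmx option on the OPT= line (the exact line varies between versions). Raising it from the typical default of

    -Xmx512m

to, say,

    -Xmx4096m

gives the JVM 4 GB of heap, assuming your server has that much RAM to spare; leave the rest of the line unchanged.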

Data input

Data from a database table can only be imported as fast as the database allows, and you cannot run multiple copies of a database table input step.
Text files are more convenient, as you can copy them across servers (your cluster) and read them in simultaneously. Another really cool Kettle feature is that you can read text files in parallel: tick the "Run in parallel" option in the step configuration and specify the number of copies to start via the step's context menu. How does this work? If your file is 4 GB in size and you specified 4 copies of the text file input step, each copy reads a chunk of the file at the same time. To be more precise: the first copy starts reading at the beginning of the file, the second copy starts at the first line found after the 1 GB mark, the third copy at the first line found after the 2 GB mark, and so on.

Run multiple step copies (Scale up)

First find out how many cores your server has, as it doesn't make sense to assign more copies than there are cores available.
Linux: cat /proc/cpuinfo
Count how many times "processor" is listed (the first processor has the id 0, so the highest id is one less than the core count).
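Alternatively, let grep do the counting for you:

    grep -c '^processor' /proc/cpuinfo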

Make sure you test your transformations first! You might get unexpected results (see my other blog post for more details).

Usually you get better performance by specifying the same number of copies for consecutive steps where possible: this creates dedicated data pipelines, so Kettle doesn't have to do the work of distributing the rows round-robin across the other step copies.

Run a cluster (Scale out)

You can run your ETL job on multiple slave servers: Kettle allows you to define a cluster via the GUI. For this to work you have to set up Carte first, which ships with PDI (see my other blog post for details); a rough sketch follows below.
Of course, you can combine the scale-out and scale-up approaches.
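As a rough sketch (host and port are just examples): Carte is started on each slave from the command line, e.g.

    sh carte.sh 192.168.1.101 8081

and the slave servers and cluster schema are then defined in Spoon and assigned to the steps you want to run clustered.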

Adjust Sort Rows step

Make sure you set a sensible limit for the Sort size (rows in memory) setting and/or the Free memory threshold (in %) setting.
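As a rough illustration only (the right values depend entirely on your row width and the heap you assigned), the step settings might look like:

    Sort size (rows in memory):   1000000
    Free memory threshold (in %): 25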

You can run the Sort Rows step in multiple copies, but make sure you add a Sorted Merge step after it; it merges the pre-sorted streams from the copies into one sorted stream on the fly.

Use UDJC instead of JavaScript

Java code executes faster than JavaScript, so use the User Defined Java Class step instead of the JavaScript step where possible.
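For illustration, here is a minimal sketch of what a UDJC step body can look like. It assumes an incoming String field "amount" and a new output field "amount_net" defined in the step's Fields tab; the field names and the calculation are made up for this example.

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
    {
        Object[] r = getRow();              // fetch the next input row
        if (r == null) {                    // no more input: signal that we are done
            setOutputDone();
            return false;
        }

        // make room for the additional output field(s)
        r = createOutputRow(r, data.outputRowMeta.size());

        // read the incoming field and derive the new value
        String amount = get(Fields.In, "amount").getString(r);
        double amountNet = Double.parseDouble(amount) * 0.8;   // example calculation only

        // write the derived value into the output field and pass the row on
        get(Fields.Out, "amount_net").setValue(r, Double.valueOf(amountNet));
        putRow(data.outputRowMeta, r);

        return true;                        // ask Kettle to call processRow() again
    }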
