Tuesday, 15 February 2011

Pentaho Data Integration: Best practice solutions for working with huge data sets

Posted on 07:39 by Unknown

Assign enough memory

Open pan.sh and kitchen.sh (and spoon.sh if you use the GUI) in a text editor and increase the Java heap size to as much memory as your server can spare.
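For example, in the PDI 4.x shell scripts the maximum heap is set via the -Xmx option on the OPT= line (the exact line varies between versions). Raising it from the typical default of

    -Xmx512m

to, say,

    -Xmx4096m

gives the JVM 4 GB of heap, assuming your server has that much RAM to spare; leave the rest of the line unchanged.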

Data input

Data from a database table can only be imported as fast as the database allows, and you cannot run multiple copies of a database table input step.
Text files are more convenient, as you can copy them across servers (your cluster) and read them in simultaneously. Another really cool Kettle feature is that you can read text files in parallel: tick the "Run in parallel" option in the step configuration and specify the number of copies to start via the step's context menu. How does this work? If your file is 4 GB in size and you specified 4 copies of the text file input step, each copy reads a chunk of the file at the same time. To be more precise: the first copy starts reading at the beginning of the file, the second copy starts at the first line found after the 1 GB mark, the third copy at the first line found after the 2 GB mark, and so on.

Run multiple step copies (Scale up)

First find out how many cores your server has, as it doesn't make sense to assign more copies than there are cores available.
Linux: cat /proc/cpuinfo
Count how many times "processor" is listed (the first processor has the id 0, so the highest id is one less than the core count).
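Alternatively, let grep do the counting for you:

    grep -c '^processor' /proc/cpuinfo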

Make sure you test your transformations first! You might get unexpected results (see my other blog post for more details).

Usually you get better performance by specifying the same number of copies for consecutive steps where possible: this creates dedicated data pipelines, so Kettle doesn't have to do the work of distributing the rows round-robin across the other step copies.

Run a cluster (Scale out)

You can run your ETL job on multiple slave servers: Kettle allows you to define a cluster via the GUI. For this to work you have to set up Carte first, which ships with PDI (see my other blog post for details); a rough sketch follows below.
Of course, you can combine the scale-out and scale-up approaches.
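As a rough sketch (host and port are just examples): Carte is started on each slave from the command line, e.g.

    sh carte.sh 192.168.1.101 8081

and the slave servers and cluster schema are then defined in Spoon and assigned to the steps you want to run clustered.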

Adjust Sort Rows step

Make sure you set a sensible limit for the Sort size (rows in memory) setting and/or the Free memory threshold (in %) setting.
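As a rough illustration only (the right values depend entirely on your row width and the heap you assigned), the step settings might look like:

    Sort size (rows in memory):   1000000
    Free memory threshold (in %): 25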

You can run the Sort Rows step in multiple copies, but make sure you add a Sorted Merge step after it; it merges the pre-sorted streams from the copies into one sorted stream on the fly.

Use UDJC instead of JavaScript

Java code executes faster than JavaScript, so use the User Defined Java Class step instead of the JavaScript step where possible.
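For illustration, here is a minimal sketch of what a UDJC step body can look like. It assumes an incoming String field "amount" and a new output field "amount_net" defined in the step's Fields tab; the field names and the calculation are made up for this example.

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
    {
        Object[] r = getRow();              // fetch the next input row
        if (r == null) {                    // no more input: signal that we are done
            setOutputDone();
            return false;
        }

        // make room for the additional output field(s)
        r = createOutputRow(r, data.outputRowMeta.size());

        // read the incoming field and derive the new value
        String amount = get(Fields.In, "amount").getString(r);
        double amountNet = Double.parseDouble(amount) * 0.8;   // example calculation only

        // write the derived value into the output field and pass the row on
        get(Fields.Out, "amount_net").setValue(r, Double.valueOf(amountNet));
        putRow(data.outputRowMeta, r);

        return true;                        // ask Kettle to call processRow() again
    }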
