Thursday, 18 November 2010

Pentaho Kettle Data Input: Pivoted Data

Posted on 07:58 by Unknown


I found quite an interesting example on the Pentaho forum about importing a pivoted data set with an unknown number of columns. Matt Casters demonstrated a neat solution for importing this data set. As the example is somewhat hidden away in the forum, I thought it would be a good idea to publish a short tutorial based on it.

Our data set looks like this (download here) (Screenshot shows only part of the whole data set):
The data was originally retrieved from Euribor.

The "problem" with this data set is that we don't know in advance how many columns there are (as the dates are in the columns). We have to find some way to unpivot the data set.
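To make the unpivot operation concrete, here is a minimal sketch in plain JavaScript (not Kettle code; the sample tenors and values are invented for illustration) that turns the wide, date-per-column layout into one record per (date, type, value) cell:

```javascript
// Wide input: the first row carries the dates, each following row
// carries one rate type. Sample values are invented.
var rows = [
  ["date", "2010/11/01", "2010/11/02"],
  ["1w",   "0.81",       "0.82"],
  ["1m",   "0.95",       "0.96"]
];

// Unpivot: emit one record per (type, date) cell.
var dates = rows[0].slice(1);          // header row holds the dates
var records = [];
for (var r = 1; r < rows.length; r++) {
  var type = rows[r][0];
  for (var c = 1; c < rows[r].length; c++) {
    records.push({ date: dates[c - 1], type: type, value: rows[r][c] });
  }
}
// records[0] is { date: "2010/11/01", type: "1w", value: "0.81" }
```

In the Kettle transformation below, the Row Normaliser step does this reshaping and a small JavaScript step attaches the right date to each record.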

Matt's solution (published 2008) looks like this (download here):
Let us have a look at the steps:

Excel Input: In the Content tab, make sure you disable the header. In the Fields tab, name the first field "type" and the remaining ones "col1" through "col366".

A tip by Matt to speed up this process:
"One missing tip I'd like to add to your nice post is that it's really tedious to enter hundreds of nearly identical lines in the data grids in Kettle. To make this easier you can use a spreadsheet to help you. Simply copy one or more lines from a Kettle table (select the lines, click right and select copy lines) into the spreadsheet (click right in the spreadsheet and select paste).
Then you can select a line in the spreadsheet and drag the area down to allow col1, col2, col3 ... col366 to be generated automatically. Finally, copy/paste the rows back into Kettle including the header row."

That should make entering large numbers of columns a matter of a few minutes, not hours.
As we don't know in advance how many columns there are, the best approach is to be prepared for the maximum number of columns. Make sure that you set all columns to the format String. Also, set the decimal and grouping signs accordingly. Once you have filled out all the necessary settings, hit "Preview rows" to make sure everything is interpreted as it should be.
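The decimal and grouping signs matter because, depending on the locale of the export, a number may arrive as e.g. "1.234,56". As a rough illustration of the conversion involved (plain JavaScript, not Kettle; the function name and sample strings are my own), assuming "." as grouping sign and "," as decimal sign:

```javascript
// Convert a locale-formatted number string to a JavaScript number.
// Assumes "." is the grouping sign and "," the decimal sign; adjust
// the pattern to match your own data.
function parseLocaleNumber(s) {
  return parseFloat(s.replace(/\./g, "").replace(",", "."));
}

var grouped = parseLocaleNumber("1.234,56"); // 1234.56
var simple  = parseLocaleNumber("0,81");     // 0.81
```

In the transformation itself you don't write this by hand: setting the signs on the step lets Kettle perform the equivalent conversion for you.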

Filter rows: Here we check that col1 is not null, which ensures that the row actually contains figures.

rownr: We add a sequence so that we know the row number of each data row.

Row Normaliser: Now we unpivot the data. We create a new grouping field called colnr to achieve this. Our data now looks like this:
No empty data: In this step we discard any row that has no data.

Grab a date: This short JavaScript step makes sure that each record gets the correct date assigned:

var store;   // holds one date per column number
var coldate;

if (rownr == 1) {
  if (colnr == 1) {
    store = new Array(); // only init once!
  }
  // Store the date for the column number
  store[colnr] = str2date(data, "yyyy/MM/dd HH:mm:ss.SSS");
}

coldate = store[colnr];

Drop first data row: Now that we have the date available for each record, we can drop the records that we retrieved from the first row of the original data set.
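The effect of the date-lookup script can be simulated outside Kettle on a small normalised stream (the rownr/colnr/data values below are invented; str2date is left out so the sketch runs in plain JavaScript):

```javascript
// Each element mimics one row leaving the Row Normaliser:
// rownr = original spreadsheet row, colnr = original column.
var stream = [
  { rownr: 1, colnr: 1, data: "2010/11/01 00:00:00.000" },
  { rownr: 1, colnr: 2, data: "2010/11/02 00:00:00.000" },
  { rownr: 2, colnr: 1, data: "0.81" },
  { rownr: 2, colnr: 2, data: "0.82" }
];

var store = [];
var out = [];
stream.forEach(function (row) {
  if (row.rownr === 1) {
    store[row.colnr] = row.data;   // remember the date for this column
  }
  out.push({ colnr: row.colnr, coldate: store[row.colnr], data: row.data });
});
// Every later row now carries the date of its column in "coldate",
// so the rows with rownr == 1 can safely be dropped afterwards.
```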

The last two steps select the fields we want to keep and convert them to their proper data types.