Saturday, 27 November 2010

Kettle: Handling Dates with Regular Expression

Posted on 10:47 by Unknown

This is the second tutorial that focuses on using regular expressions with Pentaho Kettle (PDI), and it is again a demonstration of how powerful regular expressions actually are.

In this example we are dealing with a simplified data set containing date values. The problem is that the dates don't follow a standard format. We have the following values:

date
2010/1/2
2010/01/2
2010/1/02
2010/01/02
20100102

We assume for now that all values at least follow this basic order: year, month, day. Now we need to somehow generate a standard date in the format yyyy-MM-dd using the value from the date field as input.

Regular expressions are of much help here, as we can use the concept of capturing groups. Capturing groups allow us to retrieve the year, month and day parts of the date easily; we can then assemble our final date in the yyyy-MM-dd format. Now let's see how this is done:
For our convenience, we use the data grid step to store our dummy dates for testing:

Now we are ready to go ahead and work on our regular expression. Insert the Regex Evaluation step from the Scripting folder and configure it as shown in the screenshot below:
On the content tab make sure that you select "Permit whitespace and comments in pattern". Now let's have a look at the regular expression:

  1. The string must start with 4 digits. We enclose this definition in brackets to create the first capturing group. Note: I added #1 as a comment; comments help to mark the capturing groups for easy reference.
  2. Next we say that a separator (a slash or a dash) may or may not follow. This is our 2nd capturing group.
  3. I guess you get the idea for the remaining capturing groups. At the end we make sure that nothing else follows, hence the dollar sign. A rough reconstruction of the full pattern is sketched below.
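
Since the screenshot with the exact pattern is not reproduced here, the following is only a rough reconstruction of what the expression could look like in the whitespace-and-comments mode enabled above (adjust the separator character class to whatever your data actually contains):

^([0-9]{4})    # 1 year: four digits
([/-]?)        # 2 an optional separator
([0-9]{1,2})   # 3 month: one or two digits
([/-]?)        # 4 another optional separator
([0-9]{1,2})$  # 5 day: one or two digits

Groups 1, 3 and 5 then hold the year, month and day; the separator groups are just there to soak up the optional slash in values like 2010/1/2 versus 20100102.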
Once we have created our capturing groups, we can reference them in the "Capture Group Fields". First make sure that "Create fields from capture" is activated. Then fill out the fields below as shown.

This step will basically get the year, month and day parts. 

Now we have to put the date together. We can use the Formula step for this:

First we check if a value exists; if not, we set the field to all zeros. If a value exists, we check whether the month and day parts have a leading zero and add one if not. Finally, we put the whole date string together.
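
The Formula step itself is only shown as a screenshot in the original post, so here is merely a sketch of the same logic written for a Modified Java Script Value step instead, assuming the capture group fields are called year, month and day (these field names are my assumption):

// year, month and day are assumed to be the string fields
// created by the Regex Evaluation step above
var date_string;
if (year == null || ("" + year) == "") {
    // no value captured: fall back to the dummy date
    date_string = "0000-00-00";
} else {
    // add a leading zero to month and day where needed
    var m = "" + month;
    var d = "" + day;
    if (m.length == 1) m = "0" + m;
    if (d.length == 1) d = "0" + d;
    date_string = year + "-" + m + "-" + d;
}

Either way, the result is a string in the yyyy-MM-dd pattern that the following steps can work with.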

Now, it is not a very good idea to store non-standard 0000-00-00 dates in the database, hence we use the "Null if ..." step to set records with a 0000-00-00 date to null:


The final step is to convert the string to a proper date. You can do this with the "Select values" step: go to the Meta-data tab and fill it out as shown below:
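
The screenshot is missing here, but the idea is simply to pick the field holding the assembled date string on the Meta-data tab and change its type, roughly like this (the field name is just an example):

Fieldname: date_string    Type: Date    Format: yyyy-MM-dd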
Save the transformation and run it. Now let's have a look at the output:

You can see that it is quite simple to handle non-standardized data with regular expressions. You can download the transformation here.

Friday, 26 November 2010

Review "Pentaho Kettle Solutions"

Posted on 14:28 by Unknown

A short review of the "Pentaho Kettle Solutions" book

Matt Casters, Roland Bouman and Jos van Dongen's Kettle bible was released about 3 months ago and I finally managed to finish reading it (600+ pages!!!). 
This is the second book about Kettle: at the beginning of this year María Carina Roldán released "Pentaho 3.2 Data Integration: Beginner's Guide" (my review here), which is written in quite a hands-on, step-by-step approach. "Pentaho Kettle Solutions" is a more theoretical piece of work, mixed with a lot of examples (jobs and transformations that you can download from the accompanying website). It takes the bigger context and tries to map Kettle features to industry standards (namely Ralph Kimball's ETL subsystems). So instead of saying "Kettle can do this", the book talks about data warehousing requirements and identifies the Kettle steps that meet them. Important topics like creating star schemas and loading a data warehouse are explained. The following chapters walk you through working with OLAP data, the ETL development lifecycle, versioning, auditing, lineage, performance tuning, parallelization, clustering and partitioning, ... It explains how to access all the various data sources (including examples of SOAP, RSS, etc.), how to create your own Kettle plugin and so much more that I won't be able to fit it all into a short review. It is definitely the ultimate resource for anything to do with Kettle.

Now, which book about Kettle should you get? 
  • Answer 1: Both. These books are very different in what they are trying to bring across. They don't really overlap, so it makes sense to read both.
  • Answer 2: If you want a quick start in a practical, step-by-step fashion, get "Pentaho 3.2 Data Integration: Beginner's Guide".
  • Answer 3: If you want to understand the bigger picture, go for "Pentaho Kettle Solutions".

Overall, I have to say "THANKS" to the authors for providing such an extensive book about Kettle! It's been a pleasure to read and it's extremely informative! There is only one thing left to say: Go out and get it ;)

P.S.: Roland Bouman and Jos van Dongen are also the authors of "Pentaho Solutions", another excellent book that you want to have in your library.





Thursday, 18 November 2010

Pentaho Kettle Data Input: Pivoted Data

Posted on 07:58 by Unknown

I found quite an interesting example on the Pentaho forum about importing a pivoted data set with an unknown number of columns. Matt Casters demonstrated a neat solution for importing this data set. As this example is quite hidden away in the forum, I thought it would be a good idea to publish a short tutorial based on it.

Our data set looks like this (download here; the screenshot shows only part of the whole data set):
The data was originally retrieved from Euribor.

The "problem" with this data set is that we don't really know how many columns there are in advance (as the date is in the columns). We have to find some way to unpivot the data set. 

Matt's solution (published 2008) looks like this (download here):
Let us have a look at the steps:

Excel Input: In the Content tab make sure you disable the header. In the Fields tab, name the first field "type" and all the other ones something like "col1" to "col366".

A tip by Matt to speed up this process:
"One missing tip I'd like to add to your nice post is that it's really tedious to enter hundreds of nearly identical lines in the data grids in Kettle. To make this easier you can use a spreadsheet to help you. Simply copy one or more lines from a Kettle table (select the lines, click right and select copy lines) into the spreadsheet (click right in the spreadsheet and select paste).
Then you can select a line in the spreadsheet and drag the area down to allow col1, col2, col3 ... col366 to be generated automatically. Finally, copy/paste the rows back into Kettle including the header row."

That should make entering large numbers of columns a matter of a few minutes, not hours.
As we don't know in advance how many columns there are, the best method is to be prepared for the maximum number of columns. Make sure that you set all columns to the String format. Also, set the decimal and grouping signs accordingly. Once you have filled out all the necessary settings, hit "Preview rows" to make sure everything is interpreted as it should be.

Filter rows: Here we check that col1 is not null, which basically makes sure that the row contains any figures at all.

rownr: We add a sequence so that we know the row number of each data row.

Row Normaliser: Now we unpivot the data. We create a new grouping field called colnr to achieve this. Our data now looks like this:
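
Roughly speaking (with made-up illustrative values rather than the actual Euribor figures), the Row Normaliser turns each wide row into one narrow row per column number, with the first row (rownr 1) carrying the date strings:

type     rownr  colnr  data
...      1      1      2010/11/01 00:00:00.000
...      1      2      2010/11/02 00:00:00.000
1 week   2      1      0.75
1 week   2      2      0.76
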
No empty data: In this step we want to discard any row that has no data.

Grab a date: This short JavaScript basically makes sure that each record gets the correct date assigned:

var store;
var coldate;

if (rownr == 1) {
    if (colnr == 1) {
        store = new Array(); // only init once!
    }
    // Store the date for this column number
    // (store persists across rows, so the dates captured from
    // the first row can be looked up for every following row)
    store[colnr] = str2date(data, "yyyy/MM/dd HH:mm:ss.SSS");
}

coldate = store[colnr];
Drop first data row: Now that we have the date available for each record, we can drop the records that came from the first row of the original data set.

The last two steps basically choose which fields we want to keep and do a proper conversion to the respective data types.

Friday, 12 November 2010

Using regular expressions with Pentaho Data Integration (Kettle)

Posted on 07:16 by Unknown

There are quite a few transformation steps that allow you to work with regular expressions: Replace in string, Regex Evaluation and Modified Java Script Value, to name a few.

Today we will look at one (pretty unrealistic) example using the "Replace in string" step. Basically, we will cut the first word off a text (note: this can be achieved with other steps as well, but as we want to focus on regular expressions, this approach was chosen):

Our first step is of the type "Data Grid", which easily allows us to define a data set to work with. In our case, we create a data point called "line_text" and add values to it: in the Meta tab you define your data points and in the Data tab the actual values. Hit Preview to get an idea of what your data set looks like (no surprises here).
To cut the first word off our line_text values, we use a "Replace in string" step. Choose line_text as "In stream field". You can leave "Out stream field" empty in case you want the output to go to line_text (so the original value will be overwritten with the new one). If you want to keep the old value, define a new data point name here.
For this example we will use a regular expression, so make sure you set "use RegEx" to Y. We will use the following expression:

^([a-zA-Z]+)(\s{1})(.*)$

This expression makes use of capturing groups (defined by brackets), which allow us to specify the part that we want to keep. What this regular expression says is basically the following: the value has to start (indicated by ^) with any combination of uppercase or lowercase letters (one or more of them, indicated by +). This is followed by exactly one space character (indicated by \s{1}), after which any number of characters (or none) can follow. We could probably improve this expression further, but for now it does what it is supposed to do.

In the "Replace with" field we enter $3, which basically tells Kettle that we want to keep the part that is retrieved by the 3rd capturing group.

Finally, we output the results to a text file. The output looks like this:


You can find some info about using regular expressions here and here and here and ... 
You can find the XML code for the transformation here.

If you are creating extremely long regular expressions, named capturing groups are quite useful: as the name indicates, you can assign a name to a capturing group and use this name later on in the back reference. Currently, Kettle doesn't support named capturing groups, but I am sure a future version will cover this.
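
For illustration, in regex flavours that do support them (Java 7+, for example) a named capturing group looks roughly like this:

^(?<firstword>[a-zA-Z]+)(\s{1})(?<rest>.*)$

The groups firstword and rest could then be referenced by name instead of by their position.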

Monday, 1 November 2010

PDI Kettle Plugins

Posted on 14:18 by Unknown

Pentaho Data Integration Plugins 

Agile BI

This extremely useful plugin can be downloaded from the Pentaho website (plugin for PDI 4, plugin for PDI 4.1 RC1). [As this plugin is not open source, PDI doesn't have it installed by default. Pentaho was kind enough to make it available for the community version for free.]
  1. Unzip the file into the data-integration/plugins/spoon directory. It will create a folder in there named agile-bi. Start Spoon and the new capabilities will automatically be available.
  2. Once you have done this, create a transformation that outputs the data to a completely denormalized table. 
  3. Once there is data in this table, right-click on the Table output step and choose Model. 
  4. In the model view, you can click the "Auto populates model with default dimensions and measures" icon. If this doesn't do a decent job of generating your model, you can always change it manually. 
  5. Once you have properly prepared your model, save it and return to the data integration perspective. 
  6. Right-click on the Table output step again and choose Visualize > Analyzer. In this perspective you can fully dig into your data and discover any problems. For example, my data set has a country data point. The values are supposed to be full country names, but I realize that somehow "AR" shows up in my data instead of "Argentina". So I can go back to the data integration perspective, make the necessary changes to the transformation, save it, run it again, go back to the Analyzer, refresh the data and see that now all my country values are valid. This is an absolute time saver and a very efficient approach to quality-checking your data. 

There are a couple of other things you can do with this plugin as well, e.g. create a report with the wizard known from the Report Designer.

Kettle Franchising Factory

This is a very interesting project that gives you a framework for ETL development. The project description reads as follows:

The Kettle Franchising Factory (KFF) adds on top of the existing kettle platform the necessary tools to open multiple data integration restaurants in a rapid, flexible and organised way. KFF allows you to deploy a large series of data integration solutions (multi-customer, multi-solution) in a fully standardized way.

KFF is composed of:
  • Kettle plugins
  • re-usable transformations/jobs
  • a logging/scheduling framework
  • standards
  • naming conventions
  • best practices for set-up
  • directory structures


I hope that we see further development on it. You can find a presentation about it here and download it here.

Kettle Cookbook

Another very promising project, initiated by Roland Bouman (co-author of the excellent Pentaho Solutions book): this is not really a plugin but a job that auto-generates documentation based on the descriptions you added to your steps, jobs, etc. So there are no excuses any more not to create documentation! Have a look here for more info.

Pentaho Report Output Step

Matt Casters made this step available: it allows you to pass data points to a PRPT (Pentaho Report). You can specify where the report template is located, to which directory and in which format the report should be output, and also specify report parameters. You can use this in simple scenarios where you just want to output a single report, or in a more complex fashion for report bursting, for example.
Please find more information about it here.

Excel 2007 XLSX Output Step

Slawo was kind enough to provide this step, which will be of much use if you mainly work with newer versions of MS Office. You can find more info here.

As you can see, plugins can add some interesting features to Kettle that facilitate our work enormously!


Tuesday, 28 September 2010

Mondrian MDX and Schema Validation Differences between PRD and Schema Workbench

Posted on 05:47 by Unknown
PRD 3.6.1 doesn't seem to behave the same way when processing MDX queries and XML schemas as other tools (Schema Workbench, JPivot, etc.).


For example, CurrentDateMember([Date.Weekly Calendar], '[yyyy]\.[ww]').Lag(3.0) works perfectly in Schema Workbench, but not in PRD. PRD requires a more precise style: CurrentDateMember([Date.Weekly Calendar], '["Date.Weekly Calendar"]\.[yyyy]\.[ww]').Lag(3.0). So whereas Schema Workbench and JPivot can figure out where to find the year and week, PRD needs the exact path.


Another example: the formula of a calculated member looks like this in the schema: [Subscription Base Revenue Share Deals (Actual)]/[Subscription Base (Actual)]. If you use JPivot or Schema Workbench, everything works perfectly. But PRD requires the formula to be like this: [Measures].[Subscription Base Revenue Share Deals (Actual)]/[Measures].[Subscription Base (Actual)]. 

In both examples, the syntax that PRD requires is actually the more precise one. I would really appreciate it if Schema Workbench (version 3.2.0.13661) had the same strict validation; this would avoid a lot of confusion as to why we get so many more error messages in PRD.

When discussing this topic on the Mondrian developer mailing list, Julian Hyde commented the following:
"It looks like PRD is using mondrian to validate formulas. I suspect that it is an earlier version of Mondrian, which had weaker validation rules. I don't recall why we made the change, but people will log bugs that MDX succeeds in SSAS and fails in mondrian, and we will (rightly) change mondrian.
Qualifying members with their dimension & hierarchy name is recommended. Mondrian can resolve members faster if you do.
We can't give an error if people don't qualify member names. But should we emit a warning if someone writes [Store Sales] / [Store Cost] in a formula? I don't have a strong opinion either way."

As Thomas pointed out in the comment below, have a look at the mondrian.properties file located in the PRD folder report-designer\resources. You can find various settings there, like this one:

mondrian.olap.elements.NeedDimensionPrefix=true

This seems to be the one that stopped my "not so accurate" MDX queries from running. I do not recommend changing this setting though; instead, write precise MDX queries and make sure that the calculated members in your schema have the complete reference as well.

UPDATE 2010/10/04:

Make sure the PRD mondrian.properties file is set up the same way as the BI Server and Schema Workbench ones.

Open prd-ce-3.6.1-stable\report-designer\resources\mondrian.properties and look for the settings shown below (these ones, especially the first one, will have a major impact):


###############################################################################
# Property determines if elements of dimension (levels, hierarchies, members)
# need to be prefixed with dimension name in MDX query.
#
# For example when the property is true, the following queries
# will error out. The same queries will work when this property
# is set to false.
#     * select {[M]} on 0 from sales
#     * select {[USA]} on 0 from sales
#     * select {[USA].[CA].[Santa Monica]}  on 0 from sales
#
# When the property is set to true, any query where elements are
# prefixed with dimension name as below will work
#     * select {[Gender].[F]} on 0 from sales
#     * select {[Customers].[Santa Monica]} on 0 from sales
#
# Please note that this property does not govern the behaviour where in
#     * [Gender].[M]
# is resolved into a fully qualified
#     * [Gender].[All Gender].[M]
#
# In a scenario where the schema is very large and dimensions have large
# number of members a MDX query that has a invalid member in it will cause
# mondrian to to go through all the dimensions, levels, hierarchies, members
# and properties trying to resolve the element name. This behaviour consumes
# considerable time and resources on the server. Setting this property to
# true will make it fail fast in a scenario where it is desirable
#
mondrian.olap.elements.NeedDimensionPrefix=true

Don't change this one: it's important that your schema and MDX use properly qualified syntax.

###############################################################################
# Boolean property indicating whether errors related to non-existent members
# should be ignored during schema load. If so, the non-existent member is 
# treated as a null member.
mondrian.rolap.ignoreInvalidMembers=true
Change to: mondrian.rolap.ignoreInvalidMembers=false


###############################################################################
# Boolean property indicating whether errors related to non-existent members
# should be ignored during query validation. If so, the non-existent member is 
# treated as a null member.
mondrian.rolap.ignoreInvalidMembersDuringQuery=true
Change to: mondrian.rolap.ignoreInvalidMembersDuringQuery=false

Now open psw-ce-3.2.0.13661\schema-workbench\mondrian\properties and add the properties shown above from the PRD properties file (if these properties already exist, amend them so that they are set in exactly the same way). 
In my example mondrian.rolap.ignoreInvalidMembers was already in the Schema Workbench mondrian.properties file and properly set to "false" (so no changes necessary). I also added the two other properties.



Tuesday, 17 August 2010

Pentaho Solutions - The Pentaho bible

Posted on 12:59 by Unknown
I finally finished reading the 600-page book "Pentaho Solutions" (by Roland Bouman and Jos van Dongen). I don't want to repeat what all the other reviews already state in detail, so let me just say this: if you are new to the Pentaho BI world, this is the book you should buy to get to know all the components that make up the Pentaho BI suite. The book doesn't stop there, but also gives you a primer on BI and data warehouse design in general, an introduction to SQL, OLAP and MDX, and much more ... it's really everything you need to set up proper BI solutions! It is also a great resource for people who have been working with Pentaho for a bit longer.
I do hope that Roland and Jos will keep this Pentaho bible up-to-date with all future major versions of the Pentaho BI Suite.


After reading this book I am really looking forward to the forthcoming book "Pentaho Kettle Solutions" (also written by Roland and Jos in cooperation with Matt Casters). 