
Friday, 12 November 2010


Using regular expressions with Pentaho Data Integration (Kettle)

There are quite a few transformation steps that allow you to work with regular expressions: Replace in string, Regex Evaluation and Modified Java Script Value, to name a few.

Today we will look at one (pretty unrealistic) example using the "Replace in string" step. Basically, we will cut the first word off a text (Note: This can be achieved by using other steps as well, but as we want to focus on regular expressions, this approach was chosen):

Our first step is of the type "Data Grid", which allows us to easily define a data set to work with. In our case, we create a data point called "line_text" and add values to it: in the Meta tab you define your data points, and in the Data tab the actual values. Hit Preview to get an idea of what your data set looks like (no surprises here).
To cut the first word off our line_text values, we use a "Replace in string" step. Choose line_text as "In stream field". You can leave "Out stream field" empty if you want the output to go to line_text (so the original value will be overwritten with the new one). If you want to keep the old value, define a new data point name here.
For this example we will use a regular expression, so make sure you set "use RegEx" to Y. We will use the following expression:

^([a-zA-Z]+)(\s{1})(.*)$

This expression makes use of capturing groups (defined by parentheses), which allow us to specify the part that we want to keep. What this regular expression says is basically the following: the value should start (indicated by ^) with any combination of uppercase or lowercase letters (one or more of them, indicated by +). This is followed by exactly one whitespace character (indicated by \s{1}), after which any number of characters (or none at all) can follow. We could probably improve this expression further, but for now it does what it is supposed to do.

In the "Replace with" field we enter $3, which basically tells Kettle that we want to keep the part that is retrieved by the 3rd capturing group.
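If you want to sanity-check the expression before wiring it into the step, you can try it in any JavaScript console or in a Modified Java Script Value step (a minimal sketch with a made-up sample value; Kettle itself uses Java regular expressions, but for a simple pattern like this the behaviour is the same):

var line_text = "cut the first word off";
// $3 refers to the third capturing group, i.e. everything after the first word and the space
var result = line_text.replace(/^([a-zA-Z]+)(\s{1})(.*)$/, "$3");
// result is now "the first word off"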

Finally, we output the results to a text file. The output looks like this:


You can find some more info about using regular expressions here and here and here and ...
You can find the XML code for the transformation here.

If you are creating extremely long regular expressions, named capturing groups are quite useful. As the name indicates, you can assign a name to the capturing group and use this name later on in the back reference. Currently, Kettle doesn't support named capturing groups, but I am sure a future version will cover this.
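Just as an illustration of the syntax (this is not something the Kettle steps accepted at the time of writing; modern Java and JavaScript regex engines support it), a named-group version of the expression above could look like this:

var line_text = "cut the first word off";
// the groups are called "first" and "rest"; the replacement refers to "rest" by name
var result = line_text.replace(/^(?<first>[a-zA-Z]+)\s(?<rest>.*)$/, "$<rest>");
// result is now "the first word off"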

Monday, 1 November 2010

PDI Kettle Plugins


Pentaho Data Integration Plugins 

Agile BI

This extremely useful plugin can be downloaded from the Pentaho website (plugin for PDI 4, plugin for PDI 4.1 RC1). [Because this plugin is not open source, PDI doesn't ship with it by default. Pentaho was kind enough to make it available for free for the community version.]
  1. Unzip the file into the data-integration/plugins/spoon directory. It will create a folder in there named agile-bi. Start Spoon and the new capabilities will automatically be available.
  2. Once you have done this, fire up Spoon, create a transformation which outputs the data to a completely denormalized table. 
  3. Once there is data in this table, right click on the table output step, choose Model. 
  4. In the model view, you can click the "Auto populates model with default dimensions and measures" icon. If this doesn't do a decent job generating your model, you can always change it manually. 
  5. Once you have properly prepared your model, save it and return to the data integration perspective. 
  6. Right click on the table output step again and choose Visualize > Analyzer. In this perspective you can fully dig into your data and discover any problems. For example, my data set has a country data point. The values are supposed to be full country names, but I realize that somehow "AR" shows up in my data instead of "Argentina". So I can go back to the data integration perspective, make the necessary changes to the transformation, save it, run it again, go back to the Analyzer, refresh the data, and see that all my country values are now valid. This is an absolute time saver and a very efficient approach to quality-checking your data.

There are a couple of other things you can do with this plugin as well, e.g. create a report with the wizard known from the Report Designer.

Kettle Franchising Factory

This is a very interesting project to give you a framework for ETL development. The project description reads as follows:

The Kettle Franchising Factory (KFF) adds on top of the existing kettle platform the necessary tools to open multiple data integration restaurants in a rapid, flexible and organised way. KFF allows you to deploy a large series of data integration solutions (multi-customer, multi-solution) in a fully standardized way.

KFF is composed of:
  • Kettle plugins
  • re-usable transformations/jobs
  • logging/scheduling framework
  • standards
  • naming conventions
  • best practices for set-up
  • directory structures


I hope that we see further development on it. You can find a presentation about it here and download it here.

Kettle Cookbook

Another very promising project, initiated by Roland Bouman (co-author of the excellent Pentaho Solutions books): this is not really a plugin, but a job that auto-generates documentation based on the descriptions you added to your steps, jobs, etc. So there are no excuses any more not to create documentation! Have a look here for more info.

Pentaho Report Output Step

Matt Casters made this step available: it allows you to pass data points to a PRPT (Pentaho report). You can specify where the report template is located, to which directory and in which format the report should be output, and also specify report parameters. You can use this in simple scenarios where you just want to output a single report, or in a more complex fashion for report bursting, for example.
Please find more information about it here.

Excel 2007 XLSX Output Step

Slawo was kind enough to provide this step, which will be of much use if you mainly work with newer versions of MS Office. You can find more info here.

As you can see, plugins can add some interesting features to Kettle that facilitate our work enormously!


Tuesday, 28 September 2010

Mondrian MDX and Schema Validation: Differences between PRD and Schema Workbench

PRD 3.6.1 doesn't seem to process MDX queries and XML schemas the same way as other tools (Schema Workbench, JPivot, etc.).


For example, CurrentDateMember([Date.Weekly Calendar], '[yyyy]\.[ww]').Lag(3.0) works perfectly in Schema Workbench, but not in PRD. PRD requires a more precise notation: CurrentDateMember([Date.Weekly Calendar], '["Date.Weekly Calendar"]\.[yyyy]\.[ww]').Lag(3.0). So whereas Schema Workbench and JPivot can figure out where to find the year and week, PRD needs the exact path.


Another example: the formula of a calculated member looks like this in the schema: [Subscription Base Revenue Share Deals (Actual)]/[Subscription Base (Actual)]. If you use JPivot or Schema Workbench, everything works perfectly. But PRD requires the formula to be written like this: [Measures].[Subscription Base Revenue Share Deals (Actual)]/[Measures].[Subscription Base (Actual)].

In both examples, the syntax that PRD requires is the more precise one. I would really appreciate it if Schema Workbench (version 3.2.0.13661) applied the same strict validation as well. This would avoid a lot of confusion as to why we get so many more error messages in PRD.

When discussing this topic on the Mondrian developer mailing list, Julian Hyde commented the following:
"It looks like PRD is using mondrian to validate formulas. I suspect that it is an earlier version of Mondrian, which had weaker validation rules. I don't recall why we made the change, but people will log bugs that MDX succeeds in SSAS and fails in mondrian, and we will (rightly) change mondrian.
Qualifying members with their dimension & hierarchy name is recommended. Mondrian can resolve members faster if you do.
We can't give an error if people don't qualify member names. But should we emit a warning if someone writes [Store Sales] / [Store Cost] in a formula? I don't have a strong opinion either way."

As Thomas pointed out in the comment below, have a look at the mondrian.properties file located in the PRD folder report-designer\resources. You will find various settings there, like this one:

mondrian.olap.elements.NeedDimensionPrefix=true

This seems to be the setting that stopped my "not so accurate" MDX queries from running. I do not recommend changing it, though; instead, write precise MDX queries and make sure that the calculated members in your schema have the complete reference as well.

UPDATE 2010/10/04:

Make sure the PRD mondrian.properties file is set up the same way as the BI Server and Schema Workbench ones.

Open prd-ce-3.6.1-stable\report-designer\resources\mondrian.properties and look for the settings shown below (these ones, especially the first one, will have a major impact):


###############################################################################
# Property determines if elements of dimension (levels, hierarchies, members)
# need to be prefixed with dimension name in MDX query.
#
# For example when the property is true, the following queries
# will error out. The same queries will work when this property
# is set to false.
#     * select {[M]} on 0 from sales
#     * select {[USA]} on 0 from sales
#     * select {[USA].[CA].[Santa Monica]}  on 0 from sales
#
# When the property is set to true, any query where elements are
# prefixed with dimension name as below will work
#     * select {[Gender].[F]} on 0 from sales
#     * select {[Customers].[Santa Monica]} on 0 from sales
#
# Please note that this property does not govern the behaviour where in
#     * [Gender].[M]
# is resolved into a fully qualified
#     * [Gender].[All Gender].[M]
#
# In a scenario where the schema is very large and dimensions have large
# number of members a MDX query that has a invalid member in it will cause
# mondrian to to go through all the dimensions, levels, hierarchies, members
# and properties trying to resolve the element name. This behaviour consumes
# considerable time and resources on the server. Setting this property to
# true will make it fail fast in a scenario where it is desirable
#
mondrian.olap.elements.NeedDimensionPrefix=true

Don't change this one: it's important that your schema and MDX have a properly qualified syntax.

###############################################################################
# Boolean property indicating whether errors related to non-existent members
# should be ignored during schema load. If so, the non-existent member is 
# treated as a null member.
mondrian.rolap.ignoreInvalidMembers=true
Change to: mondrian.rolap.ignoreInvalidMembers=false


###############################################################################
# Boolean property indicating whether errors related to non-existent members
# should be ignored during query validation. If so, the non-existent member is 
# treated as a null member.
mondrian.rolap.ignoreInvalidMembersDuringQuery=true
Change to: mondrian.rolap.ignoreInvalidMembersDuringQuery=false

Now open psw-ce-3.2.0.13661\schema-workbench\mondrian.properties and add the properties shown above from the PRD properties file (if these properties already exist, amend them so that they are set exactly the same way).
In my example mondrian.rolap.ignoreInvalidMembers was already in the Schema Workbench mondrian.properties file and properly set to "false" (so no changes necessary). I also added the two other properties.



Tuesday, 17 August 2010

Pentaho Solutions - The Pentaho bible

I finally finished reading the 600-page book "Pentaho Solutions" (by Roland Bouman and Jos van Dongen). I don't want to repeat what all the other reviews already state in detail, so let me just say this: if you are new to the Pentaho BI world, this is the book you should buy to get to know all the components that make up the Pentaho BI suite. The book doesn't stop there, but also gives you a primer on BI and data warehouse design in general, an introduction to SQL, OLAP and MDX, and much more ... it's really everything you will need to set up proper BI solutions! It is also a great resource for people who have been working with Pentaho for a bit longer.
I do hope that Roland and Jos will keep this Pentaho bible up-to-date with all future major versions of the Pentaho BI Suite.


After reading this book I am really looking forward to the forthcoming book "Pentaho Kettle Solutions" (also written by Roland and Jos in cooperation with Matt Casters). 

Sunday, 13 June 2010

PDI: Clone your data in Kettle! (Video Tutorial)

In this tutorial we look at the cloning feature of Pentaho Data Integration/Kettle and use the power of the JavaScript step. Please watch the videos linked below:


The files for this tutorial: Download the budget data here and the PDI/Kettle file here.


You can find the videos on YouTube:


Video Part 1: click here
Video Part 2: click here
Video Part 3: click here





Please find below some more information:

Code for 1st JavaScript step:

function daysInMonth(iMonth, iYear)
{
  // JavaScript Date months are 0-based, so shift the month by one
  iMonth = iMonth - 1;
  return 32 - new Date(iYear, iMonth, 32).getDate();
}

// default year (assign it before it is used in the call below)
var year = 2010;

var days_in_month = daysInMonth(month, year);
// we already have one row per month, hence one copy less is needed for the clone step
var days_loop_nr = days_in_month - 1;

// number of unsubscribers for the month
var bdgt_unsubs = bdgt_sub_base_start + bdgt_new_subs - bdgt_sub_base_end;




I found a nifty function ("daysInMonth") on the internet that calculates the number of days in a particular month (we are neither JavaScript heroes nor do we have the time). This function provides us with the required number of days in a month, which we assign to the days_in_month variable. The important thing to remember is that we already have one row of data for each month, hence we have to deduct 1 from days_in_month. We store this value in the "days_loop_nr" variable, which we will use for the clone step.

We also add a default year and calculate how many users unsubscribed from our service ("bdgt_unsubs"). We need this figure because later on we want to calculate a daily figure for the subscription base (users subscribe and unsubscribe all the time).
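A quick sanity check of the daysInMonth function (the values are just illustrative):

daysInMonth(2, 2010);  // returns 28, so days_loop_nr ends up as 27
daysInMonth(12, 2010); // returns 31, so days_loop_nr ends up as 30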



Code for the 2nd JavaScript step:

var day_of_month;
// variables used to hold the values of the previous row:
var year_month_old;
var bdgt_sub_base_start_daily_old;
var bdgt_new_subs_daily_old;
var bdgt_unsubs_daily_old;

// define the day of the month
var year_month = (year + "") + (month + "");

if (year_month != year_month_old) {
  // a new month starts, so the day counter is reset
  day_of_month = 1;
}
else {
  day_of_month = day_of_month + 1;
}

// calculate the daily sub base at the start of the day
var bdgt_sub_base_start_daily;
if (month == 1 && day_of_month == 1) {
  bdgt_sub_base_start_daily = bdgt_sub_base_start;
}
else {
  bdgt_sub_base_start_daily = bdgt_sub_base_start_daily_old + bdgt_new_subs_daily_old - bdgt_unsubs_daily_old;
}

// keep the values of the current row for the next iteration:
// assign them to new variables so that they are available in the next row.
// It is important to only save the new value of year_month here at the end,
// as the if statement above has to work with the old value.
year_month_old = year_month;
bdgt_sub_base_start_daily_old = bdgt_sub_base_start_daily;
bdgt_new_subs_daily_old = bdgt_new_subs_daily;
bdgt_unsubs_daily_old = bdgt_unsubs_daily;





We declare some variables first. Then we create the day number for each row using an if condition: we have to start with 1 when a new month begins and then increase the number by 1 for every following row.

It is important to understand the condition:
if(year_month!=year_month_old)

For the very first row of our dataset, the variable year_month_old won’t have a value. Hence, if(year_month!=year_month_old) will be true.

At the very end of the script we create a new variable called year_month_old, which stores the current value of year_month for the next iteration. We only assign this variable at the end of the script, so that the condition in the next iteration can still work with the old value before we overwrite it again.

A variable is only overwritten when a new value is assigned to it; otherwise it keeps its value from the previous row.

So imagine we are in the second row of our dataset: the if(year_month!=year_month_old) condition will effectively be if(201001!=201001), hence the script will increase the value of the day by 1. Once the month changes (i.e. if(201002!=201001)), we start with 1 again.

That's quite a nice solution. Again, the JavaScript code might not be the most elegant, but it works. Please feel free to improve it.

We also have figures that we cannot just divide by the number of days in a month. The subscription base has to be calculated in a more complex way (well, not that complex). The calculation uses the same concept as above: saving the field value of the previous row in a new variable so that you can use it again in the current row.
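Stripped down to its essence, the previous-row pattern looks like this (a minimal sketch with a hypothetical input field current_value; it relies on the fact that variables in the Modified Java Script Value step keep their values from one row to the next):

// declared without an initial value, so an existing value is not reset on the next row
var previous_value;

// on the first row previous_value is still undefined; afterwards it holds the value of the last row
var delta = (previous_value == null) ? 0 : current_value - previous_value;

// store the current value for the next row, after every calculation that needs the old value is done
previous_value = current_value;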









Friday, 4 June 2010

PDI: Full Outer Joins

Pentaho Data Integration: Full Outer Joins

Creating full outer joins in Pentaho Data Integration (PDI, aka Kettle) is fairly straightforward ... let's have a look at it:

In this example we will be looking at data from an online acquisition process. We are facing the problem that the tracking hasn't been implemented correctly; hence, in a few cases we have PIN insertions but no page impressions.

This is a good use case for a full outer join, as we preserve all the data and can demonstrate to management that there is indeed a problem with the tracking setup.

Imagine we have two simple data sets:

Page Impressions

date        service  mk   page impressions
2010-05-01  serv1    123  231
2010-05-01  serv2    443  2
2010-05-01  serv3    234  33

PIN Insertions

date        service  mk   pin insertions
2010-05-01  serv5    33   231
2010-05-01  serv2    443  1
2010-05-01  serv1    123  55

Ok, let's get started: copy the above data sets into two separate sheets of an Excel workbook, save the file and fire up PDI/Kettle.

Create a new transformation. The first thing to do in Kettle is to drag and drop an Excel Input step (you can find it in the Input folder in the Design tab on the left-hand side). Double click on it. Click on "Browse" and choose the file that we just saved. Click "Open". The file dialog closes; then click on "Add" (next to "Browse").

Next, click on the "Sheets" tab, followed by "Get Sheetnames". Mark the sheet name that contains the page impressions, then click ">" to add it to your selection. It should look like this now:

 

Click “ok”.

Go to the "Fields" tab and click on "Get fields from header row(s) ...":

Now we are ready to view the data: Click on “Preview Rows” and check if everything is alright.

Close the preview window and click “ok” to close the step window.

Now mark the "Excel Input" step and copy and paste it. Open the new step and change it so that it imports the other sheet. Make sure that you change the sheet name and the field names! Preview the data to see if everything is ok.

We now add a Sort rows step for each of the Excel Input steps. Connect each Excel Input step to its Sort rows step with a hop by holding down SHIFT and drawing a line from one step to the other.

 

Open the Sort Step and click on “Get fields ...”. Mark the row that contains “page impressions” or “pin insertions” and press DELETE (we obviously don’t want to sort by the measure). Click “Ok”. Do exactly the same for the other Sort Step.

As we have now sorted our data sets, we can join them. Drag and drop a Merge Join Step onto the canvas and connect both Sort Steps to it.

Open the Merge Join step, set the first step to "Sort Rows", the second step to "Sort Rows 2" and the join type to "Full Outer". As we use a full outer join, it doesn't really matter which step you assign as the first or second step.

Now click on “get key fields” and delete the measures. It should look like this now:

Now it is a good idea to preview the data again. Mark the Merge Join step and press F10. In the “Transformation Debug Dialog” click on “Quick Launch”.

You can see that Kettle has successfully joined the data sets. As we chose "Full Outer" as the join type, we see all the data from each input data set. You will also notice that both data sets have field names in common, so Kettle appended "_1" to the field names of the second data set to avoid any confusion.

So now let's clean up this result set a bit to make it look nicer. We will add new fields called date_king, service_king and mk_king, and use the values of date or date_1 (and so on) to populate them. The goal is to have a proper date, service and mk value for each row.
Add a Formula step and connect the Merge Join step to it. Fill out the Formula step as shown in the screenshot below:



Here are the formulas to copy:
IF(ISBLANK([date]);[date_1];[date])
IF(ISBLANK([service]);[service_1];[service])
IF(ISBLANK([mk]);[mk_1];[mk])

What we do here is check whether the field from the first stream is empty and, if so, use the value from the other one.
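If you prefer the Modified Java Script Value step over the Formula step, the same logic could look like this (a minimal sketch assuming the joined field names from above and that missing values arrive as null):

// fall back to the value from the second stream whenever the first one is missing
var date_king    = (date == null)    ? date_1    : date;
var service_king = (service == null) ? service_1 : service;
var mk_king      = (mk == null)      ? mk_1      : mk;
// (remember to add the three new fields in the step's field grid)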


Click “Ok” to close the step configuration.
Now do a quick preview by clicking on the step and pressing F10.

As we don't want to have all the unnecessary data in our final result set, add a "Select values ..." step, create the hop and open the configuration window of this step. Click on "Get fields to select" and delete all fields except the ones shown in the screenshot below:

 

We also rename some of the fields and put them in a different order. Ordering works like this: Mark the row, press ALT and use the arrows to move the row to another position.

Press F10 again and now the result set should look like this:


Now everything looks pretty and nice. If you want to export the result set somewhere, feel free now to add one of the many export steps.

This is, in a nutshell, how to create a full outer join in Kettle. I think you will agree that this is fairly easy to achieve.



Sunday, 9 May 2010

Pentaho 3.2 Data Integration: Beginner's Guide

I received a free copy of this book from the publisher Packt for this review. You can buy the book directly from Packt as a print edition or download a PDF version of it.

http://www.packtpub.com/pentaho-3-2-data-integration-beginners-guide/book?utm_source=diethardsteiner.blogspot.com&utm_medium=bookrev&utm_content=blog&utm_campaign=mdb_003011



I first got to know about Kettle/PDI back in 2008. I was looking for an open source tool that would allow me to introduce better BI for my company: A tool that would allow me to focus on working with the actual data instead of trying to build a tool myself and having a minimum amount of time left to deal with the data.

In all honesty I have to say that Kettle/PDI had a huge and positive impact on the way I work. At the time when I was starting with Kettle, documentation was scarce. I did a lot of trial and error, forum searches, Google searches, etc. This approach required a lot of patience and time.

I am very pleased that there is finally a book about Kettle/PDI available, especially for all those people that are about to start working with Kettle/PDI. In a nutshell, this book will give you all the information that you need to get started with Kettle/PDI quickly and efficiently. 

So let's have a look at the book's content:

In the foreword we learn about the history of the Kettle project. Most of this was quite new to me. Did you know that Kettle actually was an acronym for KDE Extraction Transformation and Loading Environment? Anyways, this is just one of the exciting facts.

The first chapter describes the purpose of PDI, its components, the UI, how to install it and a very simple transformation. Moreover, the last part tells you step by step how to install MySQL on Windows and Ubuntu. 

It's just what you want to know when you touch PDI for the first time. The instructions are easy to follow and understand and should help you get started in no time. I honestly quite like the structure of the book: whenever you learn something new, it is followed by a section that recaps everything, which helps you remember it much more easily.

Maria focuses on using PDI with files instead of the repository, but she offers a description on how to work with the repository in the appendix of the book.

Chapter 2: You will learn how to read data from a text file and how to handle header and footer lines. Next up is a description of the "Select values ..." step, which allows you to apply special formatting to the input fields and select the fields that you want to keep or remove. You will create a transformation that reads multiple text files at once by using regular expressions in the text input step. This is followed by a troubleshooting section that describes all kinds of problems that might happen in the setup and how to solve them.
The last step of the sample transformation is the text file output step.

Then you improve this transformation by adding the "Get system info" step, which allows you to pass parameters to the transformation on execution. This is followed by a detailed description of the data types (I wish I had had all this formatting info so easily at hand when I started). And then it gets even more exciting: Maria talks you through the setup of a batch process (scheduling a Kettle transformation).

The last part of this chapter describes how to read XML files with the XML file input step. There is a short description of XPath which should help you to get going with this particular step easily.

Chapter 3 walks you through the basic data manipulation steps. You set up a transformation that makes use of the Calculator step (loads of fancy calculation examples here). For more complicated formulas Maria also introduces the Formula step. Next in line are the Sort By and Group By steps to create some summaries.
In the next transformation you import a text file and use the Split field to rows step. You then apply the filter step on the output to get a subset of the data. Maria demonstrates various examples of how to use the filter step effectively.
At the end of the chapter you learn how to look up data by using the "Stream Lookup" step. Maria describes very well how this step works (even visualizing the concept), so it should be really easy for everybody to understand.

Chapter 4 is all about controlling the flow of data: You learn how to split the data stream by distributing or copying the data to two or more steps (this is based on a good example: you start with a task list that contains records for various people and then distribute the tasks to a different output for each of these people). Maria explains clearly how "distribute" and "copy" work. The concept is very easy to understand following her examples.
In another example Maria demonstrates how you can use the filter step to send the data to different steps based on a condition. In some cases, the filter step will not be enough, hence Maria also introduces the "Switch/Case" step that you can use to create more complex conditions for your data flow.
Finally, Maria tells you all about merging streams and which approach/step is best to use in which scenario.

In Chapter 5 it gets really interesting: Maria walks you through the JavaScript step. In the first example you use the JavaScript step for complex calculations. Maria provides an overview of the available functions (String, Numeric, Date, Logic and Special functions) that you can use to quickly create your scripts by dragging and dropping them onto the canvas. 
In the following example you use the JavaScript step to modify existing data and add new fields. You also learn how to test your code from within this step. Next up (and very interesting) Maria tells you how to create special start and end scripts (which are only executed one time as opposed to the normal script which is executed for every input row). We then learn how to use the transformation constants (SKIP_TRANSFORMATION, CONTINUE_TRANSFORMATION, etc) to control what happens to the rows (very impressive!). 
In the last example of the chapter you use the JavaScript step to transform an unstructured text file. This chapter offers quite a lot of in-depth information, and I have to say that there were actually some things I didn't know.

In the real world you will not always get the dataset structure in the way that you need it for processing. Hence, chapter 6 tells you how you can normalise and denormalise datasets. Maria made a really huge effort to visualize how these processes work, which really helps in understanding the theory behind them. Maria also provides two good examples that you work through.
In the last example of this chapter you create a date dimension (very useful, as every one of us will have to create one at some point).

Validating data and handling errors is the focus of chapter 7. This is quite an important topic: when you automate transformations, you will have to find a way to deal with errors (so that they don't crash the transformation). Writing errors to the log, aborting a transformation, fixing captured errors and validating data are some of the steps you go through.

Chapter 8 is focusing on importing data from databases. Readers with no SQL experience will find a section covering the basics of SQL. You will work with both the Hypersonic database and MySQL. Moreover Maria introduces you to the Pentaho sample database called "Steel Wheels", which you use for the first example. 
You learn how to set up a connection to the database and how to explore it. You will use the "Table Input" to read from the database as well as the "Table Output" step to export the data to a database. Maria also describes how to parameterize SQL queries, which you will definitely need to do at some point in real world scenarios. 
In the next tutorials you use the Insert/Update step as well as the Delete step to work with tables in the database.

In chapter 9 you learn about more advanced database topics: Maria gives an introduction to data modelling, so you will soon know what fact tables, dimensions and star schemas are. You use various steps to look up data from the database (e.g. the Database lookup step, Combination lookup/update, etc.). You learn how to load slowly changing dimensions of Type 1, 2 and 3. All these topics are excellently illustrated, so it's really easy to follow, even for someone who has never heard about these topics before.

Chapter 10 is all about creating jobs. You start off by creating a simple job and later learn how to use parameters and arguments in a job, how to run jobs from the terminal window and how to run job entries conditionally.

In chapter 11 you learn how to improve your processes by using variables, subtransformations (very interesting topic!), transferring data between transformations, nesting jobs and creating a loop process. These are all more complex topics which Maria managed to illustrate excellently.

Chapter 12 is the last practical chapter: you develop and load a data mart. I would consider this an essential chapter if you want to learn something about data warehousing. The last chapter, 13, gives you some ideas on how to take it even further with Kettle/PDI (plugins, Carte, PDI as a process action, etc.).

The appendix also contains a section that tells you all about working with repositories, Pan and Kitchen, a quick reference guide to steps and job entries, and the new features in Kettle 4.


Conclusion: 

This book certainly fills a gap: it is the first book on the market that focuses solely on PDI. From my point of view, Maria's book is excellent for anyone who wants to start working with Kettle, and even for those who are at an intermediate level. The book takes a very practical approach: it is full of interesting tutorials/examples (you can download the data/code from the Packt website), which is probably the best way to learn something new. Maria also made a huge effort to illustrate the more complex topics, which helps the reader understand each step/process easily.

All in all, I can only recommend this book. It is the easiest way to start with PDI/Kettle and you will be able to create complex transformations/jobs in no time!

You can buy the book on the Packt website.

