
Sunday, 9 May 2010

Pentaho 3.2 Data Integration: Beginner's Guide

Posted on 11:31 by Unknown

I received a free copy of this book from the publisher Packt for this review. You can buy the book directly from Packt in print or download a PDF version of it.

http://www.packtpub.com/pentaho-3-2-data-integration-beginners-guide/book?utm_source=diethardsteiner.blogspot.com&utm_medium=bookrev&utm_content=blog&utm_campaign=mdb_003011



I first got to know about Kettle/PDI back in 2008. I was looking for an open source tool that would allow me to introduce better BI at my company: a tool that would let me focus on working with the actual data instead of building a tool myself and having hardly any time left to deal with the data.

In all honesty I have to say that Kettle/PDI had a huge and positive impact on the way I work. When I was starting out with Kettle, documentation was scarce. I did a lot of trial and error, forum searches, Google searches etc. This approach demanded a lot of patience and time.

I am very pleased that there is finally a book about Kettle/PDI available, especially for all those who are about to start working with it. In a nutshell, this book gives you all the information you need to get started with Kettle/PDI quickly and efficiently.

So let's have a look at the book's content:

In the foreword we learn about the history of the Kettle project. Most of this was quite new to me. Did you know that Kettle was originally an acronym for KDE Extraction, Transportation, Transformation and Loading Environment? Anyway, this is just one of the exciting facts.

The first chapter describes the purpose of PDI, its components and the UI, shows how to install it, and walks you through a very simple transformation. Moreover, the last part tells you step by step how to install MySQL on Windows and Ubuntu.

It's just what you want to know when you touch PDI for the first time. The instructions are easy to follow and understand and should help you get started in no time. I honestly quite like the structure of the book: whenever you learn something new, it is followed by a section that recaps everything, which makes it much easier to remember.

Maria focuses on using PDI with files instead of the repository, but she describes how to work with the repository in the appendix of the book.

In Chapter 2 you learn how to read data from a text file and how to handle header and footer lines. Next up is a description of the "Select values ..." step, which allows you to apply special formatting to the input fields and to select the fields that you want to keep or remove. You will create a transformation that reads multiple text files at once by using regular expressions in the text input step. This is followed by a troubleshooting section that describes all kinds of problems that might happen in the setup and how to solve them.
The last step of the sample transformation is the text file output step.

Then you improve this transformation by adding the "Get system info" step, which allows you to pass parameters to the transformation at execution time. This is followed by a detailed description of the data types (I wish I had had all this formatting info so easily at hand when I started). And then it gets even more exciting: Maria talks you through the setup of a batch process (scheduling a Kettle transformation).
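For readers who like to see where this leads: the book drives such a batch process with Pan and the system scheduler, but the same transformation can also be launched from Java code. Below is a minimal, hypothetical sketch using the PDI Java API (class names as in PDI 4.x; PDI 3.2 initialises the engine slightly differently, and the .ktr file name is made up) that passes a command-line style argument which a "Get system info" step can then pick up.

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

// Hypothetical runner: executes sales_report.ktr and passes one argument,
// which a "Get system info" step inside the transformation can read.
public class RunSalesReport {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                           // start the Kettle engine (PDI 4.x call)
        TransMeta meta = new TransMeta("sales_report.ktr"); // hypothetical transformation file
        Trans trans = new Trans(meta);
        trans.execute(new String[] { "2010-05-09" });       // command-line argument 1
        trans.waitUntilFinished();
        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}

In practice you would, as the book shows, simply schedule Pan with cron or the Windows scheduler instead of writing your own runner.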

The last part of this chapter describes how to read XML files with the XML file input step. There is a short description of XPath, which should help you get going with this particular step easily.
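If you want to experiment with XPath outside of Spoon, the kind of expressions the step expects can be tried out with the XPath engine that ships with the JDK. The sketch below uses a made-up products.xml and element names purely for illustration; it is not the book's sample data.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Evaluates XPath expressions of the kind you would enter in the XML input step:
// a "loop" path that selects the repeating element, plus relative field paths.
public class XPathDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("products.xml");             // hypothetical file
        XPath xpath = XPathFactory.newInstance().newXPath();

        NodeList products = (NodeList) xpath.evaluate(
                "/catalog/product", doc, XPathConstants.NODESET);        // one row per <product>

        for (int i = 0; i < products.getLength(); i++) {
            String name  = xpath.evaluate("name", products.item(i));     // field paths relative to the loop node
            String price = xpath.evaluate("price", products.item(i));
            System.out.println(name + " -> " + price);
        }
    }
}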

Chapter 3 walks you through the basic data manipulation steps. You set up a transformation that makes use of the calculator step (loads of fancy calculation examples here). For more complicated formulas Maria also introduces the formula step. Next in line are the Sort By and Group By steps, which you use to create some summaries.
In the next transformation you import a text file and use the Split field to rows step. You then apply the filter step to the output to get a subset of the data. Maria demonstrates various examples of how to use the filter step effectively.
At the end of the chapter you learn how to look up data by using the "Stream Lookup" step. Maria describes very well how this step works (even visualizing the concept), so it should be really easy for everybody to understand.

Chapter 4 is all about controlling the flow of data: you learn how to split the data stream by distributing or copying the data to two or more steps (this is based on a good example: you start with a task list that contains records for various people and then distribute the tasks to a separate output for each of these people). Maria explains clearly how "distribute" and "copy" work, and the concept is very easy to understand following her examples.
In another example Maria demonstrates how you can use the filter step to send the data to different steps based on a condition. In some cases the filter step will not be enough, hence Maria also introduces the "Switch/Case" step, which you can use to create more complex conditions for your data flow.
Finally Maria tells you all about merging streams and which approach/step is best to use in which scenario.

In Chapter 5 it gets really interesting: Maria walks you through the JavaScript step. In the first example you use the JavaScript step for complex calculations. Maria provides an overview of the available functions (String, Numeric, Date, Logic and Special functions) that you can drag and drop onto the canvas to quickly create your scripts.
In the following example you use the JavaScript step to modify existing data and add new fields. You also learn how to test your code from within this step. Next up (and very interesting), Maria tells you how to create special start and end scripts (which are executed only once, as opposed to the normal script, which is executed for every input row). We then learn how to use the transformation constants (SKIP_TRANSFORMATION, CONTINUE_TRANSFORMATION, etc.) to control what happens to the rows (very impressive!).
In the last example of the chapter you use the JavaScript step to transform an unstructured text file. This chapter offers quite some in-depth information, and I have to say that there were actually some things I didn't know.

In the real world you will not always get the dataset structure in the way that you need it for processing. Hence, chapter 6 tells you how you can normalise and denormalise datasets. I have to say that Maria put a really huge effort into visualizing how these processes work, which really helps to understand the theory behind them. Maria also provides two good examples that you work through.
In the last example of this chapter you create a date dimension (very useful, as every one of us will have to create one at some point).
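For anyone who has never built one: a date dimension is simply a table with one row per calendar day plus a set of derived attributes. In Kettle you would typically generate it with a row-generating step followed by calculator/JavaScript steps; the plain-Java sketch below (column set chosen by me, not taken from the book) only illustrates what those rows look like.

import java.time.LocalDate;
import java.time.format.TextStyle;
import java.util.Locale;

// Prints one date-dimension row per day of 2010: date key, year, quarter,
// month name and weekday. Real dimensions usually carry many more attributes.
public class DateDimensionSketch {
    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2010, 1, 1);
        LocalDate end = LocalDate.of(2010, 12, 31);
        while (!day.isAfter(end)) {
            int quarter = (day.getMonthValue() - 1) / 3 + 1;
            System.out.printf("%s,%d,Q%d,%s,%s%n",
                    day,
                    day.getYear(),
                    quarter,
                    day.getMonth().getDisplayName(TextStyle.FULL, Locale.ENGLISH),
                    day.getDayOfWeek().getDisplayName(TextStyle.FULL, Locale.ENGLISH));
            day = day.plusDays(1);
        }
    }
}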

Validating data and handling errors is the focus of chapter 7. This is quite an important topic: when you automate transformations, you have to find a way to deal with errors so that they don't crash the transformation. Writing errors to the log, aborting a transformation, fixing captured errors and validating data are some of the steps you go through.

Chapter 8 focuses on importing data from databases. Readers with no SQL experience will find a section covering the basics of SQL. You will work with both the Hypersonic database and MySQL. Moreover, Maria introduces you to the Pentaho sample database called "Steel Wheels", which you use for the first example.
You learn how to set up a connection to the database and how to explore it. You will use the "Table Input" step to read from the database as well as the "Table Output" step to export data to a database. Maria also describes how to parameterize SQL queries, which you will definitely need to do at some point in real-world scenarios.
In the next tutorials you use the Insert/Update step as well as the Delete step to work with tables on the database.
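If you are wondering what a parameterized query looks like: Kettle's Table Input step can replace "?" placeholders with values coming from a previous step, very much like a JDBC prepared statement. The plain-JDBC sketch below is only an analogy (the connection URL, table and column names are made up), not the Kettle step itself.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Plain-JDBC analogy of a parameterized Table Input query: the "?" is filled
// in at run time, just as Kettle fills it with a field from the previous step.
// (The MySQL JDBC driver needs to be on the classpath.)
public class ParameterizedQuery {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/sampledata", "user", "password");   // hypothetical connection
             PreparedStatement ps = con.prepareStatement(
                "SELECT ordernumber, total FROM orders WHERE orderdate >= ?")) { // hypothetical table
            ps.setString(1, "2010-01-01");                                       // parameter value
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " " + rs.getDouble(2));
                }
            }
        }
    }
}

Inside Kettle you simply type the query with the "?" into the Table Input step and point it at the step that supplies the parameter values.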

In chapter 9 you learn about more advanced database topics: Maria gives an introduction to data modelling, so you will soon know what fact tables, dimensions and star schemas are. You use various steps to look up data from the database (i.e. the Database lookup step, Combination lookup/update, etc.). You learn how to load slowly changing dimensions of Type 1, 2 and 3. All these topics are excellently illustrated, so it's really easy to follow, even for someone who has never heard of these topics before.

Chapter 10 is all about creating jobs. You start off by creating a simple job and later learn how to use parameters and arguments in a job, how to run jobs from the terminal window and how to run job entries conditionally.
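As with transformations, the usual way to run a job outside of Spoon is Kitchen from the terminal, which is exactly what the book shows. Just for completeness, here is the job counterpart of the earlier transformation sketch: a hypothetical runner using the PDI Java API (again, names as in PDI 4.x, and the .kjb file name is made up).

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

// Rough equivalent of "kitchen.sh -file=load_files.kjb": load the job
// definition, run it and check the error count afterwards.
public class RunLoadFilesJob {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        JobMeta jobMeta = new JobMeta("load_files.kjb", null);  // hypothetical job file, no repository
        Job job = new Job(null, jobMeta);
        job.start();
        job.waitUntilFinished();
        if (job.getErrors() > 0) {
            throw new RuntimeException("Job finished with errors");
        }
    }
}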

In chapter 11 you learn how to improve your processes by using variables, subtransformations (a very interesting topic!), transferring data between transformations, nesting jobs and creating a loop process. These are all more complex topics, which Maria manages to illustrate excellently.

Chapter 12 is the last practical chapter: you develop and load a datamart. I would consider this a very essential chapter if you want to learn something about data warehousing. The last chapter, 13, gives you some ideas on how to take it even further with Kettle/PDI (plugins, Carte, PDI as a process action, etc.).

In the appendix you also find a section that tells you all about working with repositories, Pan and Kitchen, a quick reference guide to steps and job entries, and the new features in Kettle 4.


Conclusion: 

This book certainly fills a gap: it is the first book on the market that focuses solely on PDI. From my point of view, Maria's book is excellent for anyone who wants to start working with Kettle and even for those at an intermediate level. The book takes a very practical approach: it is full of interesting tutorials/examples (you can download the data/code from the Packt website), which is probably the best way to learn something new. Maria also made a huge effort to illustrate the more complex topics, which helps the reader understand each step/process easily.

All in all, I can only recommend this book. It is the easiest way to start with PDI/Kettle, and you will be able to create complex transformations/jobs in no time!

You can buy the book on the Packt website.


Wednesday, 21 April 2010

Pentaho Community Data Access (CDA)

Posted on 07:38 by Unknown
Pedro Alves has done some impressive work on the new CDA (Community Data Access):
CDA allows you to access any of the various Pentaho data sources without worrying about the details. It can be used as a standalone plugin on the Pentaho BI server that outputs the result in several formats, or it can be used in combination with the Dashboard Editor and the CDF.
It basically covers any source that you can use within PRD (Pentaho Report Designer). Additionally, it has a security layer that prevents code injection.
Moreover, it has a caching layer (only Mondrian provides caching out of the box).
CDA also offers unions, joins (full outer join), column selection, formula addition and column renaming. The following output formats are available: JSON, XML, CSV and XLS.

CDA uses XML files to define the data access, which can be created/edited in the CDA Editor (the editor doesn't work in IE). Results can be previewed via the CDA Previewer.

A CDA file consists of two parts: the connection details and the query itself.
By using JNDI you can reuse the connections that you have set up in the Pentaho Administration Console.
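To give an idea of what the "standalone plugin" usage looks like: other applications can pull data out of a published .cda file over HTTP. The sketch below is a rough Java client; the doQuery URL and its path/dataAccessId/outputType parameters are how I remember the CDA plugin working (check the webcast/docs to be sure), authentication is omitted, and the server address and file path are made up.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Fetches one CDA query result as JSON from the Pentaho BI server and prints it.
// URL pattern and parameter names are assumptions about the CDA plugin's API.
public class CdaClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/pentaho/content/cda/doQuery"
                + "?path=/mysolution/queries/sales.cda"   // hypothetical .cda file in the solution repository
                + "&dataAccessId=salesByCountry"          // id of the query inside the .cda file
                + "&outputType=json");                    // could also be xml, csv or xls
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}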

CDA can then be used as input for the Dashboard Editor.
For all charts that are displayed within the CDA plugin pages, CDF is actually called to generate them.

Pedro is working on releasing a new version of CDF with CDA completely integrated.

Watch the webcast for more details.


Thursday, 15 April 2010

Review coming for "Pentaho 3.2 Data Integration: Beginner's Guide"

Posted on 07:06 by Unknown

I am quite excited to let you know that Packt will kindly send me a copy of their new book "Pentaho 3.2 Data Integration: Beginner's Guide" (by María Carina Roldán) to review. I should have it in my post box in the next few days and will then focus on it. So stay tuned!



Wednesday, 14 April 2010

New books on Pentaho Data Integration (Kettle)

Posted on 01:55 by Unknown
Good news for everybody who wants to get started with Pentaho Data Integration: two books are in the pipeline.
The first one, "Pentaho 3.2 Data Integration: Beginner's Guide" by María Carina Roldán, will be published in the next few days. I'll try to get a copy of it. The TOC promises quite an interesting book!
Later this year "Pentaho Kettle Solutions" will be available. This book will be based on the forthcoming PDI 4. It's written by the same team (Roland Bouman, Jos van Dongen) that brought you the excellent "Pentaho Solutions".

Thursday, 4 February 2010

MySQL breakfast presentation on data warehousing

Posted on 08:17 by Unknown
Today I attended a Sun/MySQL presentation in London that focused on data warehousing with MySQL, Infobright and Talend. Overall the presentation gave quite a good overview. The concepts of data packs and the knowledge grid in Infobright seem quite interesting, and not having to worry about indexing is another nice point. I have to look at how it compares to LucidDB or MonetDB at some point when I find some time. I might give the Infobright Community Edition a try.
It was also interesting to see another open source ETL tool. From my point of view the interface of Talend doesn't look as user-friendly as Pentaho Kettle's, but maybe that's down to the fact that I have been working with Kettle for a long time now. In essence, Kettle will remain my ETL tool of choice.

Saturday, 23 January 2010

How to create a loop in Pentaho Kettle

Posted on 06:37 by Unknown
I have finished my first ever video tutorial! This video demonstrates how easy it is to create a loop in Pentaho Kettle. Enjoy!




Thursday, 10 December 2009

Rethinking the Pentaho Report Designer Layout

Posted on 14:19 by Unknown


The Pentaho Report Designer (PRD) has evolved into a very feature-rich product. In this article I want to point out one problem that I still have with it from a usability point of view.

Imagine the following scenario: you are about to create a new report for your CEO. In a nutshell, the report should first show a summary cross-tab with the essential KPIs and then, below it, some more detailed product data in a standard table (let's keep it simple).

Currently you would create the product report/table in the body of your main report and then add a subreport to the header, which references the cross-tab that lives in a separate report file. There may well be technical reasons why PRD is set up like this, but from a usability point of view it is just not an ideal solution. And this is a simple example; imagine if we had to include several more subreports in the main report. Another problem with subreports is that you don't really see in the main report whether the layout of the subreport is in harmony with the rest of the main report. You have to execute the report in order to see this, then go back to the subreport, execute it again, check if it is fine, and if not go back again.

Now, it is all too easy to sit here and write these lines when you are not involved in developing PRD. I am writing this article as constructive criticism because I am a big fan of PRD and want to see it become the best report designer. For me, from a user perspective, it would be way easier if PRD offered report-type elements that you could drag and drop onto the canvas, just as you do now in PRD 3.5 with labels, charts etc. This way you would not be forced straight away to follow a specific given structure (imagine the case where I only have one chart in the header of the main report and the report body/details section is completely unused).

So let's say we had report elements like "Crosstab report" or "Classic table report" that you just drag and drop onto the canvas when you need them. Within those elements you define all the necessary settings and create all the other required elements. We would do all of this in the same window (no need to switch to a different one), control all our data connections in one place and see the whole design in one place.

Over time Pentaho might introduce various other report-type elements that you could then also just drag and drop onto the canvas. Overall I think this approach would make designing a report much easier. PRD 3.5 was a huge step forward and I am positive that we will see great new features in the next versions.
