
Friday, 24 May 2013

Pentaho Report Designer: How to show the parameter display name in your report when it is different from the parameter value

Posted on 10:47 by Unknown
One of my blog's readers just asked me quite an interesting question: How can I show the parameter display name in my Pentaho report if it is different from the parameter value?


Note: Just to clarify, the scenario covered here is when the parameter value and the display name are different, for example when the parameter value is set to an id field and the display name to a descriptive field. If the parameter value and display name are set to the same field, you can simply drag and drop the parameter name onto your report.


So in our case we defined a parameter called PARAM_OFFICECODE. We set the Parameter Value to OFFICECODE (which happens to be an id) and the Parameter Display Name to CITY. We want to use the OFFICECODE to constrain the result set of our main report query (in our case this works better because there happens to be an index on this database table column).

In the report we would like to show the selected office name (CITY) in the header ... but how do we do this? We cannot simply drag and drop the PARAM_OFFICECODE element onto the report header, because it would only display the id (OFFICECODE) and not the display name (CITY).

You might think there should be an easy solution to this … and right you are. It’s just not as easy as it could be, but quite close …


So I quickly put together a bare-bones example (don’t expect any fancy report layout … we just want to see if we can solve this problem):


Our parameter:
So if we placed this parameter element on the main report, we would just see the OFFICECODE when we ran the report. So how do we get the display name?


  1. If it is possible to access the name field (in our case CITY) via the SQL query, we could change our main report SQL query and add it there as a new field. But this is not very efficient, right?
  2. We could create a new query which takes the code/id (in our case OFFICECODE) as a parameter and returns the name (CITY) and then run this query in a sub-report which could return the value to the main report (this is in fact what you had to do some years back). Well, not that neat either.
  3. Here comes the savior: The SINGLEVALUEQUERY formula function. You can find this one in the Open Formula section. Thomas posted some interesting details about it on his blog some time ago.


Basically for a very long time we had the restriction that we could only run one query to feed data to our report. With the SINGLEVALUEQUERY and MULTIVALUEQUERY formula functions you can run additional queries and return values to the main report.


So here we go … to retrieve the display value:
  1. We create an additional query called ds_office_chosen which is constrained by the code/id and returns the (display) name: SELECT city AS office_chosen FROM offices WHERE officecode = ${param_officecode}
  2. We create a new formula element called formula_office_chosen and reference the query ds_office_chosen: =SINGLEVALUEQUERY("ds_office_chosen")
  3. We can now use formula_office_chosen in our report:


Once this is set up, we can run the report and the display name of the chosen parameter value will be shown:
My very simple sample report can be downloaded from here.
Posted in "Pentaho Report Designer", Parameters | No comments

Friday, 17 May 2013

New London Pentaho Usergroup meetup

Posted on 10:30 by Unknown
It's been a long time since the last London Pentaho Usergroup meeting happened, so it's good to see that Dan Keeley and Pedro Alves are trying to breathe new life into the Usergroup by organizing a new meetup. So if you live in or around London or happen to visit London on the 20th of June, make sure you stop by (details here on the Meetup website).
It's a great opportunity to get to know key members of the Pentaho Community as well as supporters and fans and to share your ideas with them.
Matt Casters, the founder of Kettle, will be presenting how to use Pentaho Kettle (PDI) to create MapReduce jobs via an easy-to-use graphical interface. It's a unique opportunity to learn about this!
I hope to see some of you there and to have an interesting discussion about data integration, business intelligence etc. with you!
Posted in | 1 comment

Wednesday, 10 April 2013

Advanced routing in Pentaho Kettle jobs

Posted on 11:39 by Unknown

In this article we will take a look at how to create some complex routing conditions for a Pentaho Data Integration (Kettle) job.


Out of the box, Kettle already comes with several easy-to-use conditional job entries:

In some situations, though, you might need a bit more flexibility; this is when the JavaScript job entry comes into play:
This one is found in the Scripting folder. The name used in the configuration dialog of this particular job entry is, from my point of view, actually better suited: Evaluating JavaScript.

We will look at a very trivial example:
In this job flow we only want to execute the Write To Log Sunday job entry if the day of the week is a Sunday. On all other days we want to execute the job entry Write to Log.

The Evaluating JavaScript job entry is configured as shown in the screenshot below:
Note that you can write multiple lines of code, but you must make sure that the return value is a boolean value!

In case you want to create this example yourself, please find below the JavaScript code:
var d = new Date();
// getDay() returns 0 for Sunday, 1 for Monday, ..., 6 for Saturday
var dof = d.getDay();
// the last expression must evaluate to a boolean: true only on Sundays
dof == 0;

Running this ETL process on a Wednesday will show the following in the log:

As you can see, it is rather simple to create more complex conditions, and the bonus is that you can make use of a scripting language which you probably already know: JavaScript.

More information about this job entry can be found on the Pentaho Wiki.

You can download the sample job file from here. This file was created in PDI 4.4 stable, which means that you should only open it in PDI 4.4 or newer.
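If you prefer running the downloaded sample job from the command line rather than from Spoon, something along the following lines should work with Kitchen (just a sketch: the paths and file name below are placeholders, not the actual location of the sample file):

# run the sample job with Kitchen and write a basic log
/path/to/pdi/kitchen.sh -file=/path/to/jb_advanced_routing.kjb -Level=Basic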

Posted in ETL, Pentaho Data Integration, Routing | No comments

Wednesday, 20 March 2013

Partitioning data on clustered Pentaho Kettle ETL transformations

Posted on 08:38 by Unknown

This is the second article on clustering ETL transformations with Pentaho Kettle (Pentaho Data Integration). It is highly recommended that you read the first article Creating a clustered transformation in Pentaho Kettle before continuing with this one. Make sure that the slave and master servers are running and the cluster schema is defined - as outlined in the first article.

Prerequisites:

  • Current version of PDI installed.
  • Download the sample transformations from here.

How to create a partitioning schema

Create a new transformation (or open an existing one). Click on the View tab on the left hand side and right click on Partition schemas. Choose New:
In our case we want to define a dynamic schema. Tick Dynamically create the schema definition and set the Number of partitions by slave server to 1:

How to assign the partition schema

Right click on the step that you want to assign the partition schema to and choose Partitioning.
You will be given following options:
For our purposes we want to choose Remainder of division. In the next dialog choose the partitioning schema you created earlier on:
Next specify which field should be used for partitioning. In our case this is the city field:
That’s it. Now partitioning will be dynamically applied to this step.

Why apply data partitioning on distributed ETL transformation?

As we have 2 slave servers running (setup instructions can be found in the first article), the data will be dynamically partitioned into 2 sets based on the city field. So even if we do an aggregation on the slave servers, we will derive a clean output set on the master. To be more precise: if we don’t use partitioning in our transformation, each slave server would receive data in a round robin fashion (randomly), so each data set could contain records for New York, for example. Each slave creates an aggregate, and when we combine the data on the master we can possibly end up with two aggregates for New York. This would then require an additional sort and aggregation step on the master to arrive at a final clean aggregate. To avoid this kind of scenario, it is best to define data partitioning, so that each slave server receives a “unique” set of data. Note that this is just one reason why you should apply partitioning.

No partitioning schema applied:
With partitioning schema applied:
Notice the difference between the two output datasets!

Also note the additional red icon [Dx1] in the above screenshot of the transformation. This indicates that a partitioning schema is applied to this particular step.

At the end of this second article I hope that you got a good overview of the Pentaho Kettle clustering and partitioning features, which are very useful when you are dealing with a lot of data. My special thanks go to Matt and Slawo for shedding some light on this very interesting functionality.
Posted in | No comments

Creating a clustered transformation in Pentaho Kettle

Posted on 02:54 by Unknown

Prerequisites:



  • Current version of PDI installed.
  • Download the sample transformations from here.


Navigate to the PDI root directory. Let’s start three local carte instances for testing (Make sure these ports are not in use beforehand):

sh carte.sh localhost 8077
sh carte.sh localhost 8078
sh carte.sh localhost 8079
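If you want to verify beforehand that these ports are really free, a quick check along the following lines should do (just a minimal sketch for a Unix-like system, assuming nc is available):

# check whether anything is already listening on the Carte ports
for port in 8077 8078 8079; do
  if nc -z localhost "$port"; then
    echo "Port $port is already in use - choose a different one"
  else
    echo "Port $port is free"
  fi
done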

In PDI Spoon create a new transformation.

Click on the View tab on the left-hand side, right click on Slave server and choose New. Add the Carte servers we started earlier on one by one and define one of them as the master. Note that the default Carte user is cluster and the default password is cluster.
Next right click on Kettle cluster schemas and choose New.
Provide a Schema name and then click on Select slave servers. Mark all of them in the pop-up window and select OK.
Next we want to make sure that Kettle can connect to all of the carte servers. Right click on the cluster schema you just created and choose Monitor all slave servers:
For each of the servers Spoon will open a monitoring tab/window. Check the log in each monitoring window for error messages. 

Additional info: Dynamic clusters
If the slave servers are not all known upfront and can be added or removed at any time, Kettle also offers a dynamic cluster schema. A typical use case is running a cluster in the cloud. With this option you can also define several slave servers for failover purposes. Take a look at the details on the Pentaho Wiki.

If Kettle can connect to all of them without problems, proceed as follows:

How to define clustering for a step

Add a Text input step for example.
Right click on the Text input step and choose Clustering.
In the Cluster schema dialog choose the cluster schema you created earlier on:
Click OK.
Note that the Text input step has a clustering indicator now:
Note: Only the steps to which you assign the cluster schema this way will run on the slave servers. All other steps will run on the master server.

Our input dataset:

Creating swimlanes

In this example we will be reading the CSV files directly from the slave servers. All the steps will be executed on the slaves (as indicated by the Cx2). 

To run the transformation on our local test environment, click the execute button and choose Execute clustered:

The last option Show transformations is not necessary for running the transformation, but helps to understand how Kettle creates individual transformations for your slave servers and master server in the background.

As we test this locally, the same input file will be read twice (we have two slave servers running locally and one master server) and the results will be written to the same output file, hence we see the summary twice in the same file:


Debugging: Observe the logs of the slave and master servers, as the main transformation log in Spoon (v4.4) doesn’t seem to provide you with error logs/messages for clustered execution. So always monitor the server logs while debugging!
Preview: If you perform preview on a step, a standard (non-clustered) transformation will be run.
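As a complement to the debugging tip above, you can also query each Carte instance's status page directly from the command line (a rough sketch, assuming the default cluster/cluster credentials and the local ports used in this example):

# print the status page of each local Carte instance
for port in 8077 8078 8079; do
  echo "--- Carte on port $port ---"
  curl -s -u cluster:cluster "http://localhost:$port/kettle/status/"
done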


Summarizing all data on the master

Now we will change the transformation so that the last 3 steps run on the master (notice that these steps do not have a clustering indicator):
If we execute the transformation now, the result looks like this:
So as we expect, all the data from all the slaves is summarized on the master.

Importing data from the master

Not in all cases will the input data reside on the slave servers, hence we will explore a way to input the data from the master:

Note that in this case only the Dummy step runs on the slave server.

Here is the output file:
So what happens is that the file is read on the master, the records are distributed to the Dummy steps running on the slave servers and then aggregated on the master again.

My special thanks go to Matt and Slawo for shedding some light on this very interesting functionality.

Posted in Clustered transformation, Pentaho Data Integration, Pentaho Kettle | No comments

Thursday, 7 March 2013

Pentaho Kettle (PDI): Get Pan and Kitchen Exit Code

Posted on 14:26 by Unknown
Various monitoring applications require the exit code/status of a process as an input.

A simple example (test1.sh):

#!/bin/bash
echo "Hi"
exit $?

Let’s run it:
$ ./test1.sh
Let’s check the exit status (of the last command) which can be accessed via $?:
$ echo $?
0

Let’s take a look at how we can get the exit status from Pan and Kitchen:

For demonstration purposes we create a very simple dummy transformation which just outputs some data to the log:
Now create a shell file:
#!/bin/bash
/opt/pentaho/pdi/pdi-ce-4.4.0-stable/pan.sh -file='/home/dsteiner/Dropbox/pentaho/Examples/PDI/exit_code/tr_dummy.ktr' -Level=Basic > /home/dsteiner/Dropbox/pentaho/Examples/PDI/exit_code/err.log
echo $?

Note the echo $? in the last line, which will print the exit status. This is for demonstration purposes only; normally you would use exit $? instead.

On Windows use instead:
echo %ERRORLEVEL%

Now let's run the shell script:
The exit status tells us that the transformation was executed successfully.

Next we will introduce an error into the transformation. I just add a formula step with a wrong formula:
We run the shell script again and this time we get a return code other than 0:
Any return code other than 0 means it is an error.

Please find below an overview of all the return codes (src1, src2):

  • 0: The job ran without a problem
  • 1: Errors occurred during processing
  • 2: An unexpected error occurred during loading/running of the job/transformation: an error in the XML format, reading the file, problems with the repository connection, ...
  • 3: Unable to connect to a database, open a file or another initialization error
  • 7: The job/transformation couldn't be loaded from XML or the repository
  • 8: Error loading job entries, steps or plugins (mostly an error in loading one of the plugins): one of the plugins in the plugins/ folder is not written correctly or is incompatible. You should never see this anymore; if you do, it is an installation problem with Kettle.
  • 9: Command line usage printing
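If your wrapper script should react to these codes rather than just print them, a small sketch like the one below could be used (the Pan path and transformation file are placeholders, to be adjusted to your own environment):

#!/bin/bash
# run the transformation with Pan and capture its exit code
/path/to/pdi/pan.sh -file=/path/to/your_transformation.ktr -Level=Basic > /tmp/pan.log 2>&1
status=$?

case $status in
  0) echo "Transformation finished without a problem" ;;
  1) echo "Errors occurred during processing" ;;
  2|3|7|8) echo "Loading or initialization problem - check /tmp/pan.log" ;;
  9) echo "Command line usage was printed - check the arguments" ;;
  *) echo "Unknown exit code: $status" ;;
esac

# pass the original exit code on to the calling process (e.g. a scheduler)
exit $status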

Posted in Pentaho Data Integration, Pentaho Kettle | No comments

Sunday, 3 February 2013

Mondrian: Consequences of not defining an All member

Posted on 10:40 by Unknown
To come straight to the point: if you do not define an All member for a hierarchy, Mondrian will implicitly create a slicer with the default member of the dimension … this even happens if you do not mention the dimension at all in your MDX query!

For example, take the following MDX:
SELECT
[Measures].[Sales] ON 0,
[Sales Channels].[Sales Channel].Children ON 1
FROM
[Sales]

If we take a look at the SQL that Mondrian generates, we suddenly see that it tries to restrict on the year 2012 in the join condition:


Why is this happening? The reason lies in the fact that one of the hierarchies of the date dimension does not have an All member. So Mondrian tries to find the first member of this hierarchy (as this is the default member), which happens to be a [Year] member. And as in this case I only had data for the year 2012 in the date dimension table, that year was used in the join.

<Hierarchies>
  <Hierarchy name="Time" hasAll="false">
    <Level attribute="Year"/>
    <Level attribute="Quarter"/>
    <Level attribute="Month"/>
    <Level attribute="Day"/>
  </Hierarchy>
  <Hierarchy name="Weekly" hasAll="true">
    <Level attribute="Year"/>
    <Level attribute="Week"/>
    <Level attribute="Weekday"/>
  </Hierarchy>
</Hierarchies>


Note that if we use a hierarchy of the Date dimension in the MDX, then everything works as expected:


So it is really important to keep in mind what consequences not defining an All member has!
Posted in JasperSoft, Mondrian, multidimensional modeling, OLAP, Pentaho, Saiku | No comments