Community Server

  • Subscribe to our RSS feed.
  • Twitter
  • StumbleUpon
  • Reddit
  • Facebook
  • Digg

Monday, 28 September 2009

Pentaho Kettle: Using Full Outer Join

Posted on 07:57 by Unknown

Today I came across a special use case for an outer join. I was planning to set up an ETL process that imports yesterday's revenue summary by summarizing the input of two tables.

The main problem was, that we might not have made any revenue yesterday at all, or there might be only revenue in one table but not the other.


Basically, if revenue figures exist in both tables, the process would be following:
- input data from table 1: date, service name, revenue_cars
- input data from table 2: date, service name, revenue_carparts

- join the tables, so that the output looks like: date, service name, revenue_cars, revenue_carparts
The problem is, as mentioned, that not every day data will be available in both tables. So I would have to implement a logic like this one:

  • IF input data table 1 NOT EMPTY AND input data table 2 EMPTY THEN output only data of table 1.
  • IF input data table 2 NOT EMPTY AND input data table 1 EMPTY THEN output only data of table 2.
  • IF input data table 1 NOT EMPTY AND input data table 2 NOT EMPTY THEN JOIN everything and output it.

All this can be achieved by using a FULL OUTER JOIN.

So your process in Kettle should be as follows:

  1. Drag and drop two "table input" steps into the working area and define the settings. Our query looks like this:For table1:
    SELECT
    date,
    service_name,
    COUNT(*) AS count
    FROM
    table1
    GROUP BY 1,2
    For table2:
    SELECT
    date,
    service_name,
    COUNT(*) AS count,
    SUM(rev) AS rev_sum
    FROM
    table2
    GROUP BY 1,2
  2. For each of them add a sort step and connect them with the table input. Make sure you sort the relevant fields in ascending order (in our case this is date and service_name).
  3. Drag and drop a Join step on the working area, connect them with the two sort steps and then define the two sort steps as input.




In this screenshot you can also see a Group By Step. This one is not in use at all. But in case you want to check if there is any output of this step, the group by step has an option called "Always give back a result row", which, when ticked, will set the aggregation to 0 (Zero). This might be useful in some cases. This is different to the Detect Empty Stream step, which only gives back a completely empty row in any case.

Ok, so now we have the FULL OUTER JOIN working. Have a look at the output by exporting it into a text or Excel file. You will see that in case both result sets didn't match, the number fields (count and rev_sum) are blank. As we want to use these fields later on in calculations, we have to assign a default value in case the number field is null.

Insert a If field value is null step, connect it to the previous step and double click on it afterwards. Tick "Select fields" and in the fields table add  count and rev_sum and assign a default value of 0 (Zero) to them.


Another thing that you will also have recognized by looking at the output is that we have now two additional fields: service_name_1 and date_1.

So now we have to do a check. In case there are matching values in both results, both

  • date and data_1 and
  • service_name and service_name_1
will be populated. If it is not in both, only one of them will be populated. So we are going to create a new variable called service_name_new and date_new that are going to replace the old ones in a "Modified Java Script Value" Step. Drag and drop this step on the working area and connect it with the Merge Join Step. Douple click on the Modified Java Script Value  Step and insert following script plus tick "Compatibility Mode":

if (service_name.isNull()) {
var service_name_new=service_name_1;
}
else
{
var service_name_new=service_name;
}

if (date.isNull()) {
var date_new=date_1;
}
else
{
var date_new=date;
}

Add service_name_new and date_new as output fields and define the type. Then add a Select values Step and add all the fields, but not service_name_1, service_name, date, date_1. Add another step to export it into Excel in example. Our whole process should look like this now (please ignore that some connection have the copy sign, this will not be the case with your process):

In this small tutorial we have now set up a nicely working FULL OUTER JOIN. I hope that you can implement this in your ETL processes now without any problems.
Email ThisBlogThis!Share to XShare to Facebook
Posted in | No comments
Newer Post Older Post Home

0 comments:

Post a Comment

Subscribe to: Post Comments (Atom)

Popular Posts

  • Pentaho Kettle Parameters and Variables: Tips and Tricks
    Pentaho Kettle Parameters and Variables: Tips and Tricks This blog post is not intended to be a formal introduction to using parameters and ...
  • Using Parameters in Pentaho Report Designer
    Using Parameters in Pentaho Report Designer Introduction How to define a parameter Additional info about the new parameter dialog Hidden Par...
  • Pentaho Data Integration: Scheduling and command line arguments
    Pentaho Data Integration (Kettle): Command line arguments and scheduling Tutorial Details Software: PDI/Kettle 4.1 (download here ), MySQL S...
  • Jaspersoft iReport: How to pass a parameter to a sub-dataset
    Jaspersoft iReport: How to pass a parameter to a sub-dataset Let’s say our main report is grouped by continent and the details band holds sa...
  • Using regular expressions with Pentah...
    Using regular expressions with Pentaho Data Integration (Kettle) There are quite some transformations steps that allow you to work with regu...
  • Pentaho Data Integration and Infinidb Series: Bulk Upload
    Pentaho Data Integration and InfiniDB Series: Bulk Upload Introduction Prepare Tables Using mainly Kettle steps Check if file exists Setup I...
  • Pentaho Data Integration: Remote execution with Carte
    Pentaho Data Integration: Remote execution with Carte Tutorial Details Software: PDI/Kettle 4.1 (download  here ), installed on your PC and ...
  • How to create a loop in Pentaho Kettle
    I finished my first ever video tutorial! This video will demonstrate you how easy it is to create a loop in Pentaho Kettle. Enjoy!
  • Metadata Driven ETL and Reporting
    Metadata Driven ETL and Reporting with Pentaho Data Integration and Report Designer Tutorial Details  Software : If PDI Kettle 4.2 GA and PR...
  • Understanding the Pentaho Kettle Dimension Insert/Update Step Null Value Behaviour
    We will be using a very simple sample transformation to test the null value behaviour: We use the Data Grid step to provide some sample dat...

Categories

  • "Bulk Loader"
  • "Bulk Loading"
  • "Hadoop"
  • "Kettle"
  • "Pentaho Book"
  • "Pentaho Data Integration"
  • "Pentaho Kettle"
  • "Pentaho Report Designer MDX MQL JDBC Parameters How To"
  • "Pentaho Report Designer MDX Parameters"
  • "Pentaho Report Designer MQL Parameters"
  • "Pentaho Report Designer Parmaters"
  • "Pentaho Report Designer"
  • "Pentaho Reporting 3.5 for Java Developers"
  • "Pentaho Reporting Book"
  • "Routing"
  • "Schema Workbench"
  • "Testing"
  • "Unicode"
  • "Unit testing"
  • "UTF8"
  • Agile development
  • automated testing
  • Big Data
  • Book Review
  • C-Tools
  • CBF
  • Clustered transformation
  • Command Line Arguments
  • Community Build Framework
  • D3JS
  • Dashboarding
  • Data Integration
  • Data Warehouse
  • Database Change Management
  • Database Version Control
  • Date Dimension
  • DBFit
  • ETL
  • ETLFit
  • Federated database
  • Google Charts
  • Google Visualization API
  • Hadoop
  • HTML5
  • iReport
  • JasperReports
  • JasperSoft
  • JasperStudio
  • Kettle
  • Kimball
  • Loop
  • Master data management
  • Metadata
  • Metedata editor
  • Mondrian
  • multidimensional modeling
  • OLAP
  • Open Source
  • Parameter
  • Parameters
  • Pentaho
  • Pentaho BI Server
  • Pentaho Data Integration
  • Pentaho Data Integration 4 Cookbook
  • Pentaho Kettle
  • Pentaho Metadata Editor Tutorial
  • Pentaho Report Designer
  • PostgreSQL
  • PRD
  • Report Layout
  • REST
  • Routing
  • Saiku
  • Scheduling
  • Slowly Changing Dimension
  • Sqitch
  • SVG
  • Talend
  • Talend MDM
  • Talend Open Studio
  • Tutorial
  • Variable
  • Web service
  • Xactions

Blog Archive

  • ►  2013 (24)
    • ►  December (2)
    • ►  November (3)
    • ►  October (2)
    • ►  September (1)
    • ►  August (3)
    • ►  July (2)
    • ►  June (1)
    • ►  May (2)
    • ►  April (1)
    • ►  March (3)
    • ►  February (1)
    • ►  January (3)
  • ►  2012 (20)
    • ►  November (3)
    • ►  October (3)
    • ►  August (1)
    • ►  June (1)
    • ►  April (1)
    • ►  March (3)
    • ►  February (5)
    • ►  January (3)
  • ►  2011 (19)
    • ►  November (3)
    • ►  July (2)
    • ►  June (1)
    • ►  May (4)
    • ►  April (2)
    • ►  March (1)
    • ►  February (3)
    • ►  January (3)
  • ►  2010 (17)
    • ►  December (1)
    • ►  November (6)
    • ►  September (1)
    • ►  August (1)
    • ►  June (2)
    • ►  May (1)
    • ►  April (3)
    • ►  February (1)
    • ►  January (1)
  • ▼  2009 (18)
    • ►  December (3)
    • ►  November (1)
    • ►  October (5)
    • ▼  September (7)
      • Pentaho Report Designer 3.5: Not mand...
      • Pentaho Kettle: Using Full Outer Join
      • Full Review of "Pentaho Reporting 3.5 for Java Dev...
      • New books arrived
      • Review "Pentaho Reporting 3.5 for Java Developers"...
      • Review coming soon ...
      • New Books on Pentaho
    • ►  July (2)
Powered by Blogger.

About Me

Unknown
View my complete profile