Community Server

  • Subscribe to our RSS feed.
  • Twitter
  • StumbleUpon
  • Reddit
  • Facebook
  • Digg

Saturday, 27 November 2010

Kettle: Handling Dates with Regular Expression

Posted on 10:47 by Unknown

Kettle: Handling Dates with Regular Expression

This is the second tutorial that focuses on using regular expressions with Pentaho Kettle (PDI). This is again a demonstration on how powerful regular expression acutally are.

In this example we are dealing with a simplified data set containing date values. The problem is, that the date doesn't have a standard format. We have following values:

date
2010/1/2
2010/01/2
2010/1/02
2010/01/02
20100102

We assume for now, that all follow at least this basic standard: year, month, day. Now, we need to somehow generate a standard date in the format yyyy-MM-dd using the value from the date field as input.

Regular expressions are of much help here, as we can use the concept of capturing groups. Capturing groups will allow us to retrieve the year, month and day parts of the date easily, then we build together our final date (we want it in the yyyy-MM-dd format). Now let's see how this is done:
For our convenience, we use the data grid step to store our dummy dates for testing:

Now we are ready to go ahead and work on our regular expression. Insert the Regex Evaluation step from the Scripting folder and configure it as shown in the screenshot below:
On the content tab make sure that you select "Permit whitespace and comments in pattern". Now let's have a look at the regular expression:

  1. The string must start with 4 numbers. We enclose the definition by brackets to create the first capturing group. Note: I added #1. This is a comment and helps to mark the capturing groups for easy reference.
  2. Next we say that a dash can follow or not. This is our 2nd capturing group.
  3. I guess you get the idea for the remaining capturing groups. In the end we make sure that nothing else follows, hence we use the dollar sign.
Once we have created our capturing groups, we can reference them in the "Capture Group Fields". First make sure that "Create fields from capture" is activated. Then fill out the fields below as shown.

This step will basically get the year, month and day parts. 

Now we have to build the date together. We can use the formula step therefore:

First we check if a value exists, if not, we set the field to 0s. In case a value exists, we check if the month and day part have a leading zero, if not, we add it. Finally, we build the whole date string together.

Now, it is not a very good approach to save non standard 0000-00-00 dates in the database (i.e.), hence we use the "Null if ..." step to set records with 0000-00-00 date to null:


The final step left to do is to convert the string to a proper date. You can do this with the "Select values" step. Go to the Meta-data tab and fill it out as shown below:
Save the transformation and run it. Now let's have a look at the output:

You can see that it is quite simple to handle non standardized data with regular expressions. You can download the transformation here.
Email ThisBlogThis!Share to XShare to Facebook
Posted in | No comments
Newer Post Older Post Home

0 comments:

Post a Comment

Subscribe to: Post Comments (Atom)

Popular Posts

  • Pentaho Kettle Parameters and Variables: Tips and Tricks
    Pentaho Kettle Parameters and Variables: Tips and Tricks This blog post is not intended to be a formal introduction to using parameters and ...
  • Using Parameters in Pentaho Report Designer
    Using Parameters in Pentaho Report Designer Introduction How to define a parameter Additional info about the new parameter dialog Hidden Par...
  • Pentaho Data Integration: Scheduling and command line arguments
    Pentaho Data Integration (Kettle): Command line arguments and scheduling Tutorial Details Software: PDI/Kettle 4.1 (download here ), MySQL S...
  • Jaspersoft iReport: How to pass a parameter to a sub-dataset
    Jaspersoft iReport: How to pass a parameter to a sub-dataset Let’s say our main report is grouped by continent and the details band holds sa...
  • Using regular expressions with Pentah...
    Using regular expressions with Pentaho Data Integration (Kettle) There are quite some transformations steps that allow you to work with regu...
  • Pentaho Data Integration and Infinidb Series: Bulk Upload
    Pentaho Data Integration and InfiniDB Series: Bulk Upload Introduction Prepare Tables Using mainly Kettle steps Check if file exists Setup I...
  • Pentaho Data Integration: Remote execution with Carte
    Pentaho Data Integration: Remote execution with Carte Tutorial Details Software: PDI/Kettle 4.1 (download  here ), installed on your PC and ...
  • How to create a loop in Pentaho Kettle
    I finished my first ever video tutorial! This video will demonstrate you how easy it is to create a loop in Pentaho Kettle. Enjoy!
  • Metadata Driven ETL and Reporting
    Metadata Driven ETL and Reporting with Pentaho Data Integration and Report Designer Tutorial Details  Software : If PDI Kettle 4.2 GA and PR...
  • Understanding the Pentaho Kettle Dimension Insert/Update Step Null Value Behaviour
    We will be using a very simple sample transformation to test the null value behaviour: We use the Data Grid step to provide some sample dat...

Categories

  • "Bulk Loader"
  • "Bulk Loading"
  • "Hadoop"
  • "Kettle"
  • "Pentaho Book"
  • "Pentaho Data Integration"
  • "Pentaho Kettle"
  • "Pentaho Report Designer MDX MQL JDBC Parameters How To"
  • "Pentaho Report Designer MDX Parameters"
  • "Pentaho Report Designer MQL Parameters"
  • "Pentaho Report Designer Parmaters"
  • "Pentaho Report Designer"
  • "Pentaho Reporting 3.5 for Java Developers"
  • "Pentaho Reporting Book"
  • "Routing"
  • "Schema Workbench"
  • "Testing"
  • "Unicode"
  • "Unit testing"
  • "UTF8"
  • Agile development
  • automated testing
  • Big Data
  • Book Review
  • C-Tools
  • CBF
  • Clustered transformation
  • Command Line Arguments
  • Community Build Framework
  • D3JS
  • Dashboarding
  • Data Integration
  • Data Warehouse
  • Database Change Management
  • Database Version Control
  • Date Dimension
  • DBFit
  • ETL
  • ETLFit
  • Federated database
  • Google Charts
  • Google Visualization API
  • Hadoop
  • HTML5
  • iReport
  • JasperReports
  • JasperSoft
  • JasperStudio
  • Kettle
  • Kimball
  • Loop
  • Master data management
  • Metadata
  • Metedata editor
  • Mondrian
  • multidimensional modeling
  • OLAP
  • Open Source
  • Parameter
  • Parameters
  • Pentaho
  • Pentaho BI Server
  • Pentaho Data Integration
  • Pentaho Data Integration 4 Cookbook
  • Pentaho Kettle
  • Pentaho Metadata Editor Tutorial
  • Pentaho Report Designer
  • PostgreSQL
  • PRD
  • Report Layout
  • REST
  • Routing
  • Saiku
  • Scheduling
  • Slowly Changing Dimension
  • Sqitch
  • SVG
  • Talend
  • Talend MDM
  • Talend Open Studio
  • Tutorial
  • Variable
  • Web service
  • Xactions

Blog Archive

  • ►  2013 (24)
    • ►  December (2)
    • ►  November (3)
    • ►  October (2)
    • ►  September (1)
    • ►  August (3)
    • ►  July (2)
    • ►  June (1)
    • ►  May (2)
    • ►  April (1)
    • ►  March (3)
    • ►  February (1)
    • ►  January (3)
  • ►  2012 (20)
    • ►  November (3)
    • ►  October (3)
    • ►  August (1)
    • ►  June (1)
    • ►  April (1)
    • ►  March (3)
    • ►  February (5)
    • ►  January (3)
  • ►  2011 (19)
    • ►  November (3)
    • ►  July (2)
    • ►  June (1)
    • ►  May (4)
    • ►  April (2)
    • ►  March (1)
    • ►  February (3)
    • ►  January (3)
  • ▼  2010 (17)
    • ►  December (1)
    • ▼  November (6)
      • How to Set Up Pentaho Community Build Framework
      • Kettle: Handling Dates with Regular Expression
      • Review "Pentaho Kettle Solutions"
      • Pentaho Kettle Data Input: Pivoted Data
      • Using regular expressions with Pentah...
      • PDI Kettle Plugins
    • ►  September (1)
    • ►  August (1)
    • ►  June (2)
    • ►  May (1)
    • ►  April (3)
    • ►  February (1)
    • ►  January (1)
  • ►  2009 (18)
    • ►  December (3)
    • ►  November (1)
    • ►  October (5)
    • ►  September (7)
    • ►  July (2)
Powered by Blogger.

About Me

Unknown
View my complete profile