Using regular expressions with Pentaho Data Integration (Kettle)
There are quite some transformations steps that allow you to work with regular expressions: Replace in string, Regex Evaluation, Modified Java Script Value to name a few.
Today we will look at one (pretty unrealistic) example using the "Replace in string" step. Basically, we will cut the first word off a text (Note: This can be achieved by using other steps as well, but as we want to focus on regular expressions, this approach was chosen):
Our first step is of the type "Data Grid", which easily allows us to define a data set to work with. In our case, we create a data point called "line_text" and add values to it: In the Meta tab you can define your data points and in the Data tab the actual values. Hit Preview to get an idea how your data set looks like (no surprises here).
To cut the first word off our line_text values, we use a "Replace in string" step. Choose line_text as "In stream field". You can leave "Out stream field" empty in case you want the output to go to line_text (so the original value will be overwritten with the new one). If you want to keep the old value, define a new data point name here.
For this example we will use a regular expression, so make sure you set "use RegEx" to Y. We will use following expression:
^([a-zA-z]+)(\s{1})(.*)$
This expression makes use of capturing groups (defined by brackets), which will allow us to specify the part that we want to keep. What this regular expression says is basically the following: the value should start with (indicated by ^) any combination of upper or smaller letters (one or more of them indicated by +). This is followed by one space character (indicated by \s{1}) and then basically any character or none can follow. Probably we could even improve this expression, but for now it does what it is supposed to do.
In the "Replace with" field we enter $3, which basically tells Kettle that we want to keep the part that is retrieved by the 3rd capturing group.
Finally, we output the results to a text file. The output looks like this:
If you are creating extremely long regular expressions, named capturing groups are quite useful. As the name indicates, you can assign a name to the capturing group and use this name later on in the back reference. Currently, Kettle doesn't support named capturing groups, but I am sure a future version will cover this.
0 comments:
Post a Comment