
Tuesday, 23 July 2013

Introducing the Kettle Test Framework Beta

Kettle Test Framework (KTF)

Subtitle: Kettle Testing for Non-Java Developers

Announcing the KTF Beta.

Precautions



Please note that KTF is still in beta and has undergone only minimal testing. Please report any bugs on the dedicated Github page so that they can be fixed for everyone's benefit. Do not use it for any production purposes.


You must not run this process in a production environment! Run it only in a dedicated test environment, against a dedicated test database where it is OK to lose all the data: this process wipes out all your tables!

The small print upfront

Please note that this is a community contribution and is not associated with Pentaho. Use this framework at your own risk. The author makes no guarantees of any kind and cannot be held responsible for any negative impact.


You should have a solid understanding of Pentaho Data Integration/Kettle. I have made only a minimal attempt to document this framework; the rest is up to you to explore and understand.

Motivation

The main idea behind this framework is to create a basis for testing best practices when working with the Kettle ETL tool. Please add any ideas that could improve this test framework as an “improvement” on the dedicated Github page.


Code and samples can be downloaded from Github.


Testing data integration processes should be a core part of your activities. Unfortunately, especially for non-Java developers, this is not quite so straightforward (even for Java developers it is not that easy to unit test their ETL processes, as highlighted here). This framework tries to fill this gap by using standard Kettle transformations and jobs to run a test suite.

When you create or change a data integration process, you want to be able to check whether the output dataset(s) match the ones you expect (the "golden" dataset(s)). Ideally, this process should be automated as well. By using KTF's standard Kettle transformations and jobs to do this comparison, every data integration architect should be in a position to perform this essential task.


Other community members have published blog posts on testing before, from which this framework took many ideas and much inspiration (especially Dan Moore’s excellent blog posts [posts, github]). Some books on Agile BI methodologies were quite inspirational as well (especially Ken Collier’s “Agile Analytics”).
While Dan focused on a completely file-based setup, for now I tried to create a framework that works with processes (jobs, transformations) which make use of Table input and Table output steps. In the next phase the aim is to support file-based input and output (CSV, TXT) as well. Other features are listed below.

Contribute!

Report bugs and improvements/ideas on Github.


Features

Let’s have a look at the main features:
  • Supports multiple input datasets
  • Supports multiple output datasets
  • Supports sorting of the output dataset so that a good comparison to the golden output dataset can be made
  • The setup is highly configurable (but a certain structure is enforced - outlined below)
  • Non-conflicting parameter/variable names (all prefixed with “VAR_KTF_”)
  • Non-intrusive: it just wraps around your existing process files (except that some parameters for DB connections etc. have to be defined … but you have probably done this already anyway)

Current shortcomings

  • Not fully tested (only tested with the included samples and on PostgreSQL)
  • Currently works only with Table input / Table output steps. Text/CSV file input and output will be supported in a future version (which should not be too complicated to add).
  • Dictates quite a strict folder structure
  • Limited documentation

Project Folder Structure

Stick to this directory layout for now. In future versions I might make this more flexible.
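
For orientation, here is a rough sketch of the layout implied by the steps below (only the folders and files mentioned in this post are shown; the samples on Github are the authoritative reference):

<project root>
├── config
│   ├── .kettle
│   │   └── kettle.properties
│   ├── simple-jndi
│   │   └── jdbc.properties
│   ├── set-project-variables.sh
│   └── test-cases
│       └── tr_category_sales        (one folder per process under test)
│           ├── input                (DDL and Kettle data type definitions)
│           ├── output               (DDL, data type and sort order definitions)
│           └── simpletest           (one folder per test case)
│               ├── input            (input dataset(s) for this test case)
│               └── output           (golden output dataset(s))
└── repository
    ├── main                         (your own jobs "jb_*" and transformations "tr_*")
    └── test                         (the KTF test transformations and jobs)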


  1. Adjust parameter values in config/.kettle/kettle.properties and JNDI details in config/simple-jndi/jdbc.properties
  2. Your transformations have to be prefixed with “tr_” for now and your jobs with “jb_”. Do not use any special characters or spaces in your job/transformation names. Your transformations and jobs have to be saved within repository/main. No subfolder structure is currently allowed within this folder.
  3. Within the config/test-cases directory create a folder for each process you want to test. A process can be a transformation or a job. Name these folders exactly like the job/transformation you want to test (just without the file extension). Each process folder must have an input and an output folder, which hold the DDL, the Kettle data type definitions and, in the case of the output, the sort order definition (see the tr_category_sales sample for how to set this up). If your output dataset does not require any sorting, create an empty sort def file (see the tr_fact_sales example). Note that KTF can handle more than one input/output dataset.
  4. The process folder must also contain at least one test case folder (which should have a descriptive name). In the folder sketch above it is called “simpletest”. A test case folder must contain an input and an output folder which each hold the dedicated datasets for this particular test case. The output folder holds the golden output dataset(s), i.e. the dataset(s) you want to compare your ETL output results against.
  5. Users working on Windows: For all the CSV output steps in transformations under /repository/test set the Format to Windows (Content tab). KTF has not been tested at all on Windows, so you might have to make some other adjustments as well.
  6. Add the environment variables defined in config/set-project-variables.sh to your .bashrc and then run: source ~/.bashrc (see the sketch after this list).
  7. Start Spoon (this setup requires PDI V5) or run the process from the command line.
  8. Run the test suite.
  9. Analyze the results in the tmp folder. If there is an error file for a particular test case, you can easily inspect the differences visually like this:
dsteiner@dsteiner-Aspire-5742:/tmp/kettle-test/tr_category_sales/sales_threshold$ diff fact_sales_by_category.csv fact_sales_by_category_golden.csv
2c2
< 2013-01-01;Accessories;Mrs Susi Redcliff;399;Yes
---
> 2013-01-01;Accessories;Mrs Susi Redcliff;399;No
5c5
< 2013-01-02;Groceries;Mr James Carrot;401;No
---
> 2013-01-02;Groceries;Mr James Carrot;401;Yes
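
As a minimal sketch of steps 6 to 8 on Linux (the project path, PDI installation directory and test suite job name below are illustrative only; use the actual paths and file names from your checkout of the Github repository):

# make the KTF project variables available in every shell (path is illustrative)
echo 'source ~/projects/kettle-test-framework/config/set-project-variables.sh' >> ~/.bashrc
source ~/.bashrc

# run the test suite headless with Kitchen (PDI V5); jb_test_suite.kjb is a
# hypothetical name - use the actual wrapper job shipped in repository/test
cd ~/pdi-ce-5.0
sh kitchen.sh -file=$HOME/projects/kettle-test-framework/repository/test/jb_test_suite.kjb -level=Basic

# inspect the results written to the tmp folder (see the diff example above)
ls /tmp/kettle-test/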


All files related to testing (KTF) are stored in repository/test. You should not have to alter them unless you find a bug or want to modify their behaviour.


To get a better idea of how this works, look at the included examples, especially tr_category_sales (which has multiple inputs and outputs and proper sorting). The other example, tr_fact_sales, has only one input and one output and no sorting defined (as it only outputs one figure).

Future improvements

The following improvements are on my to-do list:
  • Write test results to a dedicated database table
  • Improvement of folder structure
  • Support for text file input and output for main processes (jobs/transformations)

FAQ

My input data sets come from more than one data source. How can I test my process with the KTF?

Your process must use parameter-driven database connections. This way you can easily point your various JNDI connections to just one testing input database. The main purpose of testing is to make sure that the output is as expected, not to test various input database connections. Hence, for testing, you can “reduce” your multiple input connections to one.
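
As an illustration, here is a minimal sketch of a single test data source in config/simple-jndi/jdbc.properties, assuming PostgreSQL as the test database (the connection name "ktf_test", host, database and credentials are placeholders):

# config/simple-jndi/jdbc.properties - one JNDI data source for all test inputs
ktf_test/type=javax.sql.DataSource
ktf_test/driver=org.postgresql.Driver
ktf_test/url=jdbc:postgresql://localhost:5432/ktf_test
ktf_test/user=ktf
ktf_test/password=ktf

If the JNDI name in your database connections is supplied via a parameter (for example a variable like ${VAR_DB_JNDI}; the name is illustrative), switching all inputs to this test data source is just a matter of changing one parameter value for the test run.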
