
Monday, 23 May 2011

Tutorial Aggregated Tables for Mondrian (Web)

Posted on 06:39 by Unknown
Tutorial Aggregated Tables for Mondrian
(Republished Tutorial - originally published July 2009)

For better readability please open this article as Google Document here.
  1. Requirements
  2. Know what you are up to
    1. How to enable MDX and SQL logging on BI Server 3
    2. How to configure Mondrian to recognize aggregated tables
    3. Table Names
    4. Column Names
  3. Your first aggregated tables
    1. Aggregated table for login channel
    2. Aggregated table for country
  4. Make use of aggregates in other programs

I started using the Pentaho BI Suite in early 2009. One of my main objectives was to create a multidimensional model for DSS (Decision Support Systems) and analysis. As the BI Server already comes with Mondrian, it is fairly easy to start off.
I created my first cube definition with the Schema Workbench. The cube definition can be saved as an XML file, which you can then export to your solution folder on the BI Server.
Soon we realized that we would need aggregated tables, as some MDX queries took rather long. I had a look at the Pentaho Aggregation Designer 1.1 and thought that this would be a rather easy task.
One thing the Aggregation Designer does is check whether your schema (the cube definition XML file) is valid; it also checks whether your tables are all properly set up. This is very useful functionality, as it shows you any errors so that you can correct them.
I used the Advisor function within the Aggregation Designer, created all the recommended tables with the DDL and DML that it generates automatically, and exported the new cube schema as well.
But then nothing worked. I had assumed that the Aggregation Designer would do everything for me, but I was quite wrong about that (at the time of writing, Aggregation Designer 1.1 was available), or maybe I was just doing something wrong. After some unsuccessful attempts I decided not to use the Aggregation Designer but to create everything from scratch. I am not saying that you shouldn't use the Aggregation Designer; it is definitely worth having a look at. I am just describing the approach that worked for me. I am also assuming that you use the standard configuration of Mondrian.

Requirements

This article assumes that you are familiar with
  • MDX
  • MySQL (or SQL in general)
  • Pentaho BI Suite
  • Pentaho Schema Workbench
  • Pentaho Design Studio

Know what you are up to

While creating aggregated tables it makes sense to test whether they are actually used for the purpose you are creating them for. You might have an MDX query that takes quite long to execute and therefore want to create a dedicated aggregated table for it. Have a look at mondrian_sql.log to see which SQL query Mondrian actually generates for your MDX; your aggregated table will look quite similar to it.
When you execute MDX queries, always have a look at these logs in the beginning; you will learn a lot!

How to enable MDX and SQL logging on BI Server 3

Whenever an MDX query is executed, Mondrian transforms it into a SQL query. In some cases you want to see what the resulting SQL query is, for example to check if Mondrian is using aggregated tables.

Go to \tomcat\webapps\pentaho\WEB-INF\classes and edit log4j.xml. Just uncomment the sections in the second half of the file where it says:
  
   
   <!-- ========================================================= -->
   <!-- Special Log File specifically for Mondrian                -->
   <!-- ========================================================= -->


   <appender name="MONDRIAN" class="org.apache.log4j.RollingFileAppender">
     <param name="File" value="mondrian.log"/>
     <param name="Append" value="false"/>
     <param name="MaxFileSize" value="500KB"/>
     <param name="MaxBackupIndex" value="1"/>

     <layout class="org.apache.log4j.PatternLayout">
       <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
     </layout>
   </appender>
   
   <category name="mondrian">
      <priority value="DEBUG"/>
      <appender-ref ref="MONDRIAN"/>
   </category> 
   


   <!-- ========================================================= -->
   <!-- Special Log File specifically for Mondrian MDX Statements -->
   <!-- ========================================================= -->

   <appender name="MDXLOG" class="org.apache.log4j.RollingFileAppender">
     <param name="File" value="mondrian_mdx.log"/>
     <param name="Append" value="false"/>
     <param name="MaxFileSize" value="500KB"/>
     <param name="MaxBackupIndex" value="1"/>
     <layout class="org.apache.log4j.PatternLayout">
       <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
     </layout>
   </appender>

   <category name="mondrian.mdx">
      <priority value="DEBUG"/>
      <appender-ref ref="MDXLOG"/>
   </category>


   <!-- ========================================================= -->
   <!-- Special Log File specifically for Mondrian SQL Statements -->
   <!-- ========================================================= -->


   <appender name="SQLLOG" class="org.apache.log4j.RollingFileAppender">
     <param name="File" value="mondrian_sql.log"/>
     <param name="Append" value="false"/>
     <param name="MaxFileSize" value="500KB"/>
     <param name="MaxBackupIndex" value="1"/>
     <layout class="org.apache.log4j.PatternLayout">
       <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
     </layout>
   </appender>

   <category name="mondrian.sql">
      <priority value="DEBUG"/>
      <appender-ref ref="SQLLOG"/>
   </category>

If you want the SQL to be displayed in a more readable form, add this to the mondrian.properties in pentaho-solutions\system\mondrian (BI Server 3):

mondrian.rolap.generate.formatted.sql=true

Restart the server, then run an xaction that uses JPivot/MDX. In tomcat/bin you will now find a mondrian.log, a mondrian_sql.log and a mondrian_mdx.log.

How to configure Mondrian to recognize aggregated tables

You also have to tell Mondrian that the aggregated tables exist. The first step is to add the following lines to mondrian.properties, which you can find in pentaho-solutions\system\mondrian (BI Server 3):

mondrian.rolap.aggregates.Use=true
mondrian.rolap.aggregates.Read=true

Restart the server.
Furthermore, with the standard setup the table names have to follow a convention (I've copied here a bit from Julian Hyde's excellent documentation):

Table Names

As an example of applying the aggregate table name rule, let the fact table be called sales_fact_1997, the Sales cube's fact table from the FoodMart schema. Applying the specific fact table name to the regular expression template creates the following regular expression:

agg_.+_sales_fact_1997

This will match the following table names:
  • agg_l_05_sales_fact_1997
  • agg_c_14_sales_fact_1997
  • agg_lc_100_sales_fact_1997
  • agg_c_special_sales_fact_1997
  • AGG_45_SALES_FACT_1997
  • AGG_drop_time_id_sales_fact_1997
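
If you want to double-check which tables in your database match this naming convention, one quick way is to query MySQL's information_schema (a minimal sketch; 'foodmart' is only an example schema name, replace it with your own database):

SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'foodmart'
AND table_name REGEXP '^agg_.+_sales_fact_1997$';

Note that MySQL's REGEXP match is case-insensitive by default, so upper-case variants such as AGG_45_SALES_FACT_1997 are matched as well.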


Column Names

  • There has to be one column called "fact_count", which stores values of a standard count.
  • If there are any ID columns in the aggregated table, they have to be of the same name as in the fact or dimensional table.
  • Level columns: 
    • ${usage_prefix}${level_column_name}: If you have a level that is used more than once, e.g. Year in both a week hierarchy and a month hierarchy within the time dimension, then set a "usagePrefix" in the schema so that the two can be distinguished, e.g. usage prefix "TW_" for the year in the week hierarchy (results in "TW_Year") and "TM_" for the year in the month hierarchy (results in "TM_Year").
    • ${hierarchy_name}_${level_name}
    • ${hierarchy_name}_${level_column_name}
    • ${level_column_name}

If any of these parameters have space characters, ' ', these are mapped to underscore characters, '_', and, similarly, dot characters, '.', are also mapped to underscores. So, if the hierarchy_name is "Time", level_name is "Month" and level_column_name is month_of_year, the possible aggregate table column names are:

  • time_month

  • time_month_of_year

  • month_of_year

For this rule, the "hierarchy_name" and "level_name" are converted to lower case while the "level_column_name" must match exactly.

Make sure the column names are in lower case!

There are three patterns for matching aggregate columns to measures:

  • ${measure_name}: the fact table's cube's measure name ("measure_name"),

  • ${measure_column_name}: the fact table's cube's measure column name ("measure_column_name"), and

  • ${measure_column_name}_${aggregate_name}: the measure column name followed by the name of the measure's aggregator ("aggregate_name", e.g. sum, avg, max),

where the measure name is converted to lower case and both the measure column name and the aggregate name are matched as they appear. If the fact table's cube's measure name is "Avg Unit Sales", the measure column name is "unit_sales" and, lastly, the measure's aggregate name is "avg", then the possible aggregate table column names that would match are:
  • avg_unit_sales

  • unit_sales

  • unit_sales_avg


For example: a measure called "Distinct Users" has to be referenced as "distinct_users".

Your first aggregated tables

We have the following tables:

The fact table:

DROP TABLE IF EXISTS `pentaho`.`sn_login`;
CREATE TABLE  `pentaho`.`sn_login` (
  `msisdn` bigint(20) NOT NULL DEFAULT '0',
  `login_date_id` bigint(20) NOT NULL DEFAULT '0',
  `login_channel_id` bigint(20) NOT NULL DEFAULT '6' COMMENT 'Default Value Unknown',
  `sessions` int(255) NOT NULL DEFAULT '0',
  PRIMARY KEY (`msisdn`,`login_date_id`,`login_channel_id`),
  KEY `FK_sn_login_date_id` (`login_date_id`),
  KEY `FK_login_channel_id` (`login_channel_id`),
  KEY `login_msisdn` (`msisdn`),
  CONSTRAINT `FK_login_channel_id` FOREIGN KEY (`login_channel_id`) REFERENCES `sn_lt_communication_channels` (`communication_channel_id`) ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `FK_login_msisdn` FOREIGN KEY (`msisdn`) REFERENCES `sn_user` (`msisdn`),
  CONSTRAINT `FK_sn_login_date_id` FOREIGN KEY (`login_date_id`) REFERENCES `sn_lt_time_dimension` (`time_dimension_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The time dimension table:

DROP TABLE IF EXISTS `pentaho`.`sn_lt_time_dimension`;
CREATE TABLE  `pentaho`.`sn_lt_time_dimension` (
  `the_year` int(11) DEFAULT NULL,
  `the_week_of_the_year` int(11) DEFAULT NULL,
  `the_month` int(11) DEFAULT NULL,
  `the_day_of_the_year` int(11) DEFAULT NULL,
  `the_day_of_the_week` int(11) DEFAULT NULL,
  `the_day_of_the_month` int(11) DEFAULT NULL,
  `the_month_name` varchar(9) DEFAULT NULL,
  `tenure_years` int(11) DEFAULT NULL,
  `tenure_days` int(11) DEFAULT NULL,
  `time_dimension_id` bigint(20) NOT NULL DEFAULT '0',
  `the_date` date DEFAULT NULL,
  `the_date_as_string` varchar(20) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  `the_quarter` int(3) DEFAULT NULL,
  PRIMARY KEY (`time_dimension_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


The channel dimension table:

DROP TABLE IF EXISTS `pentaho`.`sn_lt_communication_channels`;
CREATE TABLE  `pentaho`.`sn_lt_communication_channels` (
  `communication_channel_id` bigint(20) NOT NULL AUTO_INCREMENT,
  `channel_name` varchar(45) NOT NULL,
  `channel_system_name` varchar(45) NOT NULL,
  PRIMARY KEY (`communication_channel_id`)
) ENGINE=InnoDB AUTO_INCREMENT=9 DEFAULT CHARSET=latin1;

The user dimension table:

DROP TABLE IF EXISTS `pentaho`.`sn_user`;
CREATE TABLE  `pentaho`.`sn_user` (
  `msisdn` bigint(20) NOT NULL DEFAULT '0',
  `sex` char(1) DEFAULT NULL,
  `registration_date_id` bigint(20) NOT NULL DEFAULT '0',
  `country_id` bigint(20) NOT NULL DEFAULT '0',
  `birthday_date_id` bigint(20) NOT NULL,
  `registration_channel_id` bigint(20) NOT NULL,
  PRIMARY KEY (`msisdn`),
  KEY `msisdn` (`msisdn`),
  KEY `FK_birthday_date_id` (`birthday_date_id`),
  KEY `FK_registration_date_id` (`registration_date_id`),
  KEY `FK_country_id_2` (`country_id`),
  KEY `FK_registration_channel_id` (`registration_channel_id`),
  CONSTRAINT `FK_birthday_date_id` FOREIGN KEY (`birthday_date_id`) REFERENCES `sn_lt_time_dimension` (`time_dimension_id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
  CONSTRAINT `FK_country_id_2` FOREIGN KEY (`country_id`) REFERENCES `sn_lt_countries` (`country_id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
  CONSTRAINT `FK_registration_channel_id` FOREIGN KEY (`registration_channel_id`) REFERENCES `sn_lt_communication_channels` (`communication_channel_id`) ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `FK_registration_date_id` FOREIGN KEY (`registration_date_id`) REFERENCES `sn_lt_time_dimension` (`time_dimension_id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The country dimension table:

DROP TABLE IF EXISTS `pentaho`.`sn_lt_countries`;
CREATE TABLE  `pentaho`.`sn_lt_countries` (
  `country_id` bigint(20) NOT NULL DEFAULT '0',
  `country_name` tinytext,
  `country_number` double DEFAULT NULL,
  `iso_country_code` tinytext,
  PRIMARY KEY (`country_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


We have created an XML schema (cube definition) that looks like this:

<Schema name="Test">
<Dimension type="TimeDimension" usagePrefix="TM_" name="Time Monthly">
   <Hierarchy name="All Times Monthly" hasAll="true" allMemberName="All Times Monthly" allMemberCaption="All Times Monthly" primaryKey="time_dimension_id">
       <Table name="sn_lt_time_dimension">
       </Table>
       <Level name="Year" column="the_year" type="Integer" uniqueMembers="true" levelType="TimeYears" hideMemberIf="Never">
       </Level>
       <Level name="Quarter" column="the_quarter" type="Integer" uniqueMembers="false" levelType="TimeQuarters" hideMemberIf="Never">
       </Level>
       <Level name="Month" column="the_month" nameColumn="the_month_name" type="String" uniqueMembers="false" levelType="TimeMonths" hideMemberIf="Never">
       </Level>
       <Level name="Day" column="the_day_of_the_month" type="Integer" uniqueMembers="false" levelType="TimeDays" hideMemberIf="Never">
       </Level>
   </Hierarchy>
</Dimension>
<Dimension type="StandardDimension" name="Country">
   <Hierarchy name="All Countries" hasAll="true" allMemberName="All Countries" allMemberCaption="All Countries" primaryKey="country_id">
       <Table name="sn_lt_countries">
       </Table>
       <Level name="Country" column="country_name" type="String" uniqueMembers="false" levelType="Regular" hideMemberIf="Never">
       </Level>
   </Hierarchy>
</Dimension>
<Dimension type="TimeDimension" usagePrefix="TW_" name="Time Weekly">
   <Hierarchy name="All Times Weekly" hasAll="true" allMemberName="All Times Weekly" allMemberCaption="All Times Weekly" primaryKey="time_dimension_id">
       <Table name="sn_lt_time_dimension">
       </Table>
       <Level name="Year" column="the_year" type="Integer" uniqueMembers="true" levelType="TimeYears" hideMemberIf="Never">
       </Level>
       <Level name="Week" column="the_week_of_the_year" type="Integer" uniqueMembers="false" levelType="TimeWeeks" hideMemberIf="Never">
       </Level>
       <Level name="Day" column="the_day_of_the_week" type="Numeric" uniqueMembers="false" levelType="TimeDays" hideMemberIf="Never">
       </Level>
   </Hierarchy>
</Dimension>
<Cube name="Users" cache="true" enabled="true">
   <Table name="sn_user">
   </Table>
   <Dimension type="StandardDimension" name="Gender">
       <Hierarchy name="All Genders" hasAll="true" allMemberName="All Genders" allMemberCaption="All Genders">
           <Level name="Gender" column="sex" nameColumn="sex" type="String" uniqueMembers="true" levelType="Regular" hideMemberIf="Never">
           </Level>
       </Hierarchy>
   </Dimension>
   <Dimension type="StandardDimension" foreignKey="registration_channel_id" name="Registration Channel">
       <Hierarchy name="All Registration Channels" hasAll="true" allMemberName="All Registration Channels" allMemberCaption="All Registration Channels" primaryKey="communication_channel_id">
           <Table name="sn_lt_communication_channels">
           </Table>
           <Level name="Registration Channel" column="channel_name" type="String" uniqueMembers="false" levelType="Regular" hideMemberIf="Never">
           </Level>
       </Hierarchy>
   </Dimension>
   <Dimension type="StandardDimension" foreignKey="birthday_date_id" name="Age">
       <Hierarchy name="All Ages" hasAll="true" allMemberName="All Ages" allMemberCaption="All Ages" primaryKey="time_dimension_id">
           <Table name="sn_lt_time_dimension">
           </Table>
           <Level name="Age" column="age" type="Integer" uniqueMembers="false" levelType="Regular" hideMemberIf="Never">
           </Level>
       </Hierarchy>
   </Dimension>
   <DimensionUsage source="Time Monthly" name="Registration Time Monthly" caption="Registration Time Monthly" foreignKey="registration_date_id">
   </DimensionUsage>
   <DimensionUsage source="Country" name="Country" caption="Country" foreignKey="country_id">
   </DimensionUsage>
   <DimensionUsage source="Time Weekly" name="Registration Time Weekly" caption="Registration Time Weekly" foreignKey="registration_date_id">
   </DimensionUsage>
   <Measure name="New Registrations" column="msisdn" datatype="Numeric" aggregator="count" visible="true">
   </Measure>
</Cube>

</Schema>

We are going to pay attention to three dimensions:
  • Country: You can see that this dimension is derived by joining three tables: sn_login, sn_user and sn_lt_countries (see the join sketch after this list). In fact, our schema is a "snowflake schema".
  • Login Channel: This is a simple dimension which is derived from just one join: sn_login and sn_lt_communication_channels.
  • Login Time: This dimension references a global dimension. Within a cube definition, a global dimension can be referenced by several cubes in the same schema. This is very handy, as you don't have to recreate the same dimension over and over again.
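
To make the snowflake visible, this is roughly the join path that has to be resolved for the Country dimension (a minimal hand-written sketch based on the table definitions above, not the SQL that Mondrian generates):

select 
`sn_lt_countries`.`country_name` as `country_name`, 
count(distinct `sn_login`.`msisdn`) as `distinct_users`
from 
`sn_login` as `sn_login`, 
`sn_user` as `sn_user`, 
`sn_lt_countries` as `sn_lt_countries`
where 
`sn_login`.`msisdn` = `sn_user`.`msisdn` AND
`sn_user`.`country_id` = `sn_lt_countries`.`country_id`
group by 1
;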

Aggregated table for login channel

We basically want to see how many users log in per week in each of the channels.

The MDX query therefore looks like:

WITH 
MEMBER [Measures].[Prior Period Distinct Logins] AS
(
[Login Time Weekly].CurrentMember.PrevMember,
[Measures].[Distinct Users]
)
MEMBER [Measures].[Change in Distinct Logins] AS
(
[Measures].[Distinct Users] - [Measures].[Prior Period Distinct Logins]
)
MEMBER [Measures].[Percent Change] AS
(
([Measures].[Change in Distinct Logins]) / 
([Measures].[Prior Period Distinct Logins])
)
 
, FORMAT_STRING = Iif(
([Measures].[Percent Change] < 0),
"|#.00%|style='red'|arrow='down'",
"|#.00%|style='green'|arrow='up'"
)
SELECT 
NON EMPTY { 

LastPeriods(
6,
[Login Time Weekly].[{week1_year}].[{week1_weeknr}]
)

} 
ON COLUMNS,
NON EMPTY Union (
Crossjoin (
{[Login Channel].[All Login Channels]},
{[Measures].[Distinct Users],
[Measures].[Percent Change]
}
),
Crossjoin (
{[Login Channel].[All Login Channels].Children},
{[Measures].[Distinct Users]
}
)
)
ON ROWS
from [Logins]

We create an Xaction with a JPivot viewer in the Design Studio, save it into a solution folder and execute it online. Now we open mondrian_sql.log and see that Mondrian creates a lot of queries. Analyse these queries carefully.

Having looked at the queries in mondrian_sql.log, I decided to create the following aggregated tables:

-- channel / week -------------------------------

DROP TABLE IF EXISTS `agg_channel_week_sn_login`;

CREATE TABLE `agg_channel_week_sn_login` (
`login_channel_id` BIGINT(20),
`TW_the_year` INTEGER(11),
`the_week_of_the_year` INTEGER(11),
`distinct_users` INTEGER,
`sessions` DOUBLE,
`fact_count` INTEGER,
KEY `agg_channel_week_login_channel_id` (`login_channel_id`),
KEY `agg_channel_week_TW_the_year` (`TW_the_year`),
KEY `agg_channel_week_the_week_of_the_year` (`the_week_of_the_year`)    
);

INSERT INTO `agg_channel_week_sn_login` (
`login_channel_id`,
`TW_the_year`,
`the_week_of_the_year`,
`distinct_users`,
`sessions`,
`fact_count`)
select 
`sn_login`.`login_channel_id` as `login_channel_id`, 
`sn_lt_time_dimension_3`.`the_year` as `TW_the_year`, 
`sn_lt_time_dimension_3`.`the_week_of_the_year` as `the_week_of_the_year`, 
count(distinct `sn_login`.`msisdn`) as `distinct_users`, 
sum(`sn_login`.`sessions`) as `sessions`, 
count(*) as `fact_count`
from 
`sn_login` as `sn_login`, 
`sn_lt_time_dimension` as `sn_lt_time_dimension_3`
where 
`sn_login`.`login_date_id` = `sn_lt_time_dimension_3`.`time_dimension_id`
group by 1,2,3
;
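
Before registering this table in the cube schema, it is worth spot-checking that the aggregate reproduces the figures of the base tables (a minimal sketch; for every channel/year/week combination the two counts should be identical):

select 
`agg`.`login_channel_id`, 
`agg`.`TW_the_year`, 
`agg`.`the_week_of_the_year`, 
`agg`.`distinct_users` as `agg_distinct_users`, 
count(distinct `sn_login`.`msisdn`) as `base_distinct_users`
from 
`agg_channel_week_sn_login` as `agg`, 
`sn_login` as `sn_login`, 
`sn_lt_time_dimension` as `t`
where 
`sn_login`.`login_date_id` = `t`.`time_dimension_id` AND
`sn_login`.`login_channel_id` = `agg`.`login_channel_id` AND
`t`.`the_year` = `agg`.`TW_the_year` AND
`t`.`the_week_of_the_year` = `agg`.`the_week_of_the_year`
group by 1,2,3,4
;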

And we also create a higher-level one:

-- week ---------------------------------

DROP TABLE IF EXISTS `agg_week_sn_login`;

CREATE TABLE `agg_week_sn_login` (
`TW_the_year` INTEGER(11),
`the_week_of_the_year` INTEGER(11),
`distinct_users` INTEGER,
`sessions` DOUBLE,
`fact_count` INTEGER,
KEY `agg_week_TW_the_year` (`TW_the_year`),
KEY `agg_week_the_week_of_the_year` (`the_week_of_the_year`)    
);

INSERT INTO `agg_week_sn_login` (
`TW_the_year`,
`the_week_of_the_year`,
`distinct_users`,
`sessions`,
`fact_count`)
select 
`sn_lt_time_dimension_3`.`the_year` as `TW_the_year`, 
`sn_lt_time_dimension_3`.`the_week_of_the_year` as `the_week_of_the_year`, 
count(distinct `sn_login`.`msisdn`) as `distinct_users`, 
sum(`sn_login`.`sessions`) as `sessions`, 
count(*) as `fact_count`
from 
`sn_login` as `sn_login`, 
`sn_lt_time_dimension` as `sn_lt_time_dimension_3`
where 
`sn_login`.`login_date_id` = `sn_lt_time_dimension_3`.`time_dimension_id`
group by 1,2
;

Pay special attention to the naming:
  • The year column is named TW_the_year. Why the "TW_" prefix, you wonder? This is because the year actually shows up twice in our cube schema: in the Time Weekly and in the Time Monthly dimension. We defined a usagePrefix in the cube schema in order to distinguish between these two occurrences. Here in the table definition you have to use this prefix, otherwise Mondrian is not able to distinguish between the two occurrences of the year level and will throw an error.
  • Every table has a column called fact_count, which is a simple count.
  • Most columns are in lower case; also, have you noticed how our measure "Distinct Users" has turned into "distinct_users"?
  • All columns have the same name as they have in the dimension table (the year obviously carries the prefix).
  • All the table names start with agg_*_yourtablename. This is very important!


Now we have to add the following lines to the cube schema (the cube definition) within the table tags in order to tell Mondrian which aggregated tables to use:

<Table name="sn_login">
       <AggName name="agg_channel_week_sn_login">
           <AggFactCount column="fact_count">
           </AggFactCount>
           <AggForeignKey factColumn="login_channel_id" aggColumn="login_channel_id"/>
           <AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
           </AggMeasure>
           <AggMeasure column="sessions" name="[Measures].[Sessions]">
           </AggMeasure>
           <AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
           </AggLevel>
           <AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
           </AggLevel>
       </AggName>
       <AggName name="agg_week_sn_login">
           <AggFactCount column="fact_count">
           </AggFactCount>
           <AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
           </AggMeasure>
           <AggMeasure column="sessions" name="[Measures].[Sessions]">
           </AggMeasure>
           <AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
           </AggLevel>
           <AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
           </AggLevel>
       </AggName>
</Table>
  
As you can see from this schema snippet, we are matching the column names of the aggregated table with the measures and dimensions that we defined in the schema itself. This way Mondrian can identify your columns and link them to a dimension or measure.

An important thing to notice here is that we actually don't have a level login_channel_id in the XML Schema (cube definition), but we can just use the AggForeignKey tag to reference it. Within the AggForeignKey you specify the name of the column in the fact table and the name of the same column in the aggregated table.

Now save everything, export it to your solution folder on the BI Server, go back to the Pentaho website, refresh the repository, clear the Mondrian cache and try to run the Xaction. In case you get an error message, check the Mondrian log files to figure out what went wrong. It is very easy to make mistakes, especially with the spelling.
Once you succeed, you will see in the mondrian_sql.log that Mondrian is in fact using your aggregated tables (and certainly you will realize that the execution is way faster).

Aggregated table for country

We have the following MDX query which gives us the top 3 countries by week:

SELECT 
[Measures].[Distinct Users]
ON COLUMNS,
Generate (

{LastPeriods(
6,
[Login Time Weekly].[{week1_year}].[{week1_weeknr}]
)}
,
{[Login Time Weekly].CurrentMember}*
TopCount(
[Country.All ISO Country Codes].Children,
3,

[Measures].[Distinct Users]

),
ALL
)
ON ROWS
FROM
[Logins]

Again, we create an Xaction, execute it on the website and check mondrian_sql.log. Then we decide to create the following table:

-- country / week -------------------------------

DROP TABLE IF EXISTS `agg_country_week_sn_login`;

CREATE TABLE `agg_country_week_sn_login` (
`iso_country_code` VARCHAR(100),
`TW_the_year` INTEGER(11),
`the_week_of_the_year` INTEGER(11),
`distinct_users` INTEGER,
`sessions` DOUBLE,
`fact_count` INTEGER,
KEY `agg_country_week_iso_country_code` (`iso_country_code`),
KEY `agg_country_week_TW_the_year` (`TW_the_year`),
KEY `agg_country_week_the_week_of_the_year` (`the_week_of_the_year`)    
);
 
 INSERT INTO `agg_country_week_sn_login` (
`iso_country_code`,
`TW_the_year`,
`the_week_of_the_year`,
`distinct_users`,
`sessions`,
`fact_count`)
select 
`sn_lt_countries`.`iso_country_code` as `iso_country_code`,
`sn_lt_time_dimension_3`.`the_year` as `TW_the_year`, 
`sn_lt_time_dimension_3`.`the_week_of_the_year` as `the_week_of_the_year`, 
count(distinct `sn_login`.`msisdn`) as `distinct_users`, 
sum(`sn_login`.`sessions`) as `sessions`, 
count(*) as `fact_count`
from 
`sn_login` as `sn_login`,
`sn_user` as `sn_user`,
`sn_lt_time_dimension` as `sn_lt_time_dimension_3`,
`sn_lt_countries` as `sn_lt_countries`
where
`sn_login`.`msisdn`=`sn_user`.`msisdn` AND
`sn_login`.`login_date_id` = `sn_lt_time_dimension_3`.`time_dimension_id` AND
`sn_lt_countries`.`country_id` = `sn_user`.`country_id`
group by 1,2,3
;

As you can see, this is a bit more complex: we have to join four tables in total, and the iso_country_code is derived by joining three tables.
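
A rough sanity check for this more complex aggregate: the sum of fact_count should equal the number of fact rows that join successfully to the user, time and country dimensions (again just a minimal sketch using the tables above):

select
(select sum(`fact_count`) from `agg_country_week_sn_login`) as `agg_fact_count`,
(select count(*) 
from `sn_login`, `sn_user`, `sn_lt_time_dimension`, `sn_lt_countries`
where `sn_login`.`msisdn` = `sn_user`.`msisdn` 
AND `sn_login`.`login_date_id` = `sn_lt_time_dimension`.`time_dimension_id` 
AND `sn_lt_countries`.`country_id` = `sn_user`.`country_id`) as `base_fact_count`
;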

We add the following part to the cube definition (XML schema file):

<Table name="sn_login">
       <AggName name="agg_channel_week_sn_login">
           <AggFactCount column="fact_count">
           </AggFactCount>
           <AggForeignKey factColumn="login_channel_id" aggColumn="login_channel_id"/>
           <AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
           </AggMeasure>
           <AggMeasure column="sessions" name="[Measures].[Sessions]">
           </AggMeasure>
           <AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
           </AggLevel>
           <AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
           </AggLevel>
       </AggName>
       <AggName name="agg_week_sn_login">
           <AggFactCount column="fact_count">
           </AggFactCount>
           <AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
           </AggMeasure>
           <AggMeasure column="sessions" name="[Measures].[Sessions]">
           </AggMeasure>
           <AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
           </AggLevel>
           <AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
           </AggLevel>
       </AggName>

            <AggName name="agg_country_week_sn_login">
           <AggFactCount column="fact_count">
           </AggFactCount>
           <AggMeasure column="distinct_users" name="[Measures].[Distinct Users]">
           </AggMeasure>
           <AggMeasure column="sessions" name="[Measures].[Sessions]">
           </AggMeasure>
           <AggLevel column="TW_the_year" name="[Login Time Weekly.All Times Weekly].[Year]">
           </AggLevel>
           <AggLevel column="the_week_of_the_year" name="[Login Time Weekly.All Times Weekly].[Week]">
           </AggLevel>
           <AggLevel column="iso_country_code" name="[Country.All ISO Country Codes].[ISO Country Code]">
           </AggLevel>
       </AggName>

</Table>

You see that in this case we don't make use of an ID, as the country ID is not directly stored in our fact table. We have again collapsed the time dimension to the week level. 
We save everything, export it to the BI Server and run our Xaction. If we get an error, we check the log; if not, we are pleased that we finally got everything running!

Make use of aggregates in other programs

Other programs or packages usually don't take advantage of Mondrian aggregated tables by default. They usually provide a mondrian.properties file to configure the exact settings. Just open the properties file and enable the following properties:

# Allow the use of aggregates
mondrian.rolap.aggregates.Use=true
mondrian.rolap.aggregates.Read=true

Please find below a reference for some of the programs so that you can easily find the mondrian.properties file:
  • Pentaho Data Integration (Kettle): data-integration/libext/mondrian/config/mondrian.properties
  • Pentaho Report Designer: report-designer/resources/mondrian.properties
  • Schema Workbench: schema-workbench/mondrian.properties

This article is just intended as a brief introduction. In no way do I claim that this is the best way to create aggregated tables; it should only give you an idea of how it all works, and you will have to find the best way for your own solution. I found no tutorial on the web on how to do this and it cost me a lot of time to figure out how to get everything working, so I hope that with the help of this article you won't go through the same troubles. Again, creating everything manually is not easy, so always check your spelling! Also read the excellent documentation by Julian Hyde, which you can find here.


Saturday, 7 May 2011

Pentaho Data Integration: Supplying Variables to Shell Scripts

Posted on 15:05 by Unknown
Pentaho Data Integration (Kettle): Supplying Kettle Variables to Shell Scripts

Tutorial Details
  • Software: PDI/Kettle 4.1 (download here)
  • Knowledge: Intermediate (To follow this tutorial you should have good knowledge of the software and hence not every single step will be described)
  • OS: Linux or Mac OS X
  • Tutorial files can be downloaded here

  1. A simple example
  2. Kettle / InfiniDB Series: Using JNDI details with colxml

Last time we had a look at Supplying command line arguments to Kettle. This time we take a look at supplying Kettle variables to the command line or shell scripts. 

We will start this tutorial with a very general example (so that everybody can set it up) and in the second half we will focus on a quite specific implementation.

Before we start, we should evaluate the situation. Shell scripts give you huge power in regard to what you can achieve with them. It's easy to fall into the temptation of making use of this power, but is it to the benefit of your ETL process?

My advice is to do as much as possible with the dedicated transformation steps and job entries (file and folder management, etc). The reasons are the following (to mention some of the most important ones):
  • Steps are in most cases OS independent. Shell scripts are OS dependent.
  • Designing everything within one application (it is much easier to understand what each dedicated step does than to decipher custom code)
  • Easier maintenance

Hence, use the shell script job entry only in situations where Kettle doesn't offer the functionality as a step or job entry.

A simple example

The intention of this example is that you can set it up easily and quickly learn how to pass variables on to a shell script. I strongly advise you not to use this example in a real-world ETL job but to use the dedicated Kettle job entry instead.

Keeping this in mind, we will create a transformation that supplies a folder name to a shell script and the shell script will then create a folder with exactly this name. 

Now there are two ways to specify a script for the "Execute a shell script" job entry. One way is to insert the script directly in the job entry properties, but this approach doesn't allow you to specify any variables. To use variables, you have to create a dedicated shell file and reference it in the job entry properties.

  1. Open Spoon (Kettle) and create a new job
  2. Save the job as jb_supply_var_to_shell_script_simple in a convenient folder.
  3. Open the Terminal and navigate to the same folder. Create a file called create_folder.sh using vi:
    $ vi create_folder.sh

    Note: If you create the shell file in a Windows environment and then copy it over to a Linux environment, make sure that you first get rid of the control characters in the file. One utility which can help you with this is dos2unix. If you don't have it installed, use your favourite package manager (e.g. $ yum install dos2unix) to install it. Once it is installed, run: 
    dos2unix create_folder.sh

  4. Insert the following (Press i to insert):
    mkdir $1
    $1 references the first parameter that is supplied via the command line to the shell script.
  5. Make the file executable:
    $ chmod a+x create_folder.sh
  6. Go back to Kettle and drag and drop a Start and Execute a shell script job entry onto the canvas
  7. Create a hop between these job entries
  8. We will now set up the folder name variable and specify a default value. Press CTRL+J to call the Job Properties dialog. Click on the Parameters tab and specify VAR_FOLDER_NAME as a parameter with test as the default value. Click OK.

  9. Double click on the Execute a shell script job entry

  10. Click on the Script file name input field; then press CTRL+Space and choose the internal job filename directory. Click Browse and choose the shell file we just created. Note: Mac OS X users - the shortcut doesn't work, because CTRL+Space is reserved for Spotlight. Either disable the shortcut or manually write the variable name.
  11. For the Working directory specify the internal job filename directory variable as well.
  12. In the Fields section supply the ${VAR_FOLDER_NAME} variable.
  13. Save the job and execute it. If you take a look at your folder, you will see that a new subfolder named "test" was created.


Kettle / InfiniDB Series: Using Kettle JNDI details with colxml

Although I make an attempt to provide some basic information below, I recommend reading my tutorial Pentaho Data Integration and Infinidb Series: Bulk Upload to get some more information about InfiniDB. This will help you understand this section better.

InfiniDB is one of the more popular column-oriented databases. Currently Kettle doesn't have a dedicated InfiniDB bulk upload step. The fastest way to load data into InfiniDB is to use its command line utility called cpimport. cpimport only accepts job information in the form of an XML file, which can be generated with the colxml utility. Both cpimport and colxml reside in a sub-directory of the main InfiniDB installation and are usually only executable by the root user. In a nutshell, you tell colxml into which database (and optionally table, with what kind of delimiter, etc.) you want your data file to be loaded and it spits out a nicely formatted XML file which cpimport uses to process the data.
Now all this is an ideal candidate for the Execute a shell script job entry, because Kettle doesn't provide this functionality out of the box (yet).

Now that we know how the process works, let's focus on our ETL process. If you want to follow this section, you can download an open source version of InfiniDB here (you can find detailed installation instructions on the same website).

If you are working with a lot of ETL files it is recommended to use JNDI connection details to make maintenance easier. So instead of specifying all the database connection details for every step that requires them, we can simply store all our database connection details in the simple-jndi/jdbc.properties file, assign a name to each connection and reference this name in the Kettle step. If the database connection details change, you then only have to change them in one place (namely the jdbc.properties file) and not in every Kettle step. In our example we will follow this approach.

The only tricky bit here is that colxml requires the database name. Kettle actually has a dedicated Properties Input step (thanks to Slawo for pointing this out) which allows you to read the jdbc.properties file, and with a bit of string manipulation we can extract the database name.

For the example below to work you have to create a table called fact_impression on an InfiniDB database and have a working transformation that outputs a pipe-separated file called fact_impression.tbl which we can bulk upload to the fact_impression table. Moreover, you have to define the database connection details in {KETTLE_ROOT}/simple-jndi/jdbc.properties.

Now let's get started. I will only describe the main points. Please find below an overview of the job:
The flow is as follows:
  1. We provide two parameters to this job: ${VAR_DWH} which is the JNDI name we want to use for our ETL job and ${VAR_ETL_FILE} which is unsurprisingly the name of the main ETL transformation that we want to use in this job. Press CTRL+J to see the job properties. Click on the Parameters tab to see these defined parameters. The idea is that these parameters are supplied on the command line when kitchen.sh is called (see Supplying command line arguments to Kettle for more details).
  2. The tr_get_jndi_properties transformation reads the jdbc.properties file and extracts all the database connection details for the JNDI name defined in ${VAR_DWH}. To achieve this we use some regular expressions (this technique is described in my Using Regular Expressions with Pentaho Data Integration tutorial). Note that at the end of this transformation we define the database name as a variable (dwh_database).

  3. All data files for bulk upload have to be stored in /usr/local/Calpont/data/bulk/data/import. In this case we are a bit cautious and delete any old *.tbl files in this directory.
  4. We execute our main ETL transformation (the filename is defined in ${VAR_ETL_FILE}). This transformation has to output a CSV file with the name fact_impression.tbl into /usr/local/Calpont/data/bulk/data/import. Make sure it is pipe delimited.
  5. The next step prepares the Infinidb job file. The infinidb_colxml.sh Shell script job entry calls the external shell file infinidb_colxml.sh. This shell file includes the following:

    rm -rf  /usr/local/Calpont/data/bulk/job/Job_9991.xml
    [ -f /usr/local/Calpont/data/bulk/data/import/fact_impression.tbl ] &&  /usr/local/Calpont/bin/colxml $1 -j 9991 -t fact_impression


    This shell script basically deletes an existing InfiniDB job file and, if fact_impression.tbl (the CSV output file of our main transformation) exists, it creates a new InfiniDB job file. Please find all the details about this in my tutorial Pentaho Data Integration and Infinidb Series: Bulk Upload. As you can see, the shell script takes an argument $1, which accepts the first parameter supplied by Kettle, in this case dwh_database. Hence the order in which you specify the arguments in the Shell script job entry is really important.
    Note that we define the dwh_database variable as an argument in the Shell script job entry.
  6. The Load Tables Shell script step calls cpimport and initiates a bulk upload.
  7. The last step is not really necessary, but here we delete all data files again in the bulk upload directory.

As you can see, Kettle really offers powerful functionality. The important point is to keep in mind to only use shell scripts in cases when Kettle doesn't provide a dedicated step or job entry.

If you want to learn more about Shell scripting, consider this free book.


Friday, 15 April 2011

Pentaho Data Integration: Excel Writer Step (Plugin)

Posted on 12:58 by Unknown
Pentaho Data Integration (Kettle): Excel Writer Step

One great thing about Kettle is that everybody with Java coding skills can just write a plugin for it and hence extend its functionality. One plugin that I want to highlight is Slawo's Excel Writer Step which in my opinion is extremely useful. Why? 

  • Quick and easy report creation: Design your report in Excel and tell the Excel Writer step where to place the data. Here's a really cool feature: you can tell the step in which cell to place the data, so you can even use the step several times to write to different areas within the same spreadsheet. Learn how to do this by reading Slawo's excellent tutorial.
  • Report bursting can now be set up within Kettle: Imagine a scenario where each country manager only receives a report with data related to their country. You can now set up a process in Kettle that takes care of this.
  • There is always a business user who loves Excel above everything, even if you have the fanciest BI website. You can output reports to Excel using Pentaho Report Designer, but in this case the charts will only be images. If native Excel charts are required, the Kettle Excel plugin is your best friend.

You can download the Excel Writer Plugin from here.

SQLPower Wabit

Posted on 12:52 by Unknown

SQLPower Wabit: Is it the one-reporting-tool-does-it-all solution?


There are quite different kinds of reports, ranging from dashboard/high-level summary reports to very granular data tables spanning 20 or so pages.

As a report designer you usually have to satisfy various clients' needs: the Operations guy wants to have a very detailed report, the CEO a high-level summary/dashboard style report, and so on.

Now Pentaho Report Designer (PRD) is an excellent tool with millions of formatting options, but it is more geared towards very granular/tabular output. If you want to create a dashboard style report, you have to use sub reports. You currently cannot design sub reports within the same window, which is quite inconvenient, especially for layout purposes. Plus, every new report that you create has by default the reporting sections required for a tabular output style (report header, details, report footer) ... so what am I supposed to do if I only want to use one chart? Do I place it in the header, the details or the footer?
I guess you know where I am getting at: Why not start with a blank canvas and let the user decide which reporting object(s) they want to use?
As much as I like the Pentaho Report Designer for very granular reports, I still find it inexplicable why there is no functionality that allows dashboard style reports (it's more of a user interface problem, I guess). I am not talking about interactive dashboards like the ones offered by Pentaho CDF (Community Dashboard Framework), just a combination of static charts, summary tables, crosstabs, etc.

While CDF has really great functionality, it's not possible to email the dashboard report (imagine a scenario when the CEO is on holidays and wants his high level summary to be delivered by email). 

I had a quick 5-minute look at iReport, which seems to have the same shortcomings as PRD. So none of them were really what I was looking for - the one-reporting-tool-does-it-all solution. I kept on searching and found SQLPower Wabit:

Reading the specs on their website, this just seemed to be the tool I was looking for. I installed it and within 5 minutes I had created a dashboard style report without even reading a manual (which says a lot about the UI). My report sourced data from Mondrian for a crosstab, and from a MySQL DB for a chart and a summary table. I didn't have to bother with subreports; I could just place the required reporting objects on the canvas, and I could also use more than one data source for the report objects. Nice! Actually, very nice!

Coming back to the initial question: Is it the one-reporting-tool-does-it-all solution? For now SQLPower doesn't seem to offer the millions of formatting options that PRD and iReport have, but then it is only at version 1.3.4 and my report looked quite nice for the 5 minutes I spent on it.

Overall, SQLPower Wabit seems to be a very promising product (did I mention that there is an open source version?). It even offers a visual interface to query OLAP cubes. You can download it from here and give it a test ride.
