2016-01-17

My Data Warehouse



(Photo: Stockholm, 2016-01-17)
Recently I looked into some Data Warehouse systems, and I learned there is nothing on the market that compares with my Data Warehouse. I would not say mine is better, only that it is very different from the others. Personally I'm convinced my Data Warehouse concept is better, but I suppose that is what all of us who create Data Warehouses say.

Most Data Warehouse architectures are based on multidimensional cubes with a fact table in the middle. This model was conceived in the late seventies, when the capabilities of the infrastructure were much weaker than today. Cubes have some intrinsic merits: they give a logical structure to the data, have low redundancy and are economical with disk space. But the disadvantages of cubes outweigh the strengths. Cubes are hard to understand and decode; very few non-specialists can extract data directly from them. Cubes are complex to maintain, and they are slow to access. I prefer simple two-dimensional tables containing all the data needed for a report. Tables are simple to understand and easily accessible from spreadsheets, which are in every business user's toolbox. Data in my warehouse is organised in tables, not in cubes.
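To make the contrast concrete, here is a sketch; the tables and columns are invented for the example, not taken from any real schema. First the cube/star-schema version, where the user must know the fact table and its join paths, then the same question against a flat report table:

    -- Star schema: the user must join the fact table to its dimensions.
    SELECT d.year, p.product_group, SUM(f.amount) AS sales
    FROM   sales_fact f
    JOIN   date_dim    d ON d.date_key    = f.date_key
    JOIN   product_dim p ON p.product_key = f.product_key
    GROUP  BY d.year, p.product_group;

    -- Flat report table: one table, no joins, spreadsheet-friendly.
    SELECT year, product_group, SUM(amount) AS sales
    FROM   sales_report
    GROUP  BY year, product_group;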

Most data warehouses store data in relational databases, which are accessed and manipulated via SQL statements. For some reason most Data Warehouse systems go to great lengths to conceal SQL from the Data Warehouse admins; instead they invent their own access language, or extend a common programming language with their own access libraries. I created a workflow engine and scheduler that focuses on executing jobs with a minimum of 'macro code'; data manipulation is done directly in SQL. The entire ETL process is written in SQL, not in an obscure language only insiders understand.
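As an illustration (the table names are invented for the example), an ETL step in this style is just an ordinary SQL statement that the scheduler executes, with no macro language wrapped around it:

    -- Load today's sales from staging into the flat report table.
    INSERT INTO sales_report (year, product_group, amount)
    SELECT EXTRACT(YEAR FROM s.sale_date), p.product_group, s.amount
    FROM   staging_sales s
    JOIN   product_dim p ON p.product_key = s.product_key
    WHERE  s.sale_date = CURRENT_DATE;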

Data Warehouse ETL processes often handle very large amounts of data, so running workflows in parallel and splitting workflows into parallel 'subflows' is of paramount importance. My implementation of parallelism via piggyback iterators is the simplest and most elegant parallel programming algorithm I am aware of, and it is a far cry from traditional parallel programming idioms (the next post, below, walks through a concrete example).

My Data Warehouse: two-dimensional tables, SQL, and simple, sound parallelism.

2016-01-05

Balanced parallel execution

I have written some posts on the importance of parallel execution and how to program parallel integration workflows with my Integration Tag Language. A workflow, called a <schedule>, often consists of a number of workflow steps called <job>. Here is a 'bare bone' schedule with two jobs processing a file with 100 rows.
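A minimal sketch; only the <schedule> and <job> tags are named above, so the name attributes and the <sql> step inside each job are my shorthand for whatever work the jobs do:

    <schedule name='bare_bone'>
      <job name='job1'>
        <sql> <!-- first pass over the 100-row file --> </sql>
      </job>
      <job name='job2'>
        <sql> <!-- second, independent pass over the same file --> </sql>
      </job>
    </schedule>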
When this schedule executes, the jobs run one by one in sequence. If the jobs are independent of each other, it is better to execute them in parallel; this is achieved by adding the parallel attribute to the jobs.
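Assuming the attribute takes a yes/no value (the post names the attribute but not its values), it would look like this:

    <schedule name='bare_bone'>
      <job name='job1' parallel='yes'>
        <sql> <!-- as before --> </sql>
      </job>
      <job name='job2' parallel='yes'>
        <sql> <!-- as before --> </sql>
      </job>
    </schedule>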


If job 1 takes 1 minute and job 2 takes 10 minutes, you have not gained much by running the two jobs in parallel; the jobs are unbalanced from a parallelization point of view. If we break job 2 down into 10 parallel subprocesses, the two jobs are balanced. You do this by introducing a job iterator and piggyback-connecting the myiterator to the jobiterator.
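The jobiterator, the myiterator and the piggyback connection are named in the text; the tag spellings and the chunk attribute below are my guesses at the ITL syntax:

    <schedule name='bare_bone'>
      <job name='job1' parallel='yes'>
        <sql> <!-- the 1-minute job, unchanged --> </sql>
      </job>
      <job name='job2' parallel='yes'>
        <!-- the jobiterator cuts the 100-row file into 10 chunks
             and runs them as parallel subflows -->
        <jobiterator chunks='10' parallel='yes'>
          <!-- the job's myiterator piggybacks on the jobiterator,
               so each subflow sees only its 10-row chunk -->
          <myiterator piggyback='jobiterator'/>
          <sql> <!-- the 10-minute work, now about 1 minute per chunk --> </sql>
        </jobiterator>
      </job>
    </schedule>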
Here the jobiterator cuts the file into 10 chunks and executes them all in parallel; each chunk is given to its own myiterator. In theory both jobs should now finish in about 1 minute each, which is what I refer to as a balanced parallel workflow. The wall-clock time drops from 11 minutes to about 2, or close to 1 if the two jobs overlap fully: a reduction of at least 82%, with very little effort.
In real life not many workflows behave as nicely as this one; you may have to adjust the chunks a bit to achieve roughly equal execution times.