2014-06-06

Why are ETL procedures so complex?

Of course I think my way is better, otherwise I would have walked another way.


The last couple of years I have looked into some modern ETL products. One thing that strikes me is the complexity of these packages, or rather that setting up ETL processes is, in my view, overly complex. These systems have drag-and-drop, point-and-click graphical user interfaces. You build workflows by dragging job steps into a pane and connecting them with arrows, connectors and so on; I think you get the picture. This looks very appealing: with just a few mouse clicks you have created an ETL workflow, no dull programming at all. But this is just an empty skeleton, which you have to dress up with real functionality, often in a clunky scripting language specially crafted for the ETL product.

Then you need to define metadata and map source data onto the ETL data store (which most often is a Data Warehouse). This mapping is typically done in stages, gradually refining the data into extended super duper information cubes. These mapping steps are often a combination of drag and drop, click, and inserting clunky code, which can be surprisingly complex, and the clunky code is spread all over. SQL, which is still the lingua franca of data stores, is often hidden away from the workflow developers in favor of a clunky scripting language. A human analogy is learning some esoteric dialect of Klingon instead of English. The analogy limps, but I think you get the idea.
SQL is a precise, succinct and mostly logical language (it has its quirks, I know). Replacing SQL with a clunky language and some graphical symbols is not good for anyone, except those who thrive on the ETL tool.
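To illustrate what I mean: a typical mapping step is one readable statement in SQL. This is a made-up example, every table and column name is invented for illustration:

-- hypothetical staging-to-dimension mapping: one statement, no boxes, no arrows
INSERT INTO dim_customer (customer_id, customer_name, country_code)
SELECT cust_no, cust_name, country
FROM   stage_customer
WHERE  deleted_flag = ' ';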


One big problem for enterprises today is analysing the information in their ERP systems. The first hurdle is to grab the information and get it out of the source ERP systems, so it can be massaged and imported into a Data Warehouse. Having a clunky tool to extract data from the source system does not help much. I think those of you who work with this recognize that it takes days, if not weeks, to get a new simple piece of information from request until it is usable in the Data Warehouse. It is not only the complexity of the ETL tool that is to blame; the entire governance of these business and IT processes seldom supports agile and rapid development.
I'm not pointing fingers at any product or organisation; it is more a dissatisfaction with how ETL tools and corporate organisations work in general.


This post emanated from a discussion I had with some attendees at a large international Business Intelligence user meeting, and from a case of missing country information in our Data Warehouse.
I needed the country names for a report, and to my stupefaction I found this info missing in the Data Warehouse. I do not like hard coding, so I looked around for country info. Via Google I found the info I needed in SAP table T005T. Then I searched for an existing Data Warehouse workflow that could act as a template, and I found this one:


<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<schedule notify='admfail.xml' logmsg='SAP Plant/Branch Info'>
   <variant name='sap' validate='acta_prod,acta_test' default='acta_prod'/>
   <tag name='DW_DB' value='LOGISTICSBI'/>
   <job name='PlantInfo' type='script' pgm='sap2.php'>
      <tag name='SAPTABLE' value='T001W'/>
      <sap>
         <rfc>
            <name>Z_LJ_READ_TABLE</name>
            <import>
               ('QUERY_TABLE','@SAPTABLE')
               ,('DELIMITER',';')
            </import>
         </rfc>
      </sap>
      <sql>
         <autoload>replace</autoload><database>@DW_DB</database><truncate>yes</truncate>
         <table>@SAPTABLE</table>
      </sql>
   </job>
</schedule>
  
This workflow doesn’t look sexy, I know; you may even call it clunky, but to me it is just plain simple, and I find simplicity beautiful. I call it ITL, an XML-based language designed for computer back-end or batch workflows. Two features of ITL: it is plain text, and a workflow is contained in one script, so it is easy to create new workflows by copying. We seldom create new ITL scripts from scratch; we copy old workflows to create new ones.


A bowl of spaghetti is a bowl of spaghetti, no matter if it is code or graphical connectors.
To develop in ITL you need a basic understanding of XML, and you should be well versed in SQL. ITL per se is small and simple (it has its quirks though), but as a workflow control language it is immensely powerful.
To understand the SAP connector used in this example you have to have a good understanding of SAP, of SAP Remote Function Calls, and of how those functions are structured.
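For orientation, and this is my reading rather than documentation, since Z_LJ_READ_TABLE is a custom Z-function: it behaves much like SAP's standard RFC_READ_TABLE, which takes the same QUERY_TABLE and DELIMITER import parameters you see in the <import> block above. Inside SAP it does roughly the equivalent of

-- conceptual only, not literal code; each row comes back as one ';'-delimited character string
SELECT * FROM T001W;

and ships the delimited rows back over the RFC connection, where the <sql> section loads them into the Data Warehouse.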
To be a successful ITL developer you need programming ability, and this is very deliberate. Over the years I have found programmers superior at constructing logical and coherent workflows. This should come as no surprise; programmers are trained for that. When I look for Business Intelligence developers I look for business people with programming ability; this is probably the prime factor behind the success of the Data Warehouse.


Anyway I copied and modified the workflow above like this:
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<schedule notify='admfail.xml' logmsg='SAP Country Info'>
   <variant name='sap' validate='acta_prod,acta_test' default='acta_prod'/>
   <tag name='DW_DB' value='MASTERDATA'/>
   <job name='CountryInfo' type='script' pgm='sap2.php'>
      <tag name='SAPTABLE' value='T005T'/>
      <sap>
         <rfc>
            <name>Z_LJ_READ_TABLE</name>
            <import>
               ('QUERY_TABLE','@SAPTABLE')
               ,('DELIMITER',';')
            </import>
         </rfc>
      </sap>
      <sql>
         <autoload>replace</autoload><database>@DW_DB</database><truncate>yes</truncate>
         <table>@SAPTABLE</table>
      </sql>
   </job>
</schedule>


The logmsg I changed to 'SAP Country Info'.
The DW_DB tag I changed to MASTERDATA.
The job name I changed to CountryInfo.
The SAPTABLE tag I changed to T005T.


Then I saved this new workflow as rfcReadT005T.xml.
Finally I ran the new workflow schedule manually:
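The manual start looks something like this (my reconstruction of the invocation; the flags mirror the scriptS.php calls in the cron script further down):

./scriptS.php schedule=rfcReadT005T.xml onscreen=yes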


And this is the result:


The entire development and execution of this data extraction from SAP took less than 15 minutes. It is a very simple workflow, yes, absolutely, but for most organisations 15 minutes is still fast; very, very fast. And that is extracting data from SAP, a system most people consider closed and almost impossible to get data out of. SAP is my system of preference when I need data. SAP is very open.
If you look closely at the workflows you find a line:
<variant name='sap' validate='acta_prod,acta_test' default='acta_prod'/>
This is a runtime parameter definition (sap) that defaults to ‘acta_prod’, which points to a SAP production system. We do most Data Warehouse or Business Intelligence development in production environments. If you want to be agile, you do not have time to transport programs from test via quality to production. When we need a Data Warehouse test environment we create an ad hoc data mart for the purpose. We address the SAP test environment by adding “sap=acta_test” when we start the workflow.
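So a run against the SAP test system is the same invocation as the manual run above, with the runtime parameter appended (again illustrative):

./scriptS.php schedule=rfcReadT005T.xml sap=acta_test onscreen=yes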


You can also see that there are no regular SQL statements in this workflow; mapping source data into Data Warehouse SQL tables is a tiresome process we gladly avoid. But there is nothing stopping you from importing the data with regular SQL, and sometimes we do; there are always edge cases our automatic import does not support. This is important: regular SQL is always supported.
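As an illustration of what such an explicit import could look like: instead of letting autoload create and fill MASTERDATA.T005T as-is, you could map the extracted rows into a table of your own design, something like this (COUNTRY_DIM is invented for the example; LAND1, LANDX and SPRAS are the real T005T columns for country code, country name and language key):

-- hypothetical explicit mapping instead of the autoload
INSERT INTO COUNTRY_DIM (COUNTRY_CODE, COUNTRY_NAME)
SELECT LAND1, LANDX
FROM   T005T
WHERE  SPRAS = 'E';   -- keep only the English texts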


Finally I scheduled this new workflow rfcReadT005T.xml for monthly execution by inserting a new line in a monthly Cron shell script:


#!/bin/bash

echo "starting mngr_month_rfc"
date
pwd

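# country info extraction from SAP table T005T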
nohup ./scriptS.php schedule=rfcReadT005T.xml logmode=warning onscreen=no &

./scriptS.php schedule=Quota_month.xml logmode=warning onscreen=no
./scriptS.php schedule=control_month.xml logmode=warning onscreen=no
nohup ./scriptS.php schedule=jm2ordstockmonth.xml logmode=warning onscreen=no &
./scriptS.php schedule=PURCHASINGBI_MONTH.xml logmode=warning

#porder extra for Tierp factory monthly closing
./scriptS.php schedule=linux_db3prodorder.xml logmode=warning
./scriptS.php schedule=Siewertz_report_2.xml logmode=warning onscreen=no
./scriptS.php schedule=Siewertz_report.xml logmode=warning onscreen=no
./scriptS.php schedule=purchasing_paydays_update.xml logmode=warning
./scriptS.php schedule=purchasing_paydays_updateCB.xml logmode=warning
./scriptS.php schedule=mail_CZT_NOTSC_MONTH.xml logmode=warning
./scriptS.php schedule=mail_IYI_STAT.xml logmode=warning
./scriptS.php schedule=mail_NORDIC_MONTH.xml logmode=warning
./scriptS.php schedule=cb1artstatinventory_month.xml logmode=warning  
./scriptS.php schedule=db3artstatinventory_month.xml logmode=warning
./scriptS.php schedule=henriktest_op_registeraccmonth.xml logmode=warning
 
# Don't forget to set correct prereqs when adding a schedule here
echo "script completed"


This took me another 5 minutes to do since I’m a very slow typist.

This is a tiny bit of the month-end closing activity; here you can see a regular day's night activity, here you can see some stats, and here you can follow the progression of the Data Warehouse.


p.s.
I will use the new country information for a new small app; I will hopefully blog about this later on. This new app is more interesting than the dull extraction workflow shown in this post.

2 comments:

  1. Lars, interesting review - I do believe in this agile approach you took. Mind you, we are in the minority :) I am also glad to have learned about lostalgia - it will certainly appear in some of my apps. Though I would still like to point you to proper integration approaches which, if fine-tuned, would get rid of this ETL thing altogether. IMHO, good integration in the background, good data model and loose binding of lookups (such as the mentioned country list) would solve many issues of this type. But as you said - agile is the key here as well.

    Petr


  2. I cannot but agree, Petr.
    I like small apps with rigid interfaces glued together with robust integration schemes, like Lego blocks. If this is done well, ETL would not be needed.

    I am waiting for you to start blogging, Petr :) You have great ideas and opinions on most things, and you write elegantly.
