2012-05-20

User triggered SAP data loading into a MySQL database from MS Excel

A strategy for real-time reporting and loading of the Business Intelligence layer.

From a button in an Excel sheet, a signal is sent to a Linux ETL system, which starts a process to fetch ‘delta data’ from a SAP system and update the Business Intelligence data storage, in this case a MySQL database.

In a previous post I have written about the importance of performant ETL processes (see When fast is not enough); you may also want to look at extracting data from SAP and scheduling jobs with PHP. In this post I take this a bit further and show how you can do real-time BI reporting with the help of a performant ETL and Visual Basic in MS Excel.
The example this post is based on is a Proof of Concept and there are still details not fully worked out, but testing so far has been successful.

The merge delta load method

Combining a separate business intelligence (BI) system with real-time reporting is easier said than done. Since the BI data is physically separated, there will always be latency: the time it takes to transfer data from the source system to the BI layer. The only way to achieve truly real-time reporting is to do the reporting in the source ERP system itself. But there are strategies to minimize latency and achieve pseudo-online reporting, where the latency is acceptable to the users. Normally you pull data from your source ERP systems at regular intervals, e.g. each night, week or month. To approach real-time reporting you can increase the pull frequency; extracting data each hour gives better reconciliation than once a day. However, the strain on the computer systems and networks increases with the pull frequency, and this method still does not give online reporting.

The merge delta load method first fetches data from the BI system, then fetches the delta data from the source ERP system, and finally merges the two data streams into the report. Although this method gives good real-time reporting, it has several disadvantages. It adds complexity to the client, since the client must be able to contact the ERP system and merge the report with the ERP delta data. Each time the report is requested, the delta data has to be extracted from the ERP system, so every report suffers the response-time delay of retrieving data from the source ERP system. And since the BI layer itself is not updated, the delta load grows over time until the next BI layer update.
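To make the merge step concrete, the sketch below shows the kind of client-side query a merge delta client might end up running. It is an illustration only; the table and column names (bi_copa, erp_delta_staging and so on) are assumptions, not taken from any actual system.

' Hypothetical client-side merge: union the BI base data with delta rows
' the client has just extracted from the ERP system into a staging table.
Dim mergeSql As String
mergeSql = "SELECT period, product, revenue FROM bi_copa " & _
           "UNION ALL " & _
           "SELECT period, product, revenue FROM erp_delta_staging"
' Every client must repeat the ERP delta extraction before this query
' means anything; that per-client rework is what this post argues against.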
 
Figure 1a                                                                                                         
Figure 1a shows this scenario. If we now add a user, as in Figure 1b, we double the delta extraction from the source ERP system, since each user has to fetch the data for his/her report(s).
Figure 1b
With a large number of users this method may extract the same data over and over again from the ERP system. If we add another source ERP system, as in Figure 1c, the solution becomes very messy: now the BI client must be able to connect to the different source systems and extract and merge all the data streams with the base report. The merge delta load method is not very elegant, and it scales poorly.
Figure 1c

Another approach – user triggered Delta Load

In this example I have taken another approach to real-time reporting. The merge delta load method is complex and requires special client software to delta load and merge the different data streams. Another approach is to run the ETL process more frequently, but that will potentially create a lot of overhead by running many delta load ETLs for no purpose at all. Instead of firing off frequent loads, we let the user trigger the ETL. This solves the problem of running meaningless ETL data loads, since we only run the ETL on user request, and all users benefit from the delta load because data is only extracted once. Instead of retrieving delta data from the source ERP systems, the client sends a signal to the ETL system to start a delta load process and, when the ETL process is finished, requests the report (see Figure 2a).
 
Figure 2a
This request-triggered update will create a report that is almost real time, provided the ETL process is fast. Unfortunately the method requires event-based VBA programming in the Excel client, which is currently above the skill level of the author. There is a simple yet elegant way to overcome this technical problem: give the user an update button and let the user trigger the update.
User triggering simplifies the real-time reporting process, but the workflow can be depicted the same way as request triggering; the only difference is that data extraction from the source systems occurs only when requested by the user. When we add a user (Figure 2b) we start to see the benefits compared with the merge delta load (Figure 1b).
Figure 2b
There is no double delta loading, since all data extraction updates the common BI layer only once, no matter how many users are added. Adding source systems does not change the picture much (see Figure 2c); we still have a simple and clean real-time reporting process. On the other hand, user triggering requires a performant ETL process, otherwise the response time will be slow and the report stale rather than real time. User triggering also requires strict synchronization, isolating individual ETL updates from one another. Below I show how I have implemented user triggering for real-time reporting.
Figure 2c

User triggered Delta Load – an example

In this example we report COPA data from a SAP system in near real time. The reporting is done within an Excel sheet.

The sheet

Figure 3
This is what the user sees when the Excel sheet is invoked. When the trigger button is pressed, a signal is sent to the ETL system to run the delta load process (see Figure 2a). The ETL status field shows the start, stop and outcome of the last ETL run [1]. The ETL status field is ‘click’ sensitive: click it and the status is updated. In this prototype the ETL process is controlled manually; press the ‘Extract from SAP’ button to fire off the ETL process, and click the ‘ETL status’ field to update the status. This is complex real-time reporting reduced to a simple, practical and scalable solution: a button and a field give the user the power of real-time reporting. (There is a lot more to this Excel sheet, but that is outside the scope of this post.)
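How the status field is made ‘click’ sensitive is not shown in Appendix A. One way to wire it up, sketched below, is to react to selection changes in the worksheet’s code module; the named range ETLStatus is an assumption, while displayLastUpdate is the routine called from the button code in Appendix A.

' One possible wiring, placed in the worksheet's code module: refresh the
' status whenever the user clicks (selects) the status cell. The named
' range "ETLStatus" is hypothetical.
Private Sub Worksheet_SelectionChange(ByVal Target As Range)
    If Not Intersect(Target, Me.Range("ETLStatus")) Is Nothing Then
        Call displayLastUpdate ' fetch the latest ETL status
    End If
End Sub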

Under the hood

There is more to it than the button and the field; there is a lot going on under the hood. The status field is simple: when clicked, it just retrieves the status from the ETL system via a simple ODBC SQL request. The ‘extract’ button is trickier: it must contact the ETL engine, which resides on a Linux server, and send the signal ‘run Delta Load’. First we establish a Secure Shell connection to the Linux server using Plink from the open source PuTTY suite, then we send the signal ‘run Delta Load’ and disconnect from Linux. The Excel VBA code that does the trick can be found in Appendix A.
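As a complement to Appendix A, here is a minimal sketch of what a status-lookup routine like displayLastUpdate could contain, using late-bound ADO over ODBC. The DSN, credentials, log table and column names are assumptions, not the actual schema.

' Hedged sketch: fetch the outcome of the last ETL run over ODBC.
Sub displayLastUpdate()
    Dim conn As Object, rs As Object
    Set conn = CreateObject("ADODB.Connection")
    conn.Open "DSN=BI_MySQL;UID=report;PWD=xxxxxx" ' hypothetical ODBC DSN
    Set rs = conn.Execute("SELECT started, finished, status " & _
                          "FROM etl_log ORDER BY started DESC LIMIT 1")
    If Not rs.EOF Then
        Range("ETLStatus").Value = rs.Fields("started").Value & " / " & _
                                   rs.Fields("finished").Value & " : " & _
                                   rs.Fields("status").Value
    End If
    rs.Close
    conn.Close
End Sub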
The Delta Load procedure is too large to discuss here; Figure 4 shows an abbreviated version outlining the job steps.
Figure 4
There is one line of special interest in the procedure: the <prereq type=’singleton’> restricts the execution of the procedure so it must run alone. If a sibling process is already running, this instance dies immediately. Look at Figure 2b: if both users send the ‘run ETL’ signal simultaneously, one must yield, and that is what this <prereq> does. You can queue requests up by adding a wait to the <prereq>. Since one ETL run serves all users, it is not necessary to run more than one ETL at a time. This is a very important simplification: simultaneous updating of large amounts of data is not trivial; it requires complex logic and slows down execution.
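Figure 4 is an image, but to illustrate, a minimal schedule carrying the singleton guard might look like the sketch below. Apart from the <prereq type=’singleton’> element quoted above, the tag and job names are assumptions about the ETL engine’s XML format, not the contents of the actual COPA_DeltaLoad02.xml.

<schedule name='COPA_DeltaLoad02'>
  <!-- Only one instance may run at a time; a second, concurrent
       instance dies immediately (add a wait here to queue instead). -->
  <prereq type='singleton'/>
  <job name='extract_COPA_delta_from_SAP'/>
  <job name='load_MySQL_BI_tables'/>
  <job name='update_aggregates'/>
</schedule>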

Response time

The Excel sheet (Figure 3) is a report generator that slices and dices the SAP COPA data. Normally the response time is around one second for generating a report, which is quite a feat since there is a lot of COPA data. This is achieved by ‘in-memory computing’: we try to keep data in memory with various caching techniques. When the COPA data is not in memory there may be an initial delay of less than a minute [2]. Since the ETL load time is added to the report response time (if you need an updated report), it is essential to keep the ETL load time low. The ETL load statistics (see Appendix B) show that you can normally load two to three thousand rows in about 15 seconds. Two background ETL load jobs a day will keep a user-triggered ETL load below an additional three thousand rows. This is probably a well balanced ETL load scheme, but in the end it is up to the user to decide; it is simple to change the ‘background frequency’.

Appendix A – Excel VBA code that sends the ‘run Delta Load’ signal
' Sleep is a Win32 API call and needs this declaration at module level
' (on 64-bit Office use "Private Declare PtrSafe Sub ...").
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

Private Sub CommandButton2_Click()
    Dim resVar As Variant
    Dim commspgmP As String, commspgm As String
    Dim hostuser As String, cmdtxt As String
    ' Path to PuTTY's command line SSH client, Plink.
    commspgmP = "C:\Program Files\PuTTY\plink.exe"
    commspgm = """" & commspgmP & """" ' quote the path, it contains spaces
    'MsgBox commspgm
    If Dir(commspgmP) = "" Then
        MsgBox "You must install " & commspgmP & " to update BI!" & _
            " http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html"
        Exit Sub
    End If
    ' SSH user and target host; the password is hard coded in this prototype.
    hostuser = " -v -l viola -pw xxxxxx DWETLserver"
    ' Remote command: start the Delta Load schedule on the ETL server.
    cmdtxt = " /home/tooljn/dw/pgm/scriptS.php schedule=COPA_DeltaLoad02.xml logmode=w onscreen=no"
    resVar = Shell("cmd /c " & commspgm & hostuser & cmdtxt, vbMinimizedFocus)
    'MsgBox commspgm & hostuser & cmdtxt & "  -  result=" & resVar
    Sleep 2000 ' Sleep for 2 seconds to allow the BI update to commence before we...
    Call displayLastUpdate ' ...refresh the ETL status field
End Sub
Comments on the code:
This is my first piece of VBA code and I have trial-and-errored it until it worked; I do not know VBA well.
The line hostuser = " -v -l viola -pw xxxxxx DWETLserver" is not very elegant: you should not have hard-coded credentials! In this case it is benign, however, since the user viola cannot do anything except run our ETL process COPA_DeltaLoad02.xml. Going into production this will be replaced by certificates (one possible step in that direction is sketched after this appendix). The user ‘viola’ is in honor of my late and very beloved granny Viola. She was from the north of Sweden; her father was chief engineer of the Boden fortress and an inventor. The German Wehrmacht tried to enlist him, but he declined, which I am grateful for, since that made it possible for my grandfather (an orphan with no means, from the south of Sweden) to meet Viola. He served as a sergeant at the fortress. After he married and my mother was born, he had a motorcycle accident and was relocated to Stockholm, administrating arms purchases at the Army HQ just 100 meters from my present flat (I am looking at the building while writing this). I am grateful for the accident, because that made it possible for my father to meet my mother. At that time he was an accountant, a job now largely replaced by Excel sheets.
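As noted above, the hard-coded password should go away. Plink supports key-file authentication with the -i option; below is a hedged sketch where the private key path C:\keys\viola.ppk is hypothetical and the key’s public half is assumed to be authorized for viola on DWETLserver.

' Hedged sketch: key-based authentication instead of a hard-coded password.
hostuser = " -v -l viola -i C:\keys\viola.ppk DWETLserver"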

Appendix B – ETL Delta load times

The table in Appendix B shows statistics from real ETL runs. Here you can see that the example in this article took 5 minutes and 45 seconds to run, and the performance was ~310 rows per second. The load velocity depends on the number of new lines (delta data), the concurrent network load and the responsiveness of the source (SAP) system. The setup time [3] is low, normally well below one second.


To keep the average response time acceptable we should probably schedule two background ETL runs a day, at 07.00 and noon, for the (SAP COPA) data in this example. These ETL runs will have a negligible impact on the infrastructure landscape and keep user-triggered loads to a maximum of a few thousand rows.
Before you schedule the background jobs you should analyze the data update pattern, i.e. when updates occur, and schedule the runs in low-activity hours, like 07.00 and noon.

[1]  This particular ETL run took some 6 minutes. See Appendix B for details.
[2] These figures are not measured; they are the author’s biased opinion.
[3] This includes sending the ‘run ETL’ signal and starting up the ETL job in the BI layer. It also includes detecting whether there is anything to load at all; this is important, since you do not want to aggregate nonexistent data or reload already up-to-date BI cubes.
