2014-01-25

The Data Warehouse goes YouTube



For some time now (6 years) I been pondering upon how to visualize the Data Warehouse.  Lately I been inspired by Derick Rethan’s fantastic OpenStreetMap year of edits.

I started by download zillions of software to be able to create ‘activity images’ of the Data Warehouse. I actually managed to create some png frames based on job logs but to do what I wanted I realised it would take just to much of my time  so I stopped. Then I remembered I had seen an animation of web logs some years ago I googled but didn’t find anything. Then I asked  Andreas if he knew about this animation. ‘Yes you sent me a link about this some years ago check out Logstalgia, I send you the link’.

After downloading another shitload of software I had Logstalgia up and running. It didn’t do what I had in mind, but the visualization Logstalgia does is cool, and t is so much easier to animate the logs this way. First I created a pong-event for each entry in the Data Warehouse job logs. I started at CET 18:00 20140115 and ended the next day at 18:00. One day in the Data Warehouse starts at 18:00 by some house cleaning after the backup snapshot is taken. At about 19:00 data loading for our Japanese plant starts. At midnight loading of the other Data Warehouses starts and job activity peaks around 04:00 - 05:00.
At about 08:00 in morning of 2014-01-15, the Data Warehouse and Qlikview users start coming in and the pattern changes. I think this video is worth viewing in it’s entirety.

I had to ‘time compress’ the jobs otherwise the video had been all too long. All jobs started is seen as a pong-ball in the video, the job bounces as a green success or a red failed.
After the jobs I inserted MySQL slow-query log. These entries are marked slow-query and these pong-balls  bounces with the query time in red like 10.05. Slow queries in a Data Warehouse are likely to happen. If you allow users in creating their own jobs and adhoc reporting joining large tables via Excel and Query you have lot’s of slow queries. As long as they do not affect other too much we do not mind.
Next I inserted MySQL general log. This was more problematic. I could not show all queries there were some six million queries that day, I picked all connects but they were too many, I removed all connects from jobs, leaving connects from Excel users, each connect pong-ball bounces with a blue connect.
Finally I inserted Qlikview session start from Qlikview log. This was a horror, the qlikview log is not meant to be parsed by a bash-script, I hand edited this log with the Kate-editor. The Qlikview pong-balls bounces with a green session.

Then I created the video by piping the ‘log-pong’ to ffmpeg:
logstalgia -s 8 --hide-paddle -g "DATAWAREHUSE,URI=mysql?$,1" -800x480 --output-ppm-stream - ~/logstalgia15.txt | ffmpeg -y -r 25 -f image2pipe -vcodec ppm -i -   -pix_fmt yuv420p -crf 1 -threads 8 -bf 0 logstalgia.mp4
Once again the result can be seen at http://youtu.be/bTY9IgJd8kA.
I did some mistakes along the way. First was to start a 18:00 this caused lots of extra work most notably with Qlikview since these logs are organised in dates. Second I used Bash-scripts instead of PHP or any other decent programming language.

This might look like it was easy, but for me as a complete ‘video newbie’, it was not easy, just to make this log-pong flow look reasonably well took a long long time. I’m very happy with the result.

At last the Music; at first I had Verdi’s Thriumpal March in mind but it was to short, so I picked Beethoven’s Final of Eroica instead, not a bad substitute.