2013-12-31

Next Year

Another year has passed, a new one is to come.

For me, 2013 was a busy year; I moved to Corporate IT as Lead Information Architect. In a way I'm back where I started forty years ago, at the Corporate IT department. At this time last year I had planned my calendar for 2013; with the new job I had to change my plans. Last year I was very involved in Atlas Copco Industrial Technique's Business Intelligence system. That job was operational and I had to plan accordingly. Now my job is more strategic and I have to plan for that: more meetings, less 'real' work. I had forgotten that meetings are very hard work. One or two one-hour meetings a day is OK, but participating in telcos the entire day while keeping focus is exhausting; I'm tired at the end of the day, just to start all over the next day. But I do not complain; my problems are very much luxury problems, lots of hard work. These are still hard times and many wish they had my problems.
Last year I thought 2013 would be a year of Business Intelligence, and I was right. I have done, and to a certain extent still do, work with BI, but to that I added IT architecture and Master Data Management.

One Business Area within Atlas Copco, Mining and Rock Excavation Technique, has started a master data management project. Next year I will join this project and try to help make this system global for all of Atlas Copco.
Next year I hope to work with Business Intelligence at group level. First I need to understand all the different BI initiatives and systems in Atlas Copco, since there is little coordinated activity in this area today. There are strategies and policies in place for the Atlas Copco Group, but this is nothing I will write about until I have a better understanding of them and I know where we stand.
I have failed to make people in the business realise how much Master Data Management can benefit from Business Intelligence; I will do my best to convince my colleagues next year. This is something I feel strongly about, and it combines my two most important missions for next year, BI and MDM. As an IT architect I will participate in work on the Data Design Authority and a reference architecture for Master Data Management.

On a more personal level I will finally start learning JavaScript and Node.js. I have already started to build web services; all this is new to me. I have mucked around with some web programming before, but nothing serious. Then I hope to take up either Perl 6 or D programming; both languages interest me a lot. I've mucked around a bit with these languages too, but nothing serious.
I'm overly optimistic about what I can achieve in the future, and the plans for next year are no exception. Things planned as simple often turn out to be complicated, and things planned as complex turn out to be very hard; it is seldom the other way around. New high-priority tasks will show up.

My New Year's promise for 2014.

I will cut down on blogging. I write far too many blog posts; I aim for twenty posts next year. But I use this blog partly for documentation, so it depends a bit on how much I do that I consider worth documenting.

2013-12-26

Parallel processing of workflows - epilogue

I have written a number of posts on parallel execution of computer workflows; in the beginning I thought I would write one or two posts, but after the first two I wrote five more on parallel job scheduling with PHP. Parallel processing and parallel programming are becoming more and more important as the demand for information processing grows faster than the sheer speed of a single CPU. Single-threaded ETL processes can no longer keep up with the increasing volumes of information.
When I started to write on parallel execution I was of the opinion that parallel was a last resort, something you only use when single-threaded execution of ETL processes is too time consuming. The reasons were that parallel processing is complex, error prone, and hell to analyze and fix when parallel workflows go wrong. But while writing these posts, creating examples and testing, I realised how simple it is to express parallel workflows with the Integration Tag Language (ITL); even the complex process of right-sizing batches of input and processing them against remote source systems is simple with piggyback job iterators. ITL also executes parallel workflows in a consistent and predictable way. And now it seems I have also exaggerated the problems of fixing failed parallel workflows. With a little bit of planning upfront you can most often write self-healing ETL processes, i.e. if they fail you rerun them until they end successfully. So why not parallelize workflows from the start?
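To illustrate what I mean by self-healing, here is a minimal sketch in plain PHP; run_etl_job and the numbers are hypothetical, and the real workflows are of course expressed in ITL, not like this:

<?php
// Minimal sketch of a self-healing rerun loop. run_etl_job() is a
// hypothetical stand-in for an idempotent ETL job that returns true on
// success and false on failure; the retry counts and waits are made up.
function run_until_success(callable $job, $maxAttempts = 5, $waitSeconds = 300) {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        if ($job()) {
            return true;                       // the job ended successfully
        }
        error_log("ETL job failed, attempt $attempt of $maxAttempts, rerunning...");
        sleep($waitSeconds);                   // give the source system time to recover
    }
    return false;                              // give up and alert the operator
}
// Usage: run_until_success('run_etl_job');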
Information volumes will continue to grow faster than the speed of computers, and to be able to keep up we have to parallelize processes. We will soon upgrade the host server from running four parallel threads to twenty, and if needed we can add twenty more. By today's standards that is many threads, but I do not think we have to wait many years to see many more parallel threads in servers. To make efficient use of all those threads, we have to parallelize the computer processes.
I believe we are stuck for the next ten years (at least) with present silicon technology for computer processors. The alternatives researchers are working on today, photonic and quantum processors, are nowhere near powering a server near us. Graphene may change this, allowing us to take a quantum leap towards quantum computers. While we are waiting for this brave new world, we will see a gradual increase in the number of parallel threads in our servers.


In 2008 I predicted Solid State Disks would have replaced Hard Disk Drives in servers by 2011. This didn't happen; a recession came in between. But if I had kept on building the hardware for my Data Warehouse, the servers would already be equipped with SSDs, making them a bit faster. RAM is getting bigger and cheaper, allowing us to keep large portions of data in memory, which makes the servers much faster. But the Business Intelligence database models still in use, where data is stored in complex arrangements of dimension and fact tables, prevent the servers from reaching their full potential. We need simpler database models to fully exploit the possibilities of large-memory servers. These simpler database models will make it even more rewarding to go parallel.

In this series of blog posts I have described how I (or rather WE, since this would not have been possible without the cooperation of many, and long discussions on how to do things) implemented and benefited from parallel workflows. I hope I will have the chance to write a post or two about how WE envision the future of Business Intelligence in the company. And maybe present the present BI crew.

2013-12-25

Hendrick's Gin and Christmas soda.

Yes, it worked. Yesterday's rearrangement of my twittering job resulted in this:


The tweet is truncated, but nicely so

In my post yesterday I also wrote about Swedish Christmas traditions and some not so well known connections to the USA. I forgot julmust. During Christmas, Coca Cola sales drop some 50% due to our habit of drinking julmust at Christmas. If I recall this right, Coca Cola tried to overthrow julmust as the non-alcoholic Christmas drink with massive Christmas campaigns at the turn of the century, but that backfired; many of us thought it was hostile to our traditions. Coca Cola then tried to minimize the damage by launching a 'there is room for two drinks on the Christmas table' campaign, but it didn't help; not many Swedes drink Coke during Christmas. I drink a lot of Coke, but not during Christmas; I, and many with me, think it bad taste to serve Coke during Christmas. The initial campaigns made us stop drinking Coke during Christmas; before, we simply drank julmust. These days Coca Cola produces its own julmust without much ado. We were many who could have told Coca Cola not to try to replace julmust; if you are a big multinational company in the consumer market you should be very respectful towards national traditions, that's good business. At the time of their first campaign Coca Cola didn't listen to the market; if they had, they would have stopped the campaign at the initial stage. Today I guess they have a department for damage control which carefully listens to social media like Facebook and Twitter. Not only has social media given consumers more direct power, it has also given producers a chance to better respond to the consumers' will, making the world a better place.

Last month my boss called me over to his desk and asked me if I could spot something unusual.
I noticed he smelled, but I answered no, I can't. Then he asked me: can't you smell cat pee? It is my new laptop. It's horrible. The new laptop had a distinct and rather heavy smell of cat urine. Later that day my boss told me he was not alone, and showed me some web posts complaining about the smell from new laptops and asking questions like 'do you have cats in the factory?'. After some very bad advice from service technicians, Dell quickly replied everywhere: "The smell is not related to cat urine or any other type of biological contaminant, nor is it a health hazard. The odor was a result of a faulty manufacturing process that has been changed. We will replace all faulty parts". (Dell sent over a service guy who fixed my boss's laptop.)
My son, who studies marketing strategies at the university, said Dell handled the problem well; they acted according to the textbook. He also said they surely have a department for damage control scanning the web. These days a 'cat pee' problem can snowball in social media; you have to act fast and take the problem seriously, otherwise you can ultimately risk the existence of the company.



Gin / Julmust in American disguise

Yesterday I got a 'drink recipe' for a Christmas drink, Gin & Julmust. I poured myself a big one when I got home last night; it tasted awful and I didn't finish the drink. A waste of both ingredients.

2013-12-24

Yule Tide twitter

I have recently created a job that twitters. This is what it looks like:


Now, if you want to tweet from other <schedules> it might be convenient to create a job template, so you do not have to repeat yourself more than necessary, like this:


Now you can rewrite the schedule above like this:


Instead of declaring the entire job we include the job template twitter.xml and just declare the @tag TWEET.
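The mechanism is essentially template substitution. As a rough, hypothetical sketch in PHP (not the actual ITL implementation), resolving the template could look something like this:

<?php
// Hypothetical sketch of the template idea, not the actual ITL code:
// load the shared job template and substitute the @tag declared in the schedule.
$template = file_get_contents('twitter.xml');   // the shared twitter job template
$tweet    = 'Merry Christmas from the Data Warehouse!';
$job      = str_replace('@TWEET', $tweet, $template);
// The resolved job is then executed like any other job in the schedule.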
And this is what I have done with my twittering job:


But there is a little snag: since I include a file in the message, I have to resolve the file first in the MESSAGE tag and then append it to the TWEET tag. Let's see on Christmas Day how it looks.


Today is Christmas Eve, and that's the day we celebrate here in Sweden. Tonight jultomten (the Yule Gnome/Santa/Father Christmas) comes with presents to all on the nice list, and most of us party together with the family (I have baked a cake for 18 persons for tonight's party). But first we all watch the Donald Duck Christmas special cartoon. The American (Coca Cola) Father Christmas image actually has a Swedish origin. On the other hand, the Swedish poem we Swedes most associate with winter and Christmas, 'Tomten' (the Gnome) by Victor Rydberg, is inspired by E.A. Poe's 'The Raven'! Completely different story though. By the way, did you know that Poe first contemplated a parrot, but found it ridiculous and replaced it with a raven?

I wish you all a Merry Christmas:))

2013-12-23

Twitter automation

Some time ago I wrote a post about twittering from the Data Warehouse. I actually have had some problems with the twittering: I failed to run it via Cron, and I didn't have the time or interest to analyze the problem. But yesterday I did. It is almost always a problem with the 'env' when you can run a process in the normal shell but not from Cron, so I put a var_dump(system('env')) in my PHP script and compared, and sure enough, the Cron environment was missing a proxy definition. So this morning the first truly automatic tweet was heard from the Data Warehouse, which you can follow on https://twitter.com/@tooljn for the time being; we should probably find a better name for it.
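For the curious, this is roughly how you can compare the two environments; a minimal sketch, not the actual script, and the output path is just an example:

<?php
// Minimal sketch, not the actual script: dump the environment the PHP
// process sees, once when started from the shell and once from Cron,
// then diff the two files.
$label = isset($argv[1]) ? $argv[1] : 'unknown';   // e.g. "shell" or "cron"
$env   = shell_exec('env | sort');                 // same trick as var_dump(system('env'))
file_put_contents("/tmp/env_$label.txt", $env);
// Then: diff /tmp/env_shell.txt /tmp/env_cron.txt
// In my case the Cron run was missing the proxy definition (e.g. http_proxy).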
Here is the process scheduled by Cron:


There is a problem with this tweet: the number of MySQL queries is from the day before. I take the status from MySQL at 17:30, when the backup is run, with this shell script:

So the query figure in the tweet is a bit unsynchronized, but who cares :)
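For reference, the query counter itself comes straight from the MySQL status variables. A minimal PHP sketch (hypothetical host and credentials, and assuming the Questions counter is what is snapshotted) could look like this:

<?php
// Minimal sketch, not the actual backup script: read the server-wide
// query counter from MySQL. Host and credentials are hypothetical.
$db     = new mysqli('localhost', 'dw_user', 'secret');
$result = $db->query("SHOW GLOBAL STATUS LIKE 'Questions'");
$row    = $result->fetch_assoc();
// 'Questions' counts all statements sent to the server since startup;
// snapshot it once a day and subtract yesterday's value to get queries per day.
echo "Queries so far: " . $row['Value'] . "\n";
$db->close();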

2013-12-21

Business Intelligence, the future and beyond

Just a few days after I wrote this post I received a mail promoting a new BI system from one of the big guys.

Excerpt from the admail.

The first two items are almost identical to two sentences I wrote in a document describing my Data Warehouse some ten years ago. I didn't write anything about 'harnessed' though, and I used the phrase 'system centric' as opposed to my user-centric Data Warehouse. And as for the last sentence, I have more than once been baffled by design patterns disrupting the user experience when loading data.
I think this new, modern Business Intelligence approach is the way to go; it's the future, and it's my old Data Warehouse design.

2013-12-15

Some Data Warehouse

I am writing a series of posts about parallel processing of computer background workflows in general, and more specifically about how I parallel process workflows with my Integration Tag Language (ITL). ITL is not a real computer language; it doesn't generate executable code, it only processes a parse tree. Nonetheless it passes for a language in a duck test. I can use ITL to describe and execute processes, and I'm happy with that. Some computer systems I have created have been labeled 'not real' by others; I don't mind. Once I created the fastest search engine there was in the IBM Z-server environment, and it was labeled (by a competitor) not a real search engine. It was a column-based database with bitmap indexes and massively parallel search; it kicked ass against contenders, and I was happy with that. I have created a Data Warehouse, and it has also been labeled not a real Data Warehouse; it beats the crap out of competitors though, and I'm happy with that. With my XML-based Integration Tag Language I can describe and execute complex parallel workflows more easily than with anything else I have seen, and I'm happy with that too. Interestingly, the host language I use, PHP, is often labeled not a real language. PHP is a perfect match for ITL; it's so not real.

Not a real Data Warehouse

My not-real Data Warehouse has a rather long history; it started around 1995 when I was working as a Business Intelligence analyst. I got hold of a 4GB hard disk for my PC and realised I could fit a BI storage on this huge disk. At that time I had read a lot about the spintronic revolution that was about to come, with gigantic disks, RAM and super processors at ridiculously low prices. I started to play with the idea of building a future BI system, where you do not have to squeeze data onto the disks, where large portions of data could reside in RAM and be processed in parallel by many processors.

Simple tables, no stupefying multi dimensional extended bogus schema

The first thing I did was to get rid of the traditional data models, the normalized OLTP and the denormalized OLAP models. I was thinking denormalization on a grand scale: each report should have its own table, from which users could slice, dice and pivot as much as they liked in their own spreadsheet (Excel) applications. I called these tables Business Query Sets. Since in the future we would have oceans of disk space, we could afford to build databases as the normal user perceives them: as tables, not multi-dimensional extended snowflakes or stars or whatever the traditional BI storages are called. Have you ever heard a user ask for reports in extended cube format?
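Just to illustrate the idea (hypothetical table and column names, nothing from the actual Data Warehouse): a Business Query Set is simply one wide, pre-joined table per report.

<?php
// Illustration only, hypothetical table and column names: a Business Query
// Set is one wide, denormalized table per report; users just SELECT * and
// slice, dice and pivot it in Excel.
$sql = "
  CREATE TABLE bqs_sales_per_customer AS
  SELECT o.order_date,
         c.customer_name,
         c.country,
         p.product_name,
         o.quantity,
         o.quantity * p.unit_price AS revenue
  FROM   orders o
  JOIN   customers c ON c.customer_id = o.customer_id
  JOIN   products  p ON p.product_id  = o.product_id";
$db = new mysqli('localhost', 'dw_user', 'secret', 'dw');
$db->query($sql);   // rebuilt from scratch at every load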

Simple extraction, table based full loads

In the future data access will be so fast you can extract entire tables, no more delta loads, I thought. No special extractor code in the source system, just simple SQL on tables. Delta loads are hell; the only thing you can be sure of is that delta loads break. No matter what you have been told, delta loads fail. In the rare situations where you need delta loads, you should have good procedures in place to mend broken delta loads, otherwise you will have a corrupt system. As for having special extractor code in each source system, I would probably not be allowed to put any extractor code into the source systems, and you lose control spreading out code all over. Special extractor code was never a clever idea anyway.
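A minimal sketch of the full-load idea (hypothetical tables and connections, not the actual ETL code): truncate the staging table and copy the entire source table over, so a failed run can simply be rerun from the top.

<?php
// Minimal sketch of a table-based full load, hypothetical tables and
// connections: truncate and reload the whole staging table, no delta
// bookkeeping, so a failed run can simply be rerun.
$source = new mysqli('source-host', 'reader', 'secret', 'erp');
$target = new mysqli('dw-host', 'loader', 'secret', 'dw');

$target->query("TRUNCATE TABLE stage_customers");
$rows = $source->query("SELECT customer_id, customer_name, country FROM customers");

while ($row = $rows->fetch_assoc()) {
    $id      = (int) $row['customer_id'];
    $name    = $target->real_escape_string($row['customer_name']);
    $country = $target->real_escape_string($row['country']);
    // Row-by-row insert just for illustration; bulk loading is of course faster.
    $target->query("INSERT INTO stage_customers VALUES ($id, '$name', '$country')");
}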

Tired of waiting for reports?

It's fast as hell to select * from table. 'In the future I can trade disk space for speed,' I reasoned. And lots of indexes, it's just disk space. All frequently used data will reside in huge RAM caches. I also envisioned a Data Warehouse in every LAN; since hardware will be cheap, you can have BI servers in every LAN. A LAN database server is faster than a WAN application server.

A user Business Intelligence App, not a Business Intelligence System

Not only did I skip the traditional database models, I also scrapped the System. I wanted to build my Data Warehouse around the users, not build a Data Warehouse System. I wanted to invite users to explore the data with tools of their own choice. I didn't want to build a BI System the user had to log in to, with funny 'data explorer' tools straitjacketed onto them.

2001 a start

In 2001 I could start building my futuristic Data Warehouse. It wasn't the start I had wished for. Actually, no one in the company believed it was possible to build a Data Warehouse my way. I could not get a sponsor; it was only the fact that I was the CIO, with my own (small) budget, that let me start building my Data Warehouse together with one member of my group. I had to start on a shoestring budget. We used scrapped desktops for servers, with a new large 100GB disk, 1GB RAM and an extra network card. I only used free software. Payback time for the first version was one month. I soon began to design and build my own hardware, and since I used low-cost components I could afford large RAM pools and keep much of the frequently used data in memory. From a humble start with one user, the Data Warehouse now produces six to twelve million queries (including ETL) a day, has about 500 users and feeds other applications with data, including the BI tool QlikView. From the start in 2001 until May 2013 the system has only been down at three power outages. When I moved to corporate IT, the managers of the product company migrated the Data Warehouse from my two-of-everything hardware design to single hardware, so now we have to take down the Data Warehouse once a year for service. Twelve years of continuous operation; not many systems started in 2001 have a track record like that.
For not being a real Data Warehouse, it’s quite some Data Warehouse.
Today I'm very happy about the lack of funds at the start. It forced me to think in directions I probably would not have otherwise. Another piece of happiness: the colleague I started with was completely ignorant (and unimpressed) of Business Intelligence database theories. The few times I made the database design more complex he said 'I will not do that, I create simple tables, it is faster and the users want tables'. Then I measured the different approaches, and his simpler models were always better.

Been there, done that

These days Business Intelligence vendors talk a lot about Big Data, hardware accelerators, in-memory databases, near-online reporting etc., and I can say with real pride 'been there, done that'. I do not say other BI apps are not real; they are real, some really good.
I have actually heard a BI sales representative refer to my Data Warehouse as not real. My Data Warehouse has been called a simple Excel app. I have put a lot of hard work into my Data Warehouse: countless nights and weekends of coding, testing and measuring. I'm not happy when the appreciation of all that hard work is 'a simple Excel app'. On the other hand, the feeling of knowing 'not many people know what I know about Business Intelligence applications' makes me happy. And now, seeing the big guys catching up on ideas I conceived some twenty years ago makes me really happy.