2013-12-26

Parallel processing of workflows - epilogue

I have written a number of posts on parallel execution of computer workflows. In the beginning I thought I would write one or two posts, but after the first two I wrote five more on parallel job scheduling with PHP. Parallel processing and parallel programming are becoming more and more important as the demand for information processing grows faster than the sheer speed of the single CPU. Single-threaded ETL processes can no longer keep up with the increasing volumes of information.
When I started to write on parallel execution I was of the opinion that parallel was a last resort, something you only use when single-threaded execution of ETL processes is too time consuming. The reasons were that parallel processing is complex, error prone, and it is hell to analyze and fix parallel workflows when they go wrong. But while writing these posts, creating examples and testing, I realised how simple it is to express parallel workflows with the Integration Tag Language (ITL); even the complex task of right-sizing batches of input and processing them against remote source systems is simple with piggyback job iterators. ITL also executes parallel workflows in a consistent and predictable way. And now it seems I have also exaggerated the problems of fixing failed parallel workflows. With a little planning up front you can most often write self-healing ETL processes, i.e. if they fail, rerun them until they end successfully. So why not parallelize workflows from the start?
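To illustrate the self-healing idea, here is a minimal sketch in PHP (the language of the earlier posts in the series). It is not ITL and the function names are my own; it simply assumes the job is written so that a rerun after a failure is safe, and keeps rerunning it until it ends successfully or a retry limit is reached.

<?php
// Hypothetical sketch, not ITL: rerun a failed job until it ends successfully.
// The job itself must be idempotent (restartable) for this to be safe.
function runUntilSuccess(callable $job, $maxAttempts = 5)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            $job();                        // execute the ETL job
            return true;                   // ended successfully
        } catch (Exception $e) {
            error_log("Attempt $attempt failed: " . $e->getMessage());
            sleep(min(60, 5 * $attempt));  // back off a little before the rerun
        }
    }
    return false;                          // give up after maxAttempts
}

// Example usage: the load step uses truncate-and-reload (or upsert) logic,
// so running it again after a failure does not duplicate data.
runUntilSuccess(function () {
    // extract, transform and load one batch here
});

The whole point of the "little bit of planning upfront" is the idempotency assumption in the comments: if a job can be rerun without side effects, the scheduler only needs to keep rerunning it.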
The information volumes will continue to grow faster than the speed of computers, and to keep up we have to parallelize processes. We will soon upgrade the host server from running four parallel threads to twenty threads, and if needed we can add twenty more. By today's standards that is many threads, but I do not think we have to wait many years to see far more parallel threads in servers. To make efficient use of all those threads, we have to parallelize the computer processes.
I believe we are stuck for the next ten years (at least) with present silicon technology for computer processors. The alternatives researchers are working on today, photonic and quantum processors, are nowhere near ready to power a server near us. Graphene may change this, allowing us to take a quantum leap towards quantum computers. While we are waiting for this brave new world, we will see a gradual increase in the number of parallel threads in our servers.


In 2008 I predicted Solid State Disks would have replaced Hard Disk Drives in servers by 2011. This didn't happen; a recession came in between. But if I had kept on building the hardware for my Data Warehouse, the servers would already have been equipped with SSDs, making them a bit faster. RAM is getting bigger and cheaper, allowing us to keep large portions of data in memory, which makes the servers much faster. But the Business Intelligence database models still in use, where data is stored in complex arrangements of dimension and fact tables, cripple the servers and keep them from reaching their full potential. We need simpler database models to fully exploit the possibilities of large-memory servers. These simpler database models will make it even more rewarding to go parallel.

In this series of blog posts I have described how I (or rather WE, since this would not have been possible without the cooperation of many and long discussions on how to do things) implemented and benefited from parallel workflows. I hope I will have the chance to write a post or two about how WE envision the future of Business Intelligence in the company, and maybe introduce the present BI crew.
