This is the first post in a series of posts about parallel job scheduling.
In computer operations, batch, or background processing, a workflow is a series of steps where each step depends on one or more of the preceding steps.
Normally you run these steps in sequence, one by one. From the viewpoint of a workflow, you check that all prerequisites for the first step are satisfied and then execute the step; when the first step has finished successfully, you repeat the process for the next step, and so on until all steps are executed. In real life it is a bit more complicated: sometimes you want to skip a step, and you have to decide what actions to take if a step execution is unsuccessful. But basically you run all steps in sequence.
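The sequential loop described above can be sketched as follows. This is a minimal illustration, not the scheduler's actual code; the `Step` class and its `skip` flag are hypothetical names chosen for the example.

```python
# Sketch of sequential workflow execution: run each step in order,
# optionally skip a step, and abort the workflow on the first failure.
class Step:
    def __init__(self, name, action, skip=False):
        self.name = name
        self.action = action   # callable returning True on success
        self.skip = skip       # an operator may decide to skip a step

def run_sequential(steps):
    """Execute steps one by one; return False on the first failure."""
    for step in steps:
        if step.skip:
            continue           # bypass this step, move on to the next
        if not step.action():
            return False       # step failed: stop the whole workflow
    return True
```

The key point is that the loop only ever supervises one step at a time, which is what makes the sequential case simple.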
In the old days of single CPU computers (or computers with very few CPUs) this was all fine and dandy. If you needed to speed things up, all you could do was run a few independent schedules in parallel. That was OK most of the time, since there was not that much data to process. Now this has changed: the data volumes processed in workflows have virtually exploded, and they grow much faster than the processing power of computers. This is especially true for Business Intelligence activities; Extract, Transform and Load processes may handle very large data volumes. Volumes so large there is not enough time to process workflow steps in strict sequence. This leaves you with little choice: in order to get the job, or rather the workflow, done, steps must be processed in parallel one way or another. Today, when multi CPU computers are a commodity, we have an opportunity to process workflow steps in parallel. The workflow engine must allow for simple and safe parallel processing of workflow steps. There is no use for a multi CPU computer if you need to be a rocket scientist to set up parallel processing workflows.
Parallel processing in computers is complex. Executing a workflow is a high level process; we do not have to deal with the nitty gritty of low level parallel processing like setting up threads and semaphores, but even at a high level, parallelism is complex. It is fundamentally more complex to supervise two processes than one. (Those readers who have had the pleasure of attending to two three-year-olds in the playground know what I mean.) All the logic needed for sequential processing of workflow steps, deciding when steps go right or wrong etc., must also work for steps run in parallel, which is significantly harder to manage than plain sequential processing.
There are some workflow step dependencies that are very common in batch processing, e.g.
- Steps a and b can run in parallel; the subsequent step c is dependent on both of them, and steps d and e are dependent on step c.
- Steps a and b can run in parallel; c is dependent on a and cut up in ‘sub-steps’, and step d is dependent on c. Finally, step e is dependent on b and c.
In this picture I have tried to depict the two workflows. In real life it is a bit more complicated, e.g. what do you do when a step goes wrong? But most batch workflow dependencies are combinations of these two patterns.
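The two example workflows can also be written down as plain "step → predecessors" maps and driven by a simple scheduler that repeatedly runs every step whose predecessors are done. This is only an illustrative sketch (ignoring error handling and the sub-step partitioning of c), not the engine discussed in this series; the dictionaries mirror the bullet list above.

```python
from concurrent.futures import ThreadPoolExecutor

# The two dependency patterns from the text, as step -> predecessors maps.
workflow_1 = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["c"]}
workflow_2 = {"a": [], "b": [], "c": ["a"], "d": ["c"], "e": ["b", "c"]}

def run_parallel(deps, run_step):
    """Run every step whose predecessors have finished; repeat until done."""
    done = set()
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            ready = [s for s in deps
                     if s not in done and all(p in done for p in deps[s])]
            # steps in `ready` have no unmet dependencies: run them in parallel
            for step, _ in zip(ready, pool.map(run_step, ready)):
                done.add(step)
    return done
```

In workflow_1, a and b run concurrently in the first wave, c alone in the second, and d and e concurrently in the third; the dependency map alone determines the parallelism.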
How do you describe these process patterns, logic, and dependencies, and how do you process them? In the next posts, I will describe how I dealt with parallel processing in my job scheduler and the accompanying ITL language.