2013-11-18

PHP parallel job scheduling - 4

In the previous post on Parallel processing of Workflows, I introduced job iterators as a tool for splitting a job (workflow step) into smaller pieces and processing them in parallel. In this post I describe iterator chunking and piggyback iterators.
If a job consists of repetitive tasks, like producing a mail to all Vendors or extracting information for all Parts from an external ERP system, you can create a job iterator and run a job once for each row in the iterator, like the example in the previous post. By cutting a job into smaller pieces and processing them in parallel you may cut down wall clock time. But when the setup time for such a ‘job-step’ is significant, it may take longer to process the steps individually even if you run them in parallel. We need to balance setup time loss against job-step cycle time to achieve minimum wall clock time for the job. Such production control in the real world can be very complex, with loads of parameters to consider; here most parameters are fixed or given, and the target, minimum wall clock time, is simple to monitor. Better still, we don’t have to find the very optimum either, ‘just about’ is more than good enough. It’s cheaper to buy more memory, bandwidth etc. than to squeeze the last microsecond out of your computers; just about will do fine.
An example: to extract data for all materials in a workshop from an ERP system you need to log in to the ERP system, run the extraction transaction(s) for each material and finally log out. Most likely logging in and out of the ERP system takes longer than extracting the data for one material. You also have to consider how many tasks you can throw at the source system at one time, and how long the source system will let a single extraction session run; there is always a transaction time limit, and when the limit expires you are kicked out.
These are the parameters we have to consider:
  • Setup time - log in and log out
  • Process time - extract information for one part
  • Number of simultaneous tasks - sessions you’re allowed to run in parallel
  • Session time limit - how long one session can live
The very reason for cutting up the job is to minimize wall clock time, so the optimal batch size is the one that lets you process the workload in the shortest time. In my example we have 3600 materials. Let’s say it takes 1 second to log in and out of the ERP system, and 1 second to run the extraction; run sequentially in one session the job takes about 3601 seconds. We would like to shrink this to a minimum, and we know the ERP administrator will be upset if we run more than 10 parallel sessions. If we run our 3600 materials in 10 streams, logging in and out for every material, it will take us some 720 seconds to complete the task. If we instead cut the 3600 materials into 10 batches of 360 materials each, it will only take us 361 seconds to process all materials. That’s some improvement! And considerably less strain on your IT infrastructure (including the ERP system).
Let’s suppose the ERP session time limit is 240 seconds. To keep it simple and be conservative we then create 20 batches with 180 materials in each; with only 10 parallel sessions allowed the batches run in two waves, so it takes us some 362 seconds to complete the task. This is an increase we can live with. Real life experience has shown me it is almost this simple to rightsize job-steps. As long as you do not saturate any part of the infrastructure, the parallel execution will be stable and predictable. Often you can rightsize by looking at the watch and running some test shots.
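As a rough illustration of this arithmetic, here is a minimal PHP sketch; estimateWallClock() is a hypothetical helper of mine, not part of the scheduler, and it ignores startup jitter and scheduler overhead:

    <?php
    // Rough wall clock estimate for processing $items job-steps in batches.
    // $setup      - seconds to log in and out (paid once per batch)
    // $perItem    - seconds of work per item
    // $batchSize  - items per batch
    // $maxStreams - parallel sessions we are allowed to run
    function estimateWallClock(int $items, float $setup, float $perItem,
                               int $batchSize, int $maxStreams): float
    {
        $batches = (int) ceil($items / $batchSize);    // number of batches
        $waves   = (int) ceil($batches / $maxStreams); // batches run in waves
        return $waves * ($setup + $batchSize * $perItem);
    }

    // The figures from the text: 3600 materials, 1 s setup, 1 s per material.
    echo estimateWallClock(3600, 1, 1, 3600, 1), "\n";  // 3601 - one sequential session
    echo estimateWallClock(3600, 1, 1, 1,    10), "\n"; // 720  - 10 streams, login per material
    echo estimateWallClock(3600, 1, 1, 360,  10), "\n"; // 361  - 10 batches of 360
    echo estimateWallClock(3600, 1, 1, 180,  10), "\n"; // 362  - 20 batches of 180 (240 s limit)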
Now to some real testing. It is always a challenge to measure parallel processes, since the wall clock time depends on so many things; my testing here is done in a production environment and my figures are ‘smoothed’ mean times from many tests.
If we beef up the example from the previous post and run 100 rows instead of 4, with 1 second of work for each row and 1 second of setup, it takes 104 (100 * 1 + 1 + 3) seconds to run the job sequentially:
  • 100 - number of sequentially processed job-steps
  • 1 - second for each work-step
  • 1 - second setup time
  • 3 - overhead for the job scheduler, database manager, network etc.
If we instead of 1 second of setup time for the job have 1 second of setup time for each row, the job duration increases to 204 seconds. This should come as no surprise: 100 * (1 + 1) + 4. This is to prepare for parallel execution of all job-steps.
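The same figures as a tiny PHP sketch of mine; the fixed 3 and 4 second overheads are the measured scheduler/database/network overheads quoted above, not something the snippet derives:

    <?php
    // The 100-row example, times in seconds.
    $rows  = 100;
    $work  = 1;   // work per row
    $setup = 1;   // setup time

    echo $rows * $work + $setup + 3, "\n";    // 104 - sequential, setup once per job
    echo $rows * ($work + $setup) + 4, "\n";  // 204 - sequential, setup once per row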
The modified example from the previous post, with setup time for each row, so we can run them all in parallel.
When I run all 100 rows in parallel it finishes in 4 seconds! 1 * (1 + 1) + 2.
However, if we are splitting up a job that starts sessions against an external ERP system, you will not be permitted to start 100 heavy duty parallel sessions; you may get away with 5 sessions though.
Running the same job with 5 parallel workers it finishes in 42 seconds: 20 * (1 + 1) + 2.
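To make the mechanics concrete, here is a minimal stand-alone PHP sketch of my own of running the 100 job-steps across 5 worker processes with plain pcntl_fork(); it is not the scheduler’s actual worker code. Note that every row still pays its own setup, which is what gives the 42 seconds:

    <?php
    // 100 job-steps, 1 s setup + 1 s work each, spread over 5 worker processes.
    // Requires the pcntl extension (CLI).
    $rows    = range(1, 100);
    $workers = 5;
    $shares  = array_chunk($rows, (int) ceil(count($rows) / $workers));

    $pids = [];
    foreach ($shares as $share) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("fork failed\n");
        }
        if ($pid === 0) {              // child: process its share of the rows
            foreach ($share as $row) {
                sleep(1);              // per-row setup (the ERP login/logout)
                sleep(1);              // per-row work
            }
            exit(0);
        }
        $pids[] = $pid;                // parent: remember the child
    }
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);  // wait for all workers to finish
    }
    // Wall clock: about 20 * (1 + 1) = 40 seconds plus overhead.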
So far the minimum wall clock time allowed by the 5-session limit is 42 seconds. But there is a problem with this setup: we run the setup for every row or job-step, which is like logging in and out of an external system once for each job-step. What we want is to chop the 100 rows up into 5 chunks, log in and out once for each chunk, and sequentially process each row within a chunk. That is more elegant and should be faster, and more importantly it gives a natural point in the process to express the login and logout logic for the external ERP system (at the beginning and end of each chunk). And this is what iterator chunking and piggyback iterators allow us to do. In this first example:
 
the job A <forevery> iterator is cut up into 5 chunks, which run the nested job B 5 times in parallel. In job B we have a piggyback iterator that is given one chunk each from the job A iterator. The piggyback iterator has an <init> to simulate the setup work; with <init> and <exit> you declare setup and cleanup logic for each chunk. This modified example runs in 23 seconds (20 * 1 + 1 + 2). The example uses a nested job; the ‘real’ purpose of nested jobs is to generate jobs on the fly, so the example is a bit more complicated than it has to be, but it shows how you can build very complex logic for each chunk. A more succinct way to utilize a piggyback iterator is to declare a <mypiggyback> iterator directly in the same job as the chunked iterator, like this:
 
Now we only have one job but there is a lot under the hood:
  1. the <forevery> creates an iterator with chunks of 20 rows each and starts 5 parallel workers
  2. the <piggyback> resolves its given chunk into rows and then runs the <init>
  3. the job <sql> is executed once for each row in the piggyback iterator
  4. the piggyback <exit> is executed
  5. when all chunks have finished the forevery <exit> is executed
This example runs in 22 seconds (20 * 1 + 1 + 1), slightly less overhead without nested jobs. We started with 104 seconds and ended up with 22 seconds in 5 parallel batches with setup logic for each batch; pretty neat I would say.
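For comparison, here is a sketch of the chunked variant in the same stand-alone PHP style, again my own illustration rather than the scheduler’s code: setup and cleanup now run once per chunk, which is roughly what the <forevery> chunking plus the piggyback <init>/<exit> buys us:

    <?php
    // Chunked iterator with per-chunk setup/cleanup.
    // Requires the pcntl extension (CLI).
    $rows   = range(1, 100);
    $chunks = array_chunk($rows, 20);      // 5 chunks of 20 rows

    $pids = [];
    foreach ($chunks as $chunk) {          // one worker per chunk, in parallel
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("fork failed\n");
        }
        if ($pid === 0) {
            sleep(1);                      // <init>: setup once per chunk (ERP login)
            foreach ($chunk as $row) {
                sleep(1);                  // the job <sql> step, once per row
            }
            // <exit>: per-chunk cleanup (ERP logout) would go here
            exit(0);
        }
        $pids[] = $pid;
    }
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);      // the forevery <exit> runs after all chunks
    }
    // Wall clock: about 20 * 1 + 1 = 21 seconds plus overhead - the 22 seconds measured above.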
And now we are almost at the end of this post. If you study the last example you will find it packed with complex logic expressed in relatively simple and succinct XML code. If you can express this in a more generalized, logical and succinct way in XML, please tell me.
There is one thing still itching: the final reduction step. Why do a reduction step ourselves when we have a potent database manager at hand? Why not throw the reduction onto MySQL:
This example still takes 22 seconds to run, but we do not have to take care of the result ourselves; we get it automagically in a nice database table.
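What ‘throwing the reduction onto MySQL’ boils down to, in a hypothetical sketch of mine (table and column names are made up, and in the real job this presumably happens through its <sql> steps): every worker inserts its per-row result into a table, and one aggregate query replaces the reduction step:

    <?php
    // Hypothetical illustration: workers write per-row results to MySQL and
    // a single query does the reduction. Table and column names are made up.
    $pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'secret');

    // Each worker runs something like this for every row it processes:
    $insert = $pdo->prepare('INSERT INTO job_result (row_id, value) VALUES (?, ?)');
    $insert->execute([42, 17]);   // example data only

    // When all chunks have finished, the reduction is a single SQL statement:
    $total = $pdo->query('SELECT SUM(value) FROM job_result')->fetchColumn();
    echo "Reduced result: $total\n";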
These corny examples illustrate the capability of the piggyback iterator well and give some insight into what can be achieved by connecting iterators, but they are far from real world examples; this example and this one, however, are real world. Here you have some more cool examples.
In the next thriller post I discuss templates and how to use them for parallel processing.
