2013-10-23

PHP parallel job scheduling - 1

In the previous post, parallel processing on workflows-2, I described how to process entire workflows in parallel with my job scheduler. In this post I describe parallel processing of individual workflow steps within a workflow, and how to achieve parallelism in PHP.
I’m no expert on parallel processing in Linux, but PHP’s ‘parallel functionality’ seems limited, primitive and awkward to me. I do not know how multithreading is implemented in PHP, or how well PHP performs in parallel. While writing this I found this post. I have implemented parallelism in PHP on a massive scale with pcntl, and it works pretty well. Note that pcntl forks processes rather than threads, so what I loosely call multithreading here is really multiprocessing. Kore Nordmann has wrapped PHP process forking into a neat OO package. I would start from Kore’s example if I had to do it again, but I wrote my forking code a long time ago, probably in 2006. Instead of showing my code I suggest you look at Kore’s code, which (I have not tested it) looks much cooler than my multi-threading PHP code.
(While writing this post I had a new look at pcntl, and it looks like it has come a long way from what I recall. When I get time I will reread the documentation in detail.)
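For readers who have not seen pcntl forking, here is a minimal sketch of the general idea. It is neither my scheduler code nor Kore’s package; the helper name `runInParallel` and the two demo tasks are made up for illustration, and it assumes the PHP CLI with the pcntl extension loaded.

```php
<?php
// Hypothetical helper (not the job scheduler's real code): run each task in
// its own child process and collect the exit codes.
// Assumes PHP CLI with the pcntl extension available.
function runInParallel(array $tasks): array
{
    $pids = [];
    foreach ($tasks as $name => $task) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException("fork failed for task $name");
        }
        if ($pid === 0) {
            // Child process: run the task and exit with its integer result.
            exit((int) $task());
        }
        // Parent process: remember which child runs which task.
        $pids[$pid] = $name;
    }

    $exitCodes = [];
    while ($pids) {
        // Block until any child terminates.
        $pid = pcntl_wait($status);
        if ($pid === -1) {
            break; // no children left to wait for
        }
        if (isset($pids[$pid])) {
            $exitCodes[$pids[$pid]] = pcntl_wifexited($status)
                ? pcntl_wexitstatus($status)
                : -1; // killed by a signal or otherwise abnormal
            unset($pids[$pid]);
        }
    }
    return $exitCodes;
}

// Demo: two made-up tasks that sleep briefly and return an exit code.
$codes = runInParallel([
    'd' => function () { usleep(200000); return 0; },
    'e' => function () { usleep(100000); return 3; },
]);
ksort($codes);
print_r($codes);
```

The parent only learns each child’s exit code this way; anything richer (results, log lines) has to travel through files, pipes or a database, which is one reason the forking approach feels primitive.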
In my PHP job scheduler a workflow is called a schedule and a workflow step is called a job. Schedules and jobs are defined in XML scripts. A schedule contains zero or more jobs, and jobs can be nested, so a job can itself contain zero or more jobs. In this post I describe how I process individual jobs in parallel.
Jobs in schedules can always be run sequentially; the motivation for processing jobs in parallel is to decrease wall clock time. PHP is probably not the most efficient language for parallel execution. Luckily that does not matter much for my job scheduler, since we do not want to shrink seconds into hundredths of a second but rather hours into minutes, and from that perspective PHP is more than performant enough. It’s rewarding to parallel process Business Intelligence integration processes; you can shorten execution time a lot.
If we look at an example, we have two directives that determine the processing order of the jobs:
  1. multithread - ‘no’ forces execution into one process; with ‘yes’ each job runs in its own process.
  2. parallel - a hint: ‘no’ means execute sequentially, ‘yes’ means try to process in parallel.
If you run this schedule, all jobs run sequentially in the same process. Extract from the log:
If we run the same schedule with multithread=’yes’, the jobs run in the same order, but each within its own process:
When we model our schema after the first workflow at the top (magnified in the first post):
By changing multithread to ‘yes’ (the default on Linux) and setting the parallel hint to ‘yes’ for those jobs that can run in parallel, we changed the run order from strictly sequential to follow the first workflow at the top of this post.
Note that jobs ‘d’ and ‘e’ run in parallel after ‘c’. This is enforced by parallel=’no’ on ‘c’, which not only means ‘I run after predecessor jobs’ but also ‘successor jobs run after me’.
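The ordering rule can be sketched as grouping the job list into ‘waves’: consecutive parallel=’yes’ jobs share a wave, and a parallel=’no’ job gets a wave of its own and so acts as a barrier. This is only an illustration of the rule as described above, not the scheduler’s actual implementation; the function name `planWaves` and the hints given to jobs ‘a’ and ‘b’ are assumptions.

```php
<?php
// Hypothetical sketch of the ordering rule, not the scheduler's real code.
// Input: job name => parallel hint ('yes' or 'no'), in schedule order.
// Output: list of waves; jobs within a wave may run concurrently,
// and the waves themselves run one after another.
function planWaves(array $jobs): array
{
    $waves = [];
    $current = [];
    foreach ($jobs as $name => $parallel) {
        if ($parallel === 'yes') {
            // Parallel jobs join the currently open wave.
            $current[] = $name;
            continue;
        }
        // A parallel='no' job runs after all predecessors...
        if ($current) {
            $waves[] = $current;
            $current = [];
        }
        // ...and alone, so that its successors run after it.
        $waves[] = [$name];
    }
    if ($current) {
        $waves[] = $current;
    }
    return $waves;
}

// Assumed hints for illustration; only c='no' and d, e='yes' come from the text.
$waves = planWaves([
    'a' => 'yes',
    'b' => 'yes',
    'c' => 'no',
    'd' => 'yes',
    'e' => 'yes',
]);
echo json_encode($waves), "\n";
```

With these hints ‘c’ ends up in a wave of its own between the wave holding ‘a’ and ‘b’ and the wave holding ‘d’ and ‘e’.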
This is a Dot representation of the schedule, visualised by graphviz. (I wrote the Dot generator PHP script on the fly some years ago; now it will take me hours to understand it :( You don’t miss the documentation until you need it :)
In the next post on parallel processing of workflows I introduce nested jobs to allow for more complex job scheduling.