2013-10-29

Autumn cleaning

While removing some old blog posts I decided to change the appearance of my blog, to give it a more contemporary look. Quite snazzy, I would say.

2013-10-27

PHP parallel job scheduling - 2

Workflow steps can always be run sequentially, one step after another; we run steps in parallel to reduce wall clock time. Parallel execution of workflow steps makes the workflow much more complex and should be avoided when possible.
So far in my series on parallel execution of workflows I have described parallel execution of entire workflows and relatively simple parallel dependencies between workflow steps in my job scheduler AWAP (Advanced Workflow Administrations Processor), where workflows are called schedules and steps are called jobs.
Before you read this post you should enjoy the previous post. In this post I extend parallel execution a little by introducing nested jobs, which allow strict sequential execution within parallel running jobs, and submit schedule, which starts other workflows (a)synchronously.
You can achieve the correct result by sequential execution or by setting up prerequisites for individual jobs. But it is much easier (and cleaner) to nest jobs. This is the representation of Workflow 3 in my job scheduler:
If you compare this schedule with example 1 in part 1 you may notice I stripped away some default directives like parallel='no' on jobs C, D and E. So within job A, jobs C and D are executed sequentially, and job A is not done until C and D have finished.
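To make the semantics concrete, here is a minimal PHP sketch of what the scheduler effectively does for this schedule. It is hypothetical illustration code, not AWAP's actual implementation: A and B are forked to run in parallel, C and D run strictly sequentially inside A's process, and E starts only after both A and B have been reaped.

<?php
// Hypothetical sketch of Workflow 3 (not AWAP code): A and B in parallel,
// C then D nested inside A, E after both A and B.
function fork_job(callable $job): int {
    $pid = pcntl_fork();
    if ($pid === 0) { $job(); exit(0); }   // child: run the job, then exit
    return $pid;                           // parent: remember the child's pid
}
$pidA = fork_job(function () {             // job A contains nested jobs C and D
    print "C\n";                           // C must finish before D starts
    print "D\n";
});
$pidB = fork_job(function () {             // job B runs in parallel with A
    print "B\n";                           // on success B submits examplePP1
});
pcntl_waitpid($pidA, $statusA);            // A is not done until C and D have finished
pcntl_waitpid($pidB, $statusB);
print "E\n";                               // E starts when predecessors A and B are done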
This is a graphical representation of the schedule:
This picture is created by an ad hoc PHP program I wrote for fun that generates Dot code for Graphviz. I had planned to publish the code here as an example of bad PHP code, but when I looked at it, it is some 500 lines and that is a little too much.
This is the log extract from running this schedule:
If you look at the log and follow the execution of the jobs, you see jobs A and B start in parallel, then C and D execute sequentially; after D has finished, A finishes. When B is finished, the external schedule examplePP1 is submitted for execution and job E can start, since its predecessors A and B have successfully finished.
By combining the parallel techniques in these 'parallel' posts, you can create pretty much any parallel workflow pattern imaginable. But why then should parallel (or complex) workflows be avoided, as stated above? If you have ever tried to mend a crashed complex JCL mainframe workflow at 03:30 in the morning, or a complex SAP workflow at a standstill, you would not ask the question. But real life is complex and sometimes those complexities cannot be avoided, and then it is nice to express those complex execution dependencies in simple yet powerful rules.
In the next exciting post on parallel processing of workflows, I introduce iterators that allow me to cut up a too-big job into right-sized pieces and process those pieces in parallel: map and reduce, event-driven parallel execution with queue handling.

2013-10-23

PHP parallel job scheduling - 1

In the previous post, Parallel processing of workflows - 2, I described how to process entire workflows in parallel with my job scheduler. In this post I describe parallel processing of individual workflow steps within a workflow, and how to achieve parallelism in PHP.
I'm no expert on parallel processing in Linux, but PHP's 'parallel functionality' seems limited, primitive and awkward to me. I do not know how multithreading is implemented in PHP, or what PHP's parallel performance is like. While writing this I found this post. I have implemented parallelism in PHP on a massive scale with pcntl, and it works pretty well. Kore Nordmann has wrapped PHP process forking into a neat OO package. I would start from Kore's example if I had to do it again, but I did my forking a long time ago, probably 2006. Instead of showing my code you should look at Kore's code, which (I have not tested it) looks much cooler than my multi-process PHP code.
(While writing this post I had a new look at pcntl and it looks like it has come a long way from what I recall. When I get time I will reread the documentation in detail.)
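For reference, the basic pcntl pattern looks roughly like this. It is a minimal sketch, not my scheduler's code: fork one child process per job and have the parent wait for them all.

<?php
// Minimal pcntl sketch: run each job in its own process, then reap them all.
$jobs = ['jobA', 'jobB', 'jobC'];
$pids = [];
foreach ($jobs as $job) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {            // child: do the work, then exit
        print "running $job in pid " . getmypid() . "\n";
        exit(0);                       // the child must exit or it continues the loop
    }
    $pids[$pid] = $job;                // parent: remember the child
}
foreach ($pids as $pid => $job) {      // parent: wait for every child
    pcntl_waitpid($pid, $status);
    print "$job finished with status " . pcntl_wexitstatus($status) . "\n";
}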
In my PHP job scheduler a workflow is called a schedule and a workflow step a job. Schedules and jobs are defined in XML scripts. A schedule contains zero or more jobs, and jobs can be nested, so a job can in turn contain zero or more jobs. In this post I describe how I process individual jobs in parallel.
Jobs in schedules can always be run sequentially; the motivation for running jobs in parallel is to decrease wall clock time. PHP is probably not the most efficient language for parallel execution, but luckily that does not matter much for my job scheduler: we do not want to shrink seconds into hundredths of a second but rather hours into minutes, and in that perspective PHP is more than performant enough. It is rewarding to parallel process Business Intelligence integration processes; you can shorten execution time a lot.
If we look at an example, we have two directives that determine the processing order of the jobs:
  1. multithread - 'no' forces execution into one process; 'yes' runs each job in its own process.
  2. parallel - a hint; 'no' execute sequentially, 'yes' try to process in parallel.
If you run this schedule all jobs run sequentially in the same process. Extract from the log:
If we run the same schedule with multithread='yes', the jobs run in the same order but each within its own process:
When we model our schedule after the first workflow at the top (magnified in the first post):
By changing multithread to 'yes' (the default on Linux) and setting the parallel hint to 'yes' for those jobs that can run in parallel, we change the run order from strictly sequential to follow the first workflow at the top of this post.
Note that jobs 'd' and 'e' run in parallel after 'c'; this is enforced by 'c' having parallel='no', which not only means 'I run after predecessor jobs' but also 'successor jobs run after me'.
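Here is a rough sketch of how these two directives can be interpreted. It is hypothetical code, not AWAP's implementation; runJob() and the job array layout are stand-ins: parallel='yes' jobs are forked and left running, while a parallel='no' job first waits for all running predecessors, runs, and only then lets successors start.

<?php
// Hypothetical interpretation of multithread/parallel (not AWAP code).
function runJob(array $job) { print "run {$job['name']} in pid " . getmypid() . "\n"; }
$multithread = 'yes';                             // schedule-level directive
$schedule = [                                     // jobs with their parallel hints
    ['name' => 'a', 'parallel' => 'yes'],
    ['name' => 'b', 'parallel' => 'yes'],
    ['name' => 'c', 'parallel' => 'no'],          // barrier job
    ['name' => 'd', 'parallel' => 'yes'],
    ['name' => 'e', 'parallel' => 'yes'],
];
$running = [];                                    // pids of forked, unfinished jobs
foreach ($schedule as $job) {
    if ($job['parallel'] === 'no') {              // 'I run after predecessor jobs'
        foreach ($running as $pid) pcntl_waitpid($pid, $status);
        $running = [];
        runJob($job);                             // ...'successor jobs run after me'
    } elseif ($multithread === 'yes') {           // parallel='yes': own process
        $pid = pcntl_fork();
        if ($pid === 0) { runJob($job); exit(0); }
        $running[] = $pid;
    } else {
        runJob($job);                             // multithread='no': one process
    }
}
foreach ($running as $pid) pcntl_waitpid($pid, $status);  // reap remaining jobs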
This is a Dot representation of the schedule, visualised by Graphviz. (I wrote the Dot generator PHP script on the fly some years ago; now it would take me hours to understand it :( You don't miss the documentation until you need it :)
In the next post on parallel processing of workflows I introduce nested jobs to allow for more complex job scheduling.

2013-10-14

Parallel processing of workflows - 2

In the first post on parallel workflow scheduling I described the need for parallel execution of steps within a batch process. In this post I describe parallel processing in my job scheduler in general, and how to process entire workflows in parallel.
My Integration Tag Language (ITL) is based on XML, and using XML as a programming language has its challenges; I remember IBM did something bizarre in their ISPF environment at the beginning of the 90ties. Keep the XML 'light', i.e. do not try to solve complex logic in XML. Use XML to describe the overall workflow only; for the detailed logic use a 'control language'. Do not create your own 'control language', but rely on host languages, in my case PHP and SQL; since I store all data in MySQL it is very natural to utilize SQL as much as possible for data manipulation.
I insert the source data into the database as soon as possible and use the full power of SQL (and an accompanying macro language if needed) to transform and join the data. One significant advantage: working in my job scheduler you become proficient in SQL (and to a certain degree PHP and XML), not in some obscure 'control language' that transforms and joins all kinds of data sources and only updates the data store in the last step.
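In (hypothetical) PHP terms the pattern looks like this; the table and column names are made up for illustration: land the raw rows in a staging table as early as possible, then let MySQL do the transforming and joining in plain SQL.

<?php
// Hypothetical ELT sketch (made-up table/column names): insert first, transform in SQL.
$db = new mysqli('localhost', 'user', 'secret', 'dw');
$sourceRows = [['region' => 'North', 'amount' => 42.0]];   // stand-in source data
// 1. Land the raw source rows in a staging table as soon as possible
$stmt = $db->prepare('INSERT INTO stage_sales (region, amount) VALUES (?, ?)');
foreach ($sourceRows as $row) {
    $stmt->bind_param('sd', $row['region'], $row['amount']);
    $stmt->execute();
}
// 2. Transform and join with the full power of SQL, not a proprietary control language
$db->query(
    'INSERT INTO fact_sales (region_id, amount)
     SELECT r.id, s.amount
     FROM stage_sales s JOIN dim_region r ON r.name = s.region'
);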
The motivation for parallel processing of steps in workflows is to decrease wall clock time. PHP is probably not the most efficient language for parallel execution, but luckily that does not matter much for my job scheduler: we do not want to shrink seconds into hundredths of a second but rather hours into minutes, and in that perspective PHP is more than performant enough. It is rewarding to parallel process Business Intelligence integration processes. And by massive above I did not mean very many parallel threads, but a massive amount of work; you can easily beat the shit out of powerful servers with thirty or so concurrent ETL workflows, not to mention the stress these processes might cause in the source systems.
   
Parallel workflow processing in the job scheduler is done in several ways:
  • The job scheduler is cloned across servers; we run two separate instances of the job scheduler and it scales nicely.
  • Multiple workflows are scheduled simultaneously.
  • Multiple steps are submitted for execution in parallel.
  • Individual steps are split up into smaller chunks and executed in parallel.
Workflows are submitted for execution from Cron, via simple bash scripts:
Here is a bash script submitting workflows. First a workflow, jm2daily.xml, is executed. Then a number of workflows are submitted in parallel (nohup &). Lastly a number of workflows are submitted one by one in sequence. A workflow is called a schedule. A schedule consists of directives and workflow steps, which are called jobs. Normally schedules are independent of each other or have a simple sequential dependency.
Here we have a schedule with two interesting directives. The first, prereq, indicates that a SAP job with the name prefix 'AM152-SALES_STAT' must have executed successfully during the last 12 hours; otherwise the prereqwait directive sleeps for 300 seconds, evaluates the prereq again, and repeats for 10000 seconds or until 08.00.00, whichever happens first. You can set up any number of prereq checks to synchronize processing between schedules and other events in the network infrastructure.
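The waiting logic can be sketched in a few lines of PHP. This is hypothetical illustration code, with $prereqOk standing in for the actual prereq evaluation: re-evaluate every 300 seconds until the prerequisite is satisfied, 10000 seconds have elapsed, or the clock reaches 08.00.00, whichever happens first.

<?php
// Hypothetical prereq/prereqwait sketch: poll until satisfied, timed out,
// or past the deadline (assumes the deadline falls on the same day).
function waitForPrereq(callable $prereqOk, int $sleep = 300,
                       int $maxwait = 10000, string $deadline = '08:00:00'): bool {
    $start = time();
    while (!$prereqOk()) {
        if (time() - $start >= $maxwait) return false;   // waited 10000 seconds
        if (date('H:i:s') >= $deadline) return false;    // reached 08.00.00
        sleep($sleep);                                   // sleep 300 s, then re-evaluate
    }
    return true;
}
// Here $prereqOk would check that a SAP job with name prefix
// 'AM152-SALES_STAT' finished successfully within the last 12 hours.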
Up till now I have described the need for parallel processing of computer batch workflows, and parallel execution and synchronization of entire workflows. In the next post, PHP parallel job scheduling - 1, I will describe how I execute workflow steps in parallel, something I find more interesting.

2013-10-09

Parallel processing of workflows - 1

This is the first post in a series of posts about parallel job scheduling.

In computer operations batch or background processing, a workflow is a number of steps where each step is dependent on one or more of the preceding steps.

Normally you run these steps in sequence, one by one. From the viewpoint of a workflow, you check that all prerequisites are satisfied for the first step and then execute it; when the first step has successfully executed, you repeat the process for the next step, and so on until all steps are executed. In real life it is a bit more complicated: sometimes you want to skip a step, and you have to decide what actions to take if a step execution is unsuccessful. But basically you run all steps in sequence.
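As a minimal sketch, with hypothetical helpers jobPrereqsOk() and runJob() standing in for the real logic, the sequential baseline is just a loop:

<?php
// Minimal sketch of sequential workflow execution; the helpers are
// hypothetical stand-ins, not an actual scheduler's functions.
function jobPrereqsOk(array $step): bool { return true; }   // stub prerequisite check
function runJob(array $step): int { print "run {$step['name']}\n"; return 0; }
$workflowSteps = [['name' => 'step1'], ['name' => 'step2'], ['name' => 'step3']];
foreach ($workflowSteps as $step) {
    if (!jobPrereqsOk($step)) {                  // check all prerequisites first
        exit("prerequisites not satisfied for {$step['name']}\n");
    }
    $rc = runJob($step);                         // execute the step
    if ($rc !== 0) {                             // decide what to do on failure
        exit("step {$step['name']} failed, rc=$rc\n");
    }
}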

In the old days of single-CPU computers (or very few CPUs) this was all fine and dandy. If you needed to speed things up, all you could do was run a few independent schedules in parallel. And that was OK most of the time, since there was not so much data to process. Now this has changed: the data volumes processed in workflows have virtually exploded, and they grow much faster than the processing power of computers. This is especially true for Business Intelligence activities, where Extract, Transform and Load processes may handle very large data volumes. Volumes so large there is not enough time to process workflow steps in strict sequence. This leaves you with little choice: to get the job, or rather the workflow, done, steps must be processed in parallel one way or another. Today, when multi-CPU computers are a commodity, we have an opportunity to process workflow steps in parallel. The workflow engine must allow for simple and safe parallel processing of workflow steps. There is no use for a multi-CPU computer if you need to be a rocket scientist to set up parallel workflows.

Parallel processing in computers is complex. Executing a workflow is a high-level process, so we do not have to deal with the nitty-gritty of low-level parallel processing like setting up threads and semaphores, yet even at a high level parallelism is complex. It is fundamentally more complex to supervise two processes than one. (Those readers who have had the pleasure of attending to two three-year-olds in a playground know what I mean.) And all the logic needed for sequential processing of workflow steps, deciding when steps go right or wrong etc., must also work for steps run in parallel, which is significantly harder to manage than plain sequential processing.

There are some workflow step dependencies that are very common in batch processing, e.g.:

  1. Steps a and b can run in parallel; the subsequent step c is dependent on both of them, and steps d and e are dependent on step c.
  2. Steps a and b can run in parallel; c is dependent on a and is cut up into 'sub-steps', and step d is dependent on c. Finally, step e is dependent on b and c.

In this picture I have tried to depict the two workflows. In real life it is a bit more complicated (e.g. what do you do when steps go wrong?), but most batch workflow dependencies are combinations of these two workflows.

How do you describe these process patterns, logic and dependencies, and how do you process them? In the next posts I will describe how I dealt with parallel processing in my job scheduler and the accompanying ITL language.

 

2013-10-07

Twitter from the MySQL Data Warehouse

Some weeks ago I created a tweeting PHP script. Now I use this script for posting job activity from the Data Warehouse on Twitter. Since the Data Warehouse jobs are registered in MySQL databases, we had to sum up the figures and feed them into the PHP script twitter02.php. It turned out to be an easy task using the Integration Tag Language. I decided to test this with a status message showing how many jobs have run during the last 24 hours. Here is the schedule:
The first job 'crtSQLMsg' sums up all job activity and passes the result to the sqlconverter_CSV.php script, which converts the result table to a file:
Which looks like:
Now it only remains to post this file with the job 'twittaMsg'; if you study the job you see how the job status message is prefixed with #DataWarehouse.
If you follow @tooljn on Twitter you have seen this tweet as:
 
I'm very happy with the SQL converter functionality, which out of the box converted the result table into a readable message, and with the @tag GETSQLMSG, which slurps up the message in the subsequent twittaMsg job.
I end this post with the sqlconverter_CSV.php script:
<?php
/**
* SQL result converter - dynamically included in function {@link execSql()}
*
* This converter converts a SQL select result to a CSV file.
*
* This converter also accepts the following attributes:
*
* Syntax:  <sqlconverter name='sqlconverter_CSV.php' target='report0' headers='no' delim='space' enclosed=''/>
* 1 delim     field delimiter    default ';' semicolon
* 2 headers   field headers      default TRUE/yes
* 3 enclosed  field enclosed by  default "'" single quote
*
* Note delim ' ' doesn't work for unknown reason, so use 'space' instead. Bug?
*
* @see sqlconverter_default.php
* @author Lasse Johansson <lars.a.johansson@se.atlascopco.com>
* @version  1.0.0
* @package adac
* @subpackage sqlconverter
*/
$metafile = $sqltarget.'meta_';
$metasfx = '.TXT';
$targetsfx = 'CSV';
$fieldDelimiter = ';';        // default delimiter
$fieldEnclosed = "'";         // default enclosure
$headers = TRUE;              // default: write a header line
$t_headers = 'yes';           // initialise to avoid an undefined variable below
if(array_key_exists('delim',$xmlconverter))
  $fieldDelimiter = is_string($xmlconverter['delim']) ? $xmlconverter['delim'] : $xmlconverter['delim'][0]['value'];
if(array_key_exists('enclosed',$xmlconverter))
  $fieldEnclosed = is_string($xmlconverter['enclosed']) ? $xmlconverter['enclosed'] : $xmlconverter['enclosed'][0]['value'];
if(array_key_exists('headers',$xmlconverter))
  $t_headers = is_string($xmlconverter['headers']) ? $xmlconverter['headers'] : $xmlconverter['headers'][0]['value'];
if ($t_headers == 'no') $headers = FALSE;
if ($fieldDelimiter == 'space') $fieldDelimiter = ' ';
if ($fieldEnclosed == 'space') $fieldEnclosed = ' ';   // was "$$fieldEnclosed", a typo
$sqllog->logit('Note',"Enter sqlconverter_CSV.php using target=$sqltarget");
if(is_numeric(substr($sqltarget, -1,1))) {
        // The target name already ends in a digit: use it as-is
        $metafile = "$metafile$metasfx";
        $sqltarget = "$sqltarget.$targetsfx";
} else {
        // Otherwise find the first free sequence number for the target and meta files
        clearstatcache();
        for ($x = 0; ; $x++){
                if (!file_exists("$metafile$x$metasfx")){
                        $metafile = "$metafile$x$metasfx";
                        $sqltarget = "$sqltarget$x.$targetsfx";
                        break;
                }
        }
}
// Append if the target file already exists, otherwise create it
if (file_exists($sqltarget)) $fpc_flag = 'FILE_APPEND';
else $fpc_flag = NULL;
$report = '';
if ($fpc_flag == NULL and $headers){
        $meta = '';
        while ($finfo = $result->fetch_field()) {
          $report .= $finfo->name."$fieldDelimiter";
          $meta .= sprintf("Name:    %s;", $finfo->name);
          $meta .= sprintf("OrgName:    %s;", $finfo->orgname);
          $meta .= sprintf("Table:    %s;", $finfo->table);
          $meta .= sprintf("OrgTable:    %s;", $finfo->orgtable);
          $meta .= sprintf("Default:    %s;", $finfo->def);
          $meta .= sprintf("MaxLen:    %d;", $finfo->max_length);
          $meta .= sprintf("Len:    %d;", $finfo->length);
          $meta .= sprintf("Charsetnr:    %d;", $finfo->charsetnr);
          $meta .= sprintf("Flags:    %d;", $finfo->flags);
          $meta .= sprintf("Type:    %d;", $finfo->type);
          $meta .= sprintf("Decimals:    %d;", $finfo->decimals);
          $meta .= "\n";
        }  
        file_put_contents($metafile,$meta);
        unset($meta);
        $report .= "\n";
}
//  Here comes the working code: write one delimited line per result row
while ($row = $result->fetch_row()) {
  foreach($row as &$fld) {$fld = "$fieldEnclosed"."$fld"."$fieldEnclosed";}
  unset($fld);        // break the reference left by the foreach
  $rowstr = implode("$fieldDelimiter", $row);
  $log->logit('Note',"$rowstr");
  $report .= $rowstr."\n";
}
if($fpc_flag == 'FILE_APPEND') file_put_contents($sqltarget,$report,FILE_APPEND);
else file_put_contents($sqltarget,$report);
unset($report);
$sqllog->logit('Note',"Exit sqlconverter_CSV.php");