12D: Parallel processing of workflows

In the first post on parallel workflow scheduling I described the need for parallel execution of steps within a batch process. In this post I describe parallel processing in my job scheduler in general, and how workflows are processed in parallel.

My Integration Tag Language (ITL) is based on XML, and using XML as a programming language has its challenges; I remember IBM did something bizarre along those lines in their ISPF environment at the beginning of the nineties. My rule is to keep the XML ‘light’, i.e. do not try to solve complex logic in XML. Use XML to describe the overall workflow only; for the detailed logic use a ‘control language’. Do not create your own ‘control language’, but rely on host languages, in my case PHP and SQL. Since I store all data in MySQL, it’s very natural to use SQL as much as possible for data manipulation.

I insert the source data into the database as soon as possible and use the full power of SQL (and an accompanying macro language if needed) to transform and join the data. One significant advantage is that by working in my job scheduler you become proficient in SQL (and to a certain degree PHP and XML), and not in some obscure ‘control language’ that is used to transform and join all kinds of data sources and only in the last step updates the data store.

The motivation for processing workflow steps in parallel is to decrease wall clock time. PHP is probably not the most efficient language for parallel execution, but luckily that does not matter much for my job scheduler: we do not want to shrink seconds into hundredths of a second, but rather hours into minutes, and from that perspective PHP is more than performant enough. It’s rewarding to parallel process Business Intelligence integration processes. By ‘massive’ I don’t mean very many parallel threads, but a massive amount of work; you can easily beat the shit out of powerful servers with thirty or so concurrent ETL workflows, not to mention the stress these processes might cause in the source system.

Parallel workflow processing in the job scheduler is done in several ways:

The job scheduler is cloned across servers; we run two separate instances of the job scheduler, and it scales nicely.
Multiple workflows are scheduled simultaneously.
Multiple steps are submitted for execution in parallel.
Individual steps are split up in smaller chunks and executed in parallel.

Workflows are submitted for execution from cron via simple bash scripts.

A typical submit script works like this. First a workflow, jm2daily.xml, is executed. Then a number of workflows are submitted in parallel (nohup &). Lastly, a number of workflows are submitted one by one in sequence. A workflow is called a schedule. A schedule consists of directives and workflow steps, which are called jobs. Normally schedules are independent of each other or have a simple sequential dependence.
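The submit pattern described above can be sketched roughly as follows. This is a minimal sketch, not my production script: run_schedule is a hypothetical stand-in for the real job scheduler command, every file name except jm2daily.xml is made up, and the wait before the sequential part is an assumption about how the phases are separated.

```bash
#!/usr/bin/env bash
# Sketch only: run_schedule is a hypothetical stand-in for the real
# job scheduler invocation. With a real external command the parallel
# part would typically be written as:  nohup <command> <schedule> &
LOG="${LOG:-$(mktemp)}"

run_schedule() {
    # Replace this body with the actual scheduler command.
    echo "start $1" >> "$LOG"
    echo "done $1" >> "$LOG"
}

# 1. Run the first workflow and wait for it to finish.
run_schedule jm2daily.xml

# 2. Submit several workflows in parallel as background jobs.
#    (hypothetical schedule names)
for s in parallel_a.xml parallel_b.xml parallel_c.xml; do
    run_schedule "$s" &
done
wait  # assumption: block until all parallel workflows have finished

# 3. Run the remaining workflows one at a time, in sequence.
for s in seq_a.xml seq_b.xml; do
    run_schedule "$s"
done
```

A script like this would be started from a crontab entry; whether the sequential workflows should wait for the parallel ones depends on the dependencies between the schedules.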

A schedule can carry directives, and two interesting ones are prereq and prereqwait. In my example schedule, prereq states that a SAP job with the name prefix ‘AM152-SALES_STAT’ must have executed successfully during the last 12 hours; otherwise the prereqwait directive sleeps for 300 seconds and then evaluates the prereq again, repeating for 10000 seconds or until 08.00.00, whichever happens first. You can set up any number of prereq checks to synchronize processing between schedules and other events in the network infrastructure.
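The wait logic of prereq and prereqwait can be illustrated outside ITL with a small bash sketch. This is not how the job scheduler implements it; wait_for_prereq and its arguments are hypothetical names, and the deadline is assumed to be passed in as epoch seconds.

```bash
#!/usr/bin/env bash
# Sketch only: mimics the prereq/prereqwait behaviour described above.
# wait_for_prereq and its arguments are hypothetical, not ITL syntax.
#
# usage: wait_for_prereq INTERVAL MAX_WAIT DEADLINE_EPOCH CHECK_CMD [ARGS...]
wait_for_prereq() {
    local interval=$1 max_wait=$2 deadline=$3
    shift 3
    local start now
    start=$(date +%s)
    while true; do
        # The prereq is met when the check command exits with status 0.
        if "$@"; then
            return 0
        fi
        now=$(date +%s)
        # Give up when the maximum wait has elapsed or the deadline
        # has passed, whichever happens first.
        if (( now - start >= max_wait || now >= deadline )); then
            return 1
        fi
        sleep "$interval"
    done
}

# Example with the values from the text (sap_job_ok is a hypothetical
# check command; date -d requires GNU date):
#   wait_for_prereq 300 10000 "$(date -d '08:00:00' +%s)" sap_job_ok
```

The check command is evaluated once before any sleeping, so a prereq that is already satisfied does not delay the schedule at all.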

Up till now I have described the need for parallel processing of computer batch workflows, and the parallel execution and synchronization of workflows. In the next post, PHP parallel job scheduling-1, I will describe how I execute workflow steps in parallel, something I find more interesting.

2013-10-14

Parallel processing of workflows - 2
