In my previous post Job Scheduling with PHP - 1 I described the basics about job scheduling in general. Here I write about the background and the underlying design principles for my job scheduling system written in PHP.
I created my job scheduling system for a Business Intelligence BI System. An important process in any BI system is Extract Transform & Load (ETL). The ETL process is background job scheduling. Good scheduling is essential for BI systems. I didn't have any funds for my BI project so I had to use existing (scrapped) hardware and Free Software. I mentioned in my previous post I was (still am) not very impressed by Job Scheduling systems on the market. I am a programmer by trade and I had used scripting languages for controlling processes before. I looked around for software to use, and Linux and MySQL was easy ones to pick. A programming language was harder to find, first I looked for ReXX my scripting language par preference. But I couldn't find ReXX in Linux, the languages I found were PERL and PHP. PERL was the better language, my impression of PHP was a tool for simple Web apps. But I couldn't resist the challenge to use PHP for advanced background processing. I did the first ETL controller in two crude simple PHP scripts. scriptS.php (S as in start), and scriptF.php (F as in function) where I stored the functions used in scriptS.php. All ETL processes were hard coded and everything was very primitive, but it worked pretty well. But as the BI system grow, it became clear my two scripts were a dead end, my hard coded scripts were not scalable, I had to go back to the drawing board and begin from scratch again.
By now (2005-2006) Object Orientation had arrived in PHP, I'm not fond of OO programming, but a logging subsystem is perfect to objectify, so I learned PHP OO by creating a logger class before I started design version 2 of my job scheduler. Next I disconnected all configuration and job definitions from the PHP scripts. I wanted a strict but extendible syntax that was easy to parse. This was an easy pick; XML was the obvious choice, and PHP had a very simple and capable enough XML parser simpleXML. Now I had to define the environment , the scheduling and the jobs.
The environment defines execution elements like databases, programs, directories etc. all is defined in XML scripts, the main context script points to other XML scripts defining the total execution. Here you see a <sap> tag pointing to XML script defining a SAP system. The <prereq> are Boolean statements that must be true.
The schedule XML script defines a chain of jobs that is scheduled for execution with or without dependencies. Variant define startup parameters and the first job point to an XML script 'exp8_generate_iterators'.
This job XML Script defines the execution of series of SQL statements.
Having laid a sound foundation with three well defined entities (context, schedule & job) and a logger, I also needed an execution plan for my scheduling system. I decided to have divide execution into three phases :
- Read and parse all XML scripts and syntax check. And check prerequisites i.e. access to input files and subsystems like MySQL , predecessor conditions etc.
- Create the execution environment, it is a directory structure where all things from the execution of a schedule are stored, e.g. log files
- The actual execution of a schedule.
Phase 1 creates an execution tree which basically is the parsed XML files into a PHP array structure. This tree is passed to phase 2 where more 'things' are added to the executions tree as the execution environment is created. This environment is then passed to phase 3, which then executes the schedule job by job and records the outcome of each job into the execution tree.
This is a schematic view of a schedule execution (the picture is old and some entities have been renamed).
I often use my own Garbage In - Garbage out design pattern which means all not recognized is treated as noise and defaults are non destructive. This design pattern is both code friendly , the code do not have to consider unknown parameters etc, and user friendly you do not need to know all details, try the software you will not destroy anything by misspell a parameter or leave something out.
But you have to give defaults some thoughts, they should be non destructive and sensible. This is actually quite hard. I’m sure you many times have seen idiotic defaults.
Another design pattern I often use is something I invented years ago when I did large systems in assembler language. I posit all goes wrong use boolean FALSE return code and return as soon as I find something wrong, this gives submodules often with many FALSE returns and only one TRUE or non-FALSE return at the end. If you are careful and design your submodules or functions with minimal side effects, you can often avoid cleanup code. The caller either deals with the false return code or exit himself with the FALSE return code. This design pattern gives flat, efficient and robust programs.
I already stated I prefer a simple boolean return code structure TRUE or FALSE . Either an action is a success or not, black or white if you wish, no gray zones. You probably have seen other return code schemes. For reason of the branch on count assembler instruction very many return code schemes is based on zero=success, 4=remark,8=warning,12=serious warning etc. This code scheme is not only confusing, error prone it is also out of sync with modern computer languages where zero=Boolean FALSE and everything else is Boolean TRUE. Multi value return code schemes may also force you to write code like if not ok then ok else not ok or even more horrid constructs.
For a job scheduling system return codes are very important, jobs are dependent of predecessor jobs, errors must be fixed in successor jobs, you must be able to set up guards that kicks in when things go wrong. A simple return code structure is a boon not only in the code of the of the job scheduling system itself, but also to the job scheduling. In my system a schedule can only successfully execute or fail execution and the same goes for the jobs or almost.
In job scheduling you have to deal with the situation where a job is bumped over due to preconditions not met. Is that a failure or a success? IBM’s Job Control Language treat that as a success. This may seem absurd, but the opposite may be equally absurd, it depends entirely from what angle you entering the problem of a bumped over job. Even considering non executed jobs is a simplification, there are more things to consider when deciding the outcome of a job.
My defaults - all jobs must execute successfully, bumped over jobs are considered a success the schedule execution is intercepted when a failure is detected covers more than 95% of all job planning as long as you run one job after another single threaded, when you run jobs in parallel things get more complicated.
Remember I wrote Job Scheduling is important for the BI ETL process? BI systems contains large amounts of data and imports large amount of data via ETL processes, parallel processing to cut ETL execution time short is essential for BI systems. My job scheduler deals with parallel processing in basically two ways one by parallel process jobs and cut jobs up in smaller pieces/chunks. This I have described in when fast in not enough , in parallel processing of workflows I described parallel execution in detail.
Now parallel processing and multi threading is crude and awkward in PHP, but it can be done.
This and more I will try to write some more posts about. I end this post with the execution of an empty schedule. This is how this empty.xml schedule file looks:
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
Remember what I wrote about defaults, non destructive and sensible, here we have quite some defaults to fill in. Here we go:
What can be more appropriate for a default job than to TTY type contents of the little red box in the middle of the log display.
Note the first line; The Job Scheduler still start with the scriptS.php module.
I hope I will be able to continue to write about my PHP job scheduler.
Some examples can be found here .