For the last week I have been musing on ‘What’s in a return code, and what is it good for?’ It started with an innocent question:
‘How do I see which job in a workflow bombed out?’
‘The first job with a result equal to zero.’
‘There is no zero result, there are only ones and nulls.’
‘My job scheduler return codes are boolean: 1=success, 0=failure.’ That is not entirely true; the return code can also be NULL, which normally means not executed yet. I decided to take a look in the log:
The first job without a return code is trunc_dsaldo; up until trunc_dsaldo all jobs executed successfully (result=1). It turned out trunc_dsaldo was successfully bypassed. The boolean return code does not really allow for a third ‘bypassed’ condition, and the registration of a bypassed job is skipped altogether, so it is impossible to tell a bypassed job from a job that has not executed.
I like boolean return codes. Either a job executes successfully or it does not; it could not be simpler, were it not for the bypass condition. In this particular case it was the next job, dsaldo_read, that failed. Due to an infrastructure fuckup the job failed and the connection to the database log table was lost, so it could not register the failure. A very unlikely situation, but nevertheless it happened.
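The ambiguity can be shown with a small sketch. Python is used here purely for illustration, and the log layout is an assumption: only trunc_dsaldo and dsaldo_read are real job names, while load_prep and dsaldo_load are made up. With only 1, 0 and NULL available, ‘the first job that is not a success’ points at the bypassed job rather than the one that actually failed.

```python
# Hypothetical log rows in execution order: (job name, result).
# 1 = success, 0 = failure, None = NULL (nothing registered).
log = [
    ("load_prep", 1),        # hypothetical job name, executed successfully
    ("trunc_dsaldo", None),  # bypassed, so nothing was registered
    ("dsaldo_read", None),   # failed, but the log connection was lost
    ("dsaldo_load", None),   # hypothetical job name, never executed
]

def first_not_successful(rows):
    """Return the name of the first job whose result is not 1, or None."""
    for name, result in rows:
        if result != 1:
            return name
    return None

# Points at the bypassed job, not at the job that really failed.
print(first_not_successful(log))  # prints "trunc_dsaldo"
```

All three NULL rows look identical in the log, which is exactly why the question could not be answered from the return codes alone.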
What is a return code good for?
The most obvious purpose of a return code is to tell the result of a job. In this case it does not do that well. You can argue that the result of a bypassed job is unknown and should be left with a NULL return code. You can also say it was successfully bypassed and should qualify for a successful return code. Then again, a bypassed job can be seen as a failure. Right now I lean towards giving bypassed a unique non-zero return code while keeping the boolean type. This approach keeps the boolean simplicity, but it has the side effect of indicating that the job was successfully bypassed. I still do not know if this is a good thing or not. I have to scrutinise some unnecessarily complex code carefully before I make any changes. If I decide to change the code I will rewrite the job-related ‘return code’ code, since it has been subject to some patching over the years.
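A minimal sketch of the ‘unique non-zero code’ idea, assuming the value 2 for bypassed (the actual value is not decided anywhere above) and using Python only as an illustration language. The names ReturnCode and describe are hypothetical. The point is that a plain truth test still behaves like a boolean, while an explicit comparison can tell bypassed from success.

```python
from enum import IntEnum

class ReturnCode(IntEnum):
    FAILURE = 0
    SUCCESS = 1
    BYPASSED = 2  # assumed value; any unique non-zero number would do

def describe(code):
    """Turn a stored return code (or None for NULL) into a readable status."""
    if code is None:
        return "not executed"
    return ReturnCode(code).name.lower()

# A boolean-style test treats bypassed as true, just like success.
print(bool(ReturnCode.BYPASSED))  # prints "True"
print(bool(ReturnCode.FAILURE))   # prints "False"

# An explicit comparison can still tell the two apart.
print(describe(2))     # prints "bypassed"
print(describe(None))  # prints "not executed"
```

The side effect is visible in the first print: any code that only checks for non-zero will treat a bypassed job as a success, which may or may not be what is wanted.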
Another, and maybe the most important, function of a return code is testability: successor jobs need to test the outcome of a predecessor. That has already been taken care of. You can set up a job prereq that tests the outcome of a predecessor job, for example:
<job name='successor'...>
<prereq type='job' predecessor='previousJob' result='success' bypassed='ok'/>
The successor job will run if the execution of previousJob was a success or previousJob was bypassed.
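The rule in the prereq above could be sketched like this. It is not the scheduler's actual code: the function prereq_met is hypothetical, Python is used only for illustration, and the sketch assumes a bypassed predecessor can be recognised by a distinct code (2 here), which the log does not provide today.

```python
FAILURE, SUCCESS, BYPASSED = 0, 1, 2  # BYPASSED = 2 is an assumption

def prereq_met(predecessor_result, result="success", bypassed=None):
    """Decide whether a successor job may run.

    predecessor_result is the predecessor's return code, or None for NULL.
    result and bypassed mirror the attributes of the prereq tag.
    """
    if predecessor_result == BYPASSED:
        # A bypassed predecessor is accepted only when bypassed='ok' is set.
        return bypassed == "ok"
    if result == "success":
        return predecessor_result == SUCCESS
    return False  # other result values are not sketched here

print(prereq_met(SUCCESS, bypassed="ok"))   # prints "True"
print(prereq_met(BYPASSED, bypassed="ok"))  # prints "True"
print(prereq_met(BYPASSED))                 # prints "False"
print(prereq_met(None, bypassed="ok"))      # prints "False"
```

A NULL result never satisfies the prereq in this sketch, so a predecessor that has not executed yet keeps the successor waiting.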
But the job return code has not got the attention it deserves, which is a nice way of saying there is some odd logic and bad code lurking in my job scheduler concerning return codes. Maybe the return code should be a class. I’m not much of an OO fan, but return codes are important and maybe deserve a class of their own.