I have written a nice automatic PHP Curl procedure to scrape currency rates from an internal web page. It works beautifully, and has done so since a year. Now I’m on a
road trip to southern France
, and this morning I got an emergency alert from the routine ‘URL inaccessible’. WTF has happened? It took me some minutes to realize the web server is not responding. I missed this in the design, what to do if this web server is down. It is a serious error in financial reporting not to have valid currency rates, so the entire reporting system/Business Intelligence is stopped! But I didn’t worried too much, I alerted the operations and hoped they could restart this server in minutes. I called the operations early morning from Wiesbaden in Germany they restarted the web server and I got a new error msg from my routine ‘the URL cannot be found’. I misinterpreted the error msg. I still thought there was a problem with the server. I find it hard to do debugging in 200 km/h, just being able to connect over a 3G modem on Autobahn is amazing. I think I have a record in speed-debugging. Later in an hotel room I found out the URL in the webpage was changed and the structure of the webpage was also changed. I realized I would not be able to fix this on my vacation. I discussed the situation with the only remaining colleague at the office (this is vacation period in Sweden). We decided the best thing to do was copy the previous month currency rates, and correct the figure after my vacation.
If I had done a proper risk analysis up front this situation would probably not have happened. But if you always do everything according to the book, not much is done. It is a fine balance how much effort should be spent on design up front, if the resources are scarce you have to compromise. And to be honest I seldom overdo the design work, I often start the build phase a bit premature, most of the time it works well but this time it came back and bit me, at the most inconvenient time. Murphy’s law applies as usual.
I will redesign the system so it always works with the most appropriate currency rates entered (the latest), and some self-correcting mechanism when the correct rates are entered. It shouldn’t be that hard to do.
I end this with the Curl PHP code. This is my first and only piece of Curl code, so I have used examples found on the web, some of the code I do not understand well myself,
but it works beautifully when the web server is up, and when it’s down! I got an alert telling me the servers down, what more can you expect.
but it works beautifully when the web server is up, and when it’s down! I got an alert telling me the servers down, what more can you expect.
#!/home/tooljn/PHP5.3/usr/local/bin/php
<?php
/**
* @package dummy
* @ignore
*/
/**
* This code scrapes
a webpage and downloads an excel file linked from the webpage
a webpage and downloads an excel file linked from the webpage
* //$link = $webhost.'/'.'Rates)';
* @return bool TRUE
if all is hunky dory
if all is hunky dory
*/
$scriptName = $job['pgm'];
$scriptParm = $job['parm'];
$curlAgent = $context['_xlate']['C_application'];
$eop =$job['eop'][0]['value'];
if (!$eop) $eop=15;
//default
//default
$period =date('Y-m',mktime(0, 0, 0, date("m"), date("d")-$eop,
date("Y")));
date("Y")));
$webhost =$job['webhost'][0]['value'];
$weburl =$job['weburl'][0]['value'];
$log->enter('Info',"Enter
script=$scriptName, job=$jobName, Curl agent=$curlAgent, period=$period");
script=$scriptName, job=$jobName, Curl agent=$curlAgent, period=$period");
$link =$webhost.'/'.$weburl;
$result =get_web_page($link,$curlAgent);
file_put_contents('WEBPAGE.txt',$result);
$excelname ='EXC'.$period.' Published';
$link =get_excel_link($excelname,$result,$period);
if ($link == NULL) {
$log->enter('Error',"No
currency rate Excel link for period=$period, end of period=$eop");
currency rate Excel link for period=$period, end of period=$eop");
$_RESULT = FALSE;
return FALSE;
}
$link =$webhost.'/'.$link;
$sufix =substr($link,strrpos($link,'.'));
$excelfilename ='excel'.$sufix;
get_web_file($link,$excelfilename,$curlAgent);
file_put_contents('EXCELFILE',serialize(array(
array('EXCEL' => $excelname, 'WORKBOOK' => $excelfilename,'PERIOD'=> $period, 'LINK'=>$link ))));
$_RESULT = TRUE;
// indicate successful execution
// indicate successful execution
return(TRUE);
/**
* This function
extracts the latest link to monthly Exchange rates Excel workbook
extracts the latest link to monthly Exchange rates Excel workbook
*/
function
get_excel_link($excelname,$htmltxt,$period){
get_excel_link($excelname,$htmltxt,$period){
$linktab = get_link_tab($htmltxt);
return $linktab["$excelname"];
}
/**
* This function
extracts html href and creates an associative array Excelname => Url
extracts html href and creates an associative array Excelname => Url
*/
function
get_link_tab($htmltxt){
get_link_tab($htmltxt){
$exceltab = array();
$regexp2 = "alt=\"[^\"]*";
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU",
$htmltxt, $matches, PREG_SET_ORDER)) {
$htmltxt, $matches, PREG_SET_ORDER)) {
foreach($matches
as $match) {
as $match) {
preg_match("/$regexp2/",
$match[3], $matches2);
$match[3], $matches2);
//
$match[2] = link address
$match[2] = link address
//
$match[3] = link text
$match[3] = link text
$excelna
= explode('.',(substr($matches2[0],5)));
= explode('.',(substr($matches2[0],5)));
$exceltab[$excelna[0]]
= $match[2];
= $match[2];
}
}
//
var_dump($exceltab);
var_dump($exceltab);
return $exceltab;
}
/**
* Get a web file
(HTML, XHTML, XML, image, etc.) from a URL. Return an
(HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing
the HTTP server response header fields and content.
the HTTP server response header fields and content.
*/
function
get_web_file($url,$fileName,$curlAgent){
get_web_file($url,$fileName,$curlAgent){
//print "get_web_file $url,$fileName,$curlAgent \n";
$fp = fopen ("$fileName", 'w+');
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_TIMEOUT, 50);
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// curl_setopt($ch,CURLOPT_USERAGENT, "$curlAgent");
curl_exec($ch);
curl_close($ch);
fclose($fp);
}
/**
* Get a web page from
a URL. Return an array with the contents of the array
a URL. Return an array with the contents of the array
*/
function
get_web_page($url,$curlAgent){
get_web_page($url,$curlAgent){
$options = array(
CURLOPT_RETURNTRANSFER
=> true, // return web page
=> true, // return web page
CURLOPT_HEADER
=> false, // don't return headers
=> false, // don't return headers
CURLOPT_FOLLOWLOCATION
=> true, //
follow redirects
=> true, //
follow redirects
CURLOPT_ENCODING
=>
"", // handle all encodings
=>
"", // handle all encodings
CURLOPT_USERAGENT
=>
"$curlAgent", // who am i
=>
"$curlAgent", // who am i
CURLOPT_AUTOREFERER =>
true, //
set referer on redirect
true, //
set referer on redirect
CURLOPT_CONNECTTIMEOUT
=> 120, //
timeout on connect
=> 120, //
timeout on connect
CURLOPT_TIMEOUT
=>
120, //
timeout on response
=>
120, //
timeout on response
CURLOPT_MAXREDIRS
=>
10, //
stop after 10 redirects
=>
10, //
stop after 10 redirects
CURLOPT_POSTFIELDS
=>
"",
=>
"",
CURLOPT_VERBOSE
=> false,
=> false,
);
$ch = curl_init($url);
curl_setopt_array($ch, $options);
$content = curl_exec($ch);
$err = curl_errno($ch);
$errmsg = curl_error($ch);
$header = curl_getinfo($ch);
curl_close( $ch );
$header['errno'] = $err;
$header['errmsg'] = $errmsg;
$header['content'] = $content;
//var_dump($header);
return $content;
}
No comments:
Post a Comment