2013-09-23

Splitting XML reports, zipping & CIFS-copying them to Windows

The other day I had a meeting with the project manager for an application that receives files from the Data Warehouse. The procedure for sending files to this application is interesting. First we create a complex XML file, then we zip it and finally we connect to a Windows server share and copy the zip file there. Three jobs: create the file, zip it and send it. The first job creates the XML file by executing a SQL SELECT and handing the result over to the converter sqlconverter_xml02.php.
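
Roughly, the first job boils down to something like the PHP sketch below; the connection details, table and the convert_to_xml() stub are made-up stand-ins for what the scheduler and sqlconverter_xml02.php actually do:

<?php
// Illustration only: roughly what job 1 boils down to. The connection details,
// table and the convert_to_xml() stub are hypothetical stand-ins for what the
// scheduler and sqlconverter_xml02.php actually do.
$db  = new mysqli('dwhost', 'dwuser', 'secret', 'dw');
$res = $db->query('SELECT item_id, item_name, price FROM item_message');
if ($res === FALSE) {
    exit("SELECT failed: " . $db->error . "\n");
}
convert_to_xml($res, 'report0.xml');
$db->close();

// Stand-in for sqlconverter_xml02.php: write each row as one XML element.
function convert_to_xml(mysqli_result $res, $target) {
    $out = fopen($target, 'w');
    fwrite($out, "<report>\n");
    while ($row = $res->fetch_assoc()) {
        fwrite($out, "  <row>");
        foreach ($row as $col => $val) {
            fwrite($out, "<$col>" . htmlspecialchars($val) . "</$col>");
        }
        fwrite($out, "</row>\n");
    }
    fwrite($out, "</report>\n");
    fclose($out);
}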


The second job zips the output file report0.xml.

The third job ships the zip file ItemMessage.xml to the Windows share:

The job navettiCopy2WIN.xml mounts the Windows share via CIFS, copies the zip file to the share and finally unmounts the Windows share.

If you follow the init actions you can see how it is done. Note that I put in a sleep before the unmount; in the past we have had some timing problems. If something goes wrong we execute exit failure.
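
The init and exit actions amount to a mount / copy / sleep / unmount sequence. A rough PHP sketch of the same idea is shown below, with a made-up share name, mount point, credentials file and zip name; the real job of course does this through the scheduler's actions, not through exec():

<?php
// Hypothetical sketch of the mount - copy - sleep - unmount sequence.
// Share, mount point, credentials file and zip name are made-up examples.
$share = '//winsrv01/reports';
$mnt   = '/mnt/winshare';
$zip   = '/data/out/ItemMessage.zip';

$cmds = array(
    "mount -t cifs $share $mnt -o credentials=/etc/win.cred",
    "cp $zip $mnt/",
    "sleep 5",        // give the share a moment to settle before unmounting
    "umount $mnt",
);
foreach ($cmds as $cmd) {
    exec($cmd, $output, $rc);
    if ($rc !== 0) {
        // corresponds to the job's 'exit failure'
        fwrite(STDERR, "command failed ($rc): $cmd\n");
        exit(1);
    }
}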

XML is the wrong tool for data interchange.

This job schedule is complex and I was not happy when the project manager of the receiving application told me 'The file is too large, can you please chop it up into smaller chunks'. It is probably their XML parser that cannot cope with large input. Most XML encoding/decoding I have seen is done in memory, so very large XML files tend to exhaust RAM. Actually, encoding large data streams as XML is a bad idea, since XML encoding is verbose and parsing inefficient. I had problems with my own SQL to XML conversion. First I based it on PHP SimpleXML, but when I exhausted the memory I took the very bad decision to create my own XML coder sqlconverter_xml02.php (see above); it's a piece of crap but it works. I should have said no to large XML data interchange; Google Protocol Buffers, CSV or even JSON are better alternatives.
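
If I wrote the converter today I would stream the XML to disk instead of building it in memory. A minimal sketch using PHP's XMLWriter (this is not what sqlconverter_xml02.php does, and the element names are made up) could look like this:

<?php
// Minimal sketch: stream rows to an XML file with XMLWriter instead of
// building the whole document in memory. Element names are made up.
function rows_to_xml_file($rows, $target) {
    $w = new XMLWriter();
    $w->openURI($target);            // write straight to the file
    $w->startDocument('1.0', 'UTF-8');
    $w->startElement('ItemMessage');
    foreach ($rows as $row) {        // $rows: any iterable of column => value rows
        $w->startElement('row');
        foreach ($row as $col => $val) {
            $w->writeElement($col, (string)$val);
        }
        $w->endElement();            // </row>
    }
    $w->endElement();                // </ItemMessage>
    $w->endDocument();
    $w->flush();                     // XMLWriter only buffers small amounts
}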

Job Iterator to the rescue.

Anyway, to make the other application work I had to split the XML file. Due to the XML mess I thought the split would be problematic, but after some thinking I realized I could use a job iterator, and it turned out to be very easy to chop the file. The XML converter sqlconverter_xml02.php does not append the result to an existing file but creates a new sequentially numbered file. If you look below you see the target for the sqlconverter is ItemMessage. sqlconverter_xml02.php will name the first result file ItemMessage0.xml, the next ItemMessage1.xml etc. Subsequently the zipfiles job sweeps them all up with the glob wildcard 'ItemMessage*'. This test was done in minutes and worked right out of the box. In production the iterator would be SQL generated, based on the actual result size.

First I created a job iterator <forevery> with an offset and limit for each chunk, then I added the LIMIT clause to the SELECT statement and changed the SQL target to ItemMessage. (This SQL converter adds a sequence number to the filename if the file already exists.)
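
To give an idea of how the production variant could work, here is a hypothetical sketch that builds the offset/limit list for the <forevery> iterator from the actual row count; the connection details, table name and chunk size are made up:

<?php
// Hypothetical sketch: derive offset/limit pairs for the job iterator from the
// actual result size. Connection details, table name and chunk size are made up.
$chunkSize = 50000;
$db  = new mysqli('dwhost', 'dwuser', 'secret', 'dw');
$res = $db->query('SELECT COUNT(*) AS cnt FROM item_message');
$row = $res->fetch_assoc();
$total = (int)$row['cnt'];

$chunks = array();
for ($offset = 0; $offset < $total; $offset += $chunkSize) {
    $chunks[] = array('offset' => $offset, 'limit' => $chunkSize);
}
// Each entry becomes one iteration of the <forevery> iterator, and the SELECT
// gets a matching LIMIT clause, e.g. "... LIMIT 50000 OFFSET 100000".
print count($chunks) . " chunks of up to $chunkSize rows\n";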

Then I just had to slurp up all the created output files in the zip job and send the zip file to the Windows share.

If you scrutinize the schedule XML you find there is some functionality under the hood. I'm still very happy with my PHP job scheduler and the Integration Tag Language.

I end this post with the zipfile.php script. The ZipArchive and Zipper classes I found somewhere on the net, but I forgot to credit the authors; if you know who should have the honor for these classes, please tell me.

// Resolve the zip library (archive) name and open/create it.
$ziplib = fullPathName($context, $job['library']);
$ziplibcreate = $job['create'];
$log->enter('Info', "Zip library=$ziplib, create=$ziplibcreate");
$zip = new Zipper;
$ok = $zip->open($ziplib, ZipArchive::CREATE);
if ($ok !== TRUE) {   // ZipArchive::open returns TRUE or an error code, never FALSE
    $log->logit('Error', "Failed to create/open zip library=$ziplib, create=$ziplibcreate");
    return FALSE;
}
// Add every file matching the job's file patterns (e.g. 'ItemMessage*').
foreach ($job['file'] as $finx => $file) {
    foreach (glob($file['name']) as $globfilename) {
        // Use the given local name only for an exact (non-wildcard) match.
        if ($file['name'] == $globfilename) $localname = $file['localname'];
        else $localname = '';
        $path = $globfilename;
        if ($localname == '') {
            $path_parts = pathinfo($path);
            $localname = $path_parts['filename'] . '.' . $path_parts['extension'];
        }
        if (is_file($path)) {
            $log->logit('Note', "Zip $path into $localname");
            $zip->addFile($path, $localname);
        } else {
            $zip->addDir($path);   // directories are added recursively by Zipper
        }
    }
}
$zip->close();
$_RESULT = TRUE;
return $_RESULT;

class Zipper extends ZipArchive {
    // Recursively add a directory and its contents to the archive.
    public function addDir($path) {
        // print 'adding ' . $path . "\n";
        $this->addEmptyDir($path);
        $nodes = glob($path . '/*');
        foreach ($nodes as $node) {
            print $node . '<br>';   // debug output
            if (is_dir($node)) {
                $this->addDir($node);
            } else if (is_file($node)) {
                $this->addFile($node);
            }
        }
    }
} // class Zipper

2013-09-08

Moving a Data Warehouse - 3

In some posts I have written about the migration of my Data Warehouse from my own hardware to Dell servers. This move was partly initiated by my transition to a new job in the company; the management didn't dare to run the Data Warehouse on the hardware I built.

When I originally designed the Data Warehouse infrastructure, two key principles were low cost and simplicity. I needed a database, so I created a database server. I needed an ETL engine, so I created an ETL server. Then I needed phpMyAdmin, so I created a phpMyAdmin server. One function, one physical server. The only extra in my irons was an additional network interface for an internal server network. And in the beginning the servers were scrapped IBM desktops: I maximized the RAM, replaced the hard disk and added a network interface, dirt cheap. Then I installed Linux and one application, fired up the new server and forgot about it for two to five years. My servers were mostly replaced when I needed more capacity, not due to hardware failures. One of the very few hardware problems I have had is described here.

I avoid software tweaking and optimization; I try to do standard installs right from the distro, e.g. I choose the 'big' MySQL config file, and that is about how much tuning I do. I have all databases and indexes on the same disk! About half a terabyte of database with about eight million queries a day (I have seen peaks of over 15 million queries a day), on a custom-made server with 16GB RAM.

Now this has changed with the migration to 'real' servers. The one server, one function approach would have been all too expensive, so I had a choice: either pack more functions into one server or go virtual. I had for some years wanted to test a virtual solution, so I decided to go virtual without testing. I decided to go for two servers: one physical database server and one virtual host for all the other servers. I do not believe for a second you can have a low cost, simple, virtual, high performance database server. But the rest of my Data Warehouse servers could well be virtual; this way I could keep my one function, one server philosophy and still be reasonably cost efficient. I was right and I was wrong.

The new environment is much more complex: virtual servers add a software abstraction layer between the iron and Linux, and by going virtual you also need a software layer between your hard disks and the virtual servers for practical space management. For all this to work you need an expert to manage the environment. An expert costs money, and the expert has his own preferences and experiences, e.g. a Linux professional does not necessarily know Mageia Linux. Since we didn't have in-house Linux operations expertise we hired a consultant. A mistake was not to listen to the consultant's recommendation of virtualization software and Linux distro. Not that it's difficult for a Linux professional to learn another distro, it just takes some time, but more importantly, my now non-standard infrastructure will always be exotic for the consultant's operations team to support. I should have spent more time with the consultants upfront going through the server setup. We would probably have ended up with a better server setup, still adapted to my Data Warehouse.

This is a complication I would not have had if we had used Windows Server instead of Linux distros, since Windows is a single opsys. I do not know if this is good or bad.

I end this post with a humble statement. Still, few people seem to have my insights into hardware and infrastructure for Business Intelligence systems. Actually, very few I talk to make any distinction between types of applications in this respect: the same hardware fits all, give or take some RAM and CPU, and that is about the only adaptation to applications you see. I believe most hardware infrastructure is grossly overpowered/overpriced and designed for transactional ERP applications.