Some days ago I wrote a post about the speedy database server of the Data Warehouse. A month ago I blogged about a problem we had with the same server: it was, figuratively speaking, as if someone had filled the server with glue, and it was excruciatingly slow. I could not find the reason for the slowdown, but a reboot fixed the problem. Yesterday it happened again. I spent some hours trying to figure out what the problem was, but I came up with nothing. To prevent a new disaster the server was rebooted, but this time it refused to come up again. Looking at the server, one disk showed very high activity while the other disks in the raid6 group did nothing, which was very suspicious. But there were no alerts, no red lamps, no nothing. "I'm fine, just a bit busy, but I'm alright," the disk was saying. Anyway, Linux refused to start up, claiming it did not have enough time to get hold of the file system. With some help from an external consultant we found out that the high-activity disk was faulty. We pulled it out of the raid and all went back to normal again. (The disk problem was recorded in a hardware log I didn't know of.)
A disk raid is a complex thing and it is, or rather should be, fault tolerant. But what use do you have of a safe raid if it slows down so much that nothing comes out of it? I have found disk raids and raid controllers unreliable; if something breaks in a server, it's likely connected to the raid. I prefer single disks and backups. When a disk breaks, replace it, apply the backup and any redo logs, rerun jobs if needed, and then you are back in business again. In (update) transactional applications this approach might not be possible. There, fault tolerant disk raids plus a hot standby backup system might be necessary, since the chance of losing data is lower with a fault tolerant raid setup. This is contradictory, I know, but raids are complex and hard to 'debug' and I do not trust them much.
2014-04-19
2014-04-02
No April Fools Day Joke
If you follow the Data Warehouse on Twitter you can see some staggeringly high query figures for the last couple of days. On the first of April the Data Warehouse responded to 182,814,418 queries, and that is no April Fools' Day prank. That’s impressive by any standard and far from the regular 9 million queries. I thought the figures were wrong, but they turned out to be correct. The BI crew is implementing a new ‘purchase delivery performance’ routine. In the SAP source system there is no relation between Purchase Order Lines and Deliveries, so the BI team has developed a SQL stored procedure that calculates delivery performance. This stored procedure is ‘query intensive’, and what you see in the recent tweets is testing and calculation of Purchase Order history. Once we are in a steady delta calculating phase the figures will be considerably lower.
I tried to create a PHP ‘in memory’ script for this calculation, but I ran out of the 10GB of memory, trashing the Data Warehouse ETL server. Since we have sufficient memory in the database server, the more elegant stored procedure is by and large ‘in memory’ computing.
Stress testing
We figured that if it could cope with 50 concurrent calls, that would be more than it will ever be stressed with in real life, so I ran a test with 100 concurrent calls. This is how I did it:
The XML tag <forevery rows='100' parallel='yes'/> iterates the job 100 times in parallel.
It took 4 seconds to do that: start the job, create 100 threads, call the web service, receive a result, write it to disk and finally clean up.
The execution of this ITL workflow is done in PHP, including the parallel scheduling of the job 100 times. Much of the 4 seconds is probably PHP execution and server communication. The web service is more stress resistant than we will ever need.
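For readers who do not use ITL, the same kind of test can be sketched in Python. This is only an illustration of the steps listed above (start 100 parallel workers, call the service, write each result to disk, clean up), not the actual workflow. The function `call_service` is a hypothetical stand-in that sleeps briefly instead of calling the real web service, and the file names and temporary directory are assumptions as well.

```python
import concurrent.futures
import pathlib
import shutil
import tempfile
import time
from typing import Tuple


def call_service(job_id: int) -> str:
    """Hypothetical stand-in for the real web service call."""
    time.sleep(0.01)  # pretend the service needs a moment to answer
    return f"result for job {job_id}"


def run_stress_test(calls: int = 100) -> Tuple[int, float]:
    """Run `calls` concurrent calls, write each result to disk, then clean up.

    Returns the number of result files written and the elapsed seconds.
    """
    workdir = pathlib.Path(tempfile.mkdtemp(prefix="stress_"))
    started = time.perf_counter()

    def one_job(job_id: int) -> pathlib.Path:
        # One iteration of the job: call the service and store the answer.
        out = workdir / f"job_{job_id}.txt"
        out.write_text(call_service(job_id))
        return out

    try:
        # One thread per call, so all calls are in flight at the same time.
        with concurrent.futures.ThreadPoolExecutor(max_workers=calls) as pool:
            written = list(pool.map(one_job, range(calls)))
        completed = sum(1 for path in written if path.exists())
    finally:
        # Final cleanup of the result files.
        shutil.rmtree(workdir)

    return completed, time.perf_counter() - started


if __name__ == "__main__":
    completed, elapsed = run_stress_test(100)
    print(f"{completed} calls completed in {elapsed:.2f} seconds")
```

To point this at a real service, you would replace the body of `call_service` with an actual HTTP request. The timing then includes thread start-up and network communication, just as the 4 seconds above include PHP execution and server communication.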