Some days ago I wrote a post about the speedy database server of the Data Warehouse. A month ago I blogged about a problem we had with the same server: it was as if someone had filled the server with glue, figuratively speaking, and it was excruciatingly slow. I could not find the reason for the slowdown, but a reboot fixed the problem. Yesterday it happened again. I spent some hours trying to figure out what the problem was but came up with nothing. To prevent a new disaster the server was rebooted, but this time it refused to come up again. Looking at the server, one disk showed very high activity while the other disks in the RAID 6 group did nothing, which was very suspicious. But there were no alerts, no red lamps, no nothing; "I'm fine, just a bit busy, but I'm alright," the disk was saying. Anyway, Linux refused to start up, claiming it did not have enough time to get hold of the file system. With some help from an external consultant we found out that the high-activity disk was faulty. We pulled it out of the RAID and all went back to normal again. (The disk problem was recorded in a hardware log I didn't know of.)
A disk RAID is a complex thing and it is, or rather should be, fault tolerant. But what use is a safe RAID if it slows down so much that nothing comes out of it? I have found disk RAIDs and RAID controllers unreliable; if something breaks in a server, it's likely connected to the RAID. I prefer single disks and backups. When a disk breaks, replace it, apply the backup and any redo logs, rerun jobs if needed, and then you are back in business again. In transactional (update) applications this approach might not be possible; then fault-tolerant disk RAIDs plus a hot standby backup system might be necessary, since the chance of losing data is less with a fault-tolerant RAID setup. This is contradictory, I know, but RAIDs are complex and hard to 'debug' and I do not trust them much.