2013-04-07

Relational Big Data

Do you want sufficiently fast access to data, or the fastest access to data?

In Big Data and Relational stores I described the differences between Big Data and relational data. The starting point for that post was ‘there is a misconception that Big Data is needed for efficient manipulation of large volumes of data’. I described Big Data as a document database, where a document is a number of values/attributes tied together by a key and not much more. The Big Data manager can do some clever compression and optimization of the physical layout to minimize both disk space and access time.
If you remove all but one value you are left with a key value pair, and if you store such pairs you have a key value store. It is perfectly OK to store all your Big Data as key value pairs. It is not very convenient, but you can do some heavy optimization in terms of both space and performance. If your data lends itself to a key value store, Big Data is very hard to beat.
If we look at relational data, it is very hard to compete with Big Data in terms of space utilization. That in itself is not a big deal, since disk space is dirt cheap these days. But accessing disk takes time, so the more disk space the data occupies, the longer it takes to read. There is a remedy for disk access time: move the data into RAM and replace regular hard disks with SSDs, which reduces access time significantly. It is true that Big Data will benefit even more from fast storage, since its data structure is simpler. Yet again the relational data manager has a trick up its sleeve: instead of browsing through all the data, the clever database admin creates covering indexes, that is, indexes that contain every column a query needs, so the query can be answered from the index alone. Theoretically a covering index could be accessed as fast as a key value dataset. The relational database managers are not there yet, but eventually they will be.
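To make the step from document to key value pairs concrete, here is a minimal sketch in Python. The document, its field names and the two helper functions are made up for illustration; a real Big Data manager would add the compression and layout tricks on top of something like this.

```python
# Hypothetical document: a key plus a handful of attributes.
document = {"key": "cust-42", "name": "Ada", "city": "Stockholm", "balance": 120}


def to_key_value_pairs(doc, key_field="key"):
    """Split one document into pairs, one per attribute.

    Each pair is stored under the composite key (document key, attribute name).
    """
    doc_key = doc[key_field]
    return {
        (doc_key, attr): value
        for attr, value in doc.items()
        if attr != key_field
    }


def to_document(pairs, doc_key, key_field="key"):
    """Reassemble a document from all pairs that share its document key."""
    doc = {key_field: doc_key}
    for (pair_key, attr), value in pairs.items():
        if pair_key == doc_key:
            doc[attr] = value
    return doc


store = to_key_value_pairs(document)
print(store[("cust-42", "city")])

rebuilt = to_document(store, "cust-42")
print(rebuilt == document)
```

Reading a single value is one lookup, but getting the whole document back means collecting every pair with the same key, which is the inconvenience mentioned above.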
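A covering index is easy to try out. The sketch below uses SQLite from Python as a stand-in for any relational manager; the table, the column names and the index name are my own assumptions. The index holds both the column we filter on and the column we select, so the query never has to touch the table itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, note TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount, note) VALUES (?, ?, ?)",
    [(i % 100, float(i), "filler") for i in range(10_000)],
)

# The index contains every column the query below needs.
conn.execute(
    "CREATE INDEX idx_orders_customer_amount ON orders (customer_id, amount)"
)

query = "SELECT amount FROM orders WHERE customer_id = ?"

# Ask SQLite how it intends to run the query; the last column is the detail text.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (7,)).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

amounts = [row[0] for row in conn.execute(query, (7,))]
print(len(amounts))
```

The printed plan should mention the covering index. If you add the note column to the SELECT list, the index no longer covers the query and SQLite has to go back to the table for every row.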

Map and Reduce

Big Data parallelizes data access by mapping the work onto an arbitrary number of workers and then reduces the results by merging them together. This parallel programming technique is called Map and Reduce, a design pattern I’m well acquainted with.
I used Map and Reduce in 1991 when I created a search engine I called ‘Fast Search for Structured Data’. I put in ‘Structured’ to emphasize that the search engine was not primarily designed for free text search. The data was stored in key value bitmaps. We ran a test comparing my search engine with DB2 and with a Cobol-based serial processing precursor to my Map and Reduce program. We had to stop DB2 after 23 hours with no result, the Cobol program took 20 minutes, and my program took less than 5 seconds.
Rightly tuned, Map and Reduce applied to key value data can be incredibly fast on large data volumes. DB2 and other relational managers have come a long, long way since 1991; rightly tuned, they can be incredibly fast on large data volumes too.
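The pattern can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names and the sample records are made up, and I use threads to keep the example short, where a real setup would spread the chunks over processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split(records, n_workers):
    """Deal the records out into one chunk per worker."""
    return [records[i::n_workers] for i in range(n_workers)]


def map_partition(chunk):
    """Map step: each worker sums the values per key in its own chunk."""
    totals = Counter()
    for key, value in chunk:
        totals[key] += value
    return totals


def merge(left, right):
    """Reduce step: merge two partial results by adding values per key."""
    left.update(right)  # Counter.update adds to existing counts
    return left


def map_reduce(records, n_workers=4):
    chunks = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_partition, chunks))
    return reduce(merge, partials, Counter())


records = [("se", 10), ("no", 5), ("se", 7), ("dk", 1), ("no", 2)]
totals = map_reduce(records)
print(dict(totals))
```

The number of workers does not change the result, only how the work is divided, which is what makes the pattern easy to scale.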
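For readers who have not met bitmaps as a storage format, here is a minimal sketch of the general idea in Python. It is not the 1991 implementation; the layout, the names and the sample records are assumptions for illustration. For every (attribute, value) pair we keep one bitmap where bit number n is set if record number n has that value, and a search then becomes bitwise AND/OR on whole bitmaps.

```python
from collections import defaultdict

# Hypothetical records; the position in the list is the record id.
records = [
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "M"},
    {"color": "red", "size": "M"},
    {"color": "red", "size": "L"},
]


def build_bitmaps(rows):
    """Build one bitmap (a Python int) per (attribute, value) pair."""
    bitmaps = defaultdict(int)
    for row_id, row in enumerate(rows):
        for attr, value in row.items():
            bitmaps[(attr, value)] |= 1 << row_id
    return bitmaps


def row_ids(bitmap):
    """List the record ids whose bits are set in the bitmap."""
    ids = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            ids.append(row_id)
        bitmap >>= 1
        row_id += 1
    return ids


bitmaps = build_bitmaps(records)

# color = red AND size = L is a single bitwise AND of two bitmaps.
red_and_large = bitmaps[("color", "red")] & bitmaps[("size", "L")]
matches = row_ids(red_and_large)
print(matches)
```

The bitmaps can be cut into slices by record id range and handed to separate workers, so this layout fits the Map and Reduce pattern well.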

Relational Big Data

There is nothing that stops you or the relational database manager from applying Big Data techniques to relational data. Here I show how I have applied Map and Reduce on relational data. You can also dramatically increase relational performance with good index management.

When are my data volumes so large I need Big Data?

When the data volume becomes a challenge to your design, you have ‘big data’ volumes. If your data is contained in a file cabinet and all data access is done manually by you, big data volumes may be a few thousand pieces of data. Does the relational model have conceptual or inherent problems with large data volumes? No, it does not, but in some edge cases ‘Big Data’ concepts scale even better. The performance gains come at a cost: you lose some control of your data and you lose a uniform method (SQL) for data access.
Big Data and relational are both borrowing/stealing from each other. There are Big Data stores with some SQL support, and relational managers that incorporate Big Data stores. What kind of data managers and data stores we will have in the future remains to be seen. On the global scene data production super-inflates, and many more players want to capture more data. There is a demand for better data management. Yesterday most data captured for analysis came from ERP systems, today it is probably Web trade, and tomorrow it will be social media.
Personally I do not believe we live in a post-relational age. I bet my 5 cents that SQL and relational will prevail, but they will evolve in many ways, and one of them may be Big Data. I also believe there is a place for Big Data: tomorrow there will be more large data volume applications where speed matters most, and there Big Data may play a niche role. I can also see that the People's Republic of China has a great need for Big Data solutions for various reasons, and that is probably not a small niche.
