2013-04-01
Big Data and Relational data stores
The concept of Big Data has had a big impact in the corporate world. At least IT is talking about Big Data, even top-level IT management. But few actually know what it is about. A common misconception is: ‘we need Big Data to efficiently manage our very large and increasing data volumes’. I have come across this misconception not just once or twice in the last six months, but many times. Where do these ideas come from? Probably from evangelists who have seen the Big Data light and, more importantly, from those who think they can make a buck or two by selling Big Data solutions. All of a sudden all major software and hardware vendors have Big Data solutions for sale.
What is Big Data then? Before I go into that, a short recap of what we have today: the relational data model.
In the relational database model data is organised in tables. Data is stored in tables, can be viewed in virtual tables (views), and the result of operations on data is returned in tables. Data in a table is strictly ordered in columns and rows. Tables are organised into databases. The SCHEMA database is a special database where all databases, tables and columns must be predefined before they can be used. Data normalisation is a strict procedure where (unstructured) data is deconstructed into structured tables. All access to data is done via a common language, Structured Query Language (SQL).
In the relational model there exists a high degree of order. Data is described by strict metadata rules in the SCHEMA database; these rules cannot be violated, and the relational database manager guarantees the data integrity.
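A small sketch of what ‘rules cannot be violated’ means in practice. I use SQLite through Python's standard sqlite3 module here only because it needs no installation; the Mails table, its columns and the values are made up for the illustration. The table declares that every row must have a Subject, and the database manager refuses a row that breaks that rule.

```python
import sqlite3

# In-memory SQLite database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Mails ("
    " MailId INTEGER PRIMARY KEY,"
    " Subject TEXT NOT NULL,"
    " PostedDate TEXT NOT NULL)"
)

# A row that follows the declared rules is accepted.
conn.execute(
    "INSERT INTO Mails (Subject, PostedDate) VALUES (?, ?)",
    ("I like Plankton", "2013-04-01"),
)

# A row without a Subject violates NOT NULL and is rejected by the manager.
try:
    conn.execute("INSERT INTO Mails (PostedDate) VALUES (?)", ("2013-04-01",))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Mails").fetchone()[0]
print("rejected:", rejected, "rows stored:", row_count)
```

The point is that the check lives in the database definition, not in the program that happens to write the data.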
In Big Data few of these concepts exist: there is no table structure, no SCHEMA equivalent, no common data access language. In fact Big Data started its life as NoSQL and SCHEMA-free databases for documents and unstructured data, and was only recently rebranded as Big Data.
Now wait a minute, ‘Haven’t I heard this before?’ Yes, in the beginning there was the Lotus Notes database. The Lotus Notes database is the mother of Big Data; that is something very few Big Data guys talk about. In the beginning Big Data was basically defined as ‘not relational’. This is not a good marketing concept, and I think that’s why the more positive-sounding Big Data was conceived.
So what is Big Data?
Big Data: a generalised definition.
This is a very superficial generalisation, since there is no close relation between the different Big Data stores. Data is stored in the form of unstructured documents or key-value pairs, often in JSON notation, and the data is accessed by programs (often JavaScript). That’s basically it.
Anyone can picture a table, but what does an unstructured document look like? This is an attempt to depict a Big Data document:
As you see, it is text/data/whatever you like to call it, here with the headers Subject, Author, PostedDate, Tags & Body. You store the document by throwing it at the Big Data Manager, and it is stored for later use, simple as that. You access the document by writing a program that checks that the document has the headers that define the type of document you are interested in, and then selects the subject ‘I like Plankton’; your program is then given the document(s).
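The store-and-select steps just described can be sketched in a few lines of Python. This is only an illustration under my own assumptions: DocumentStore is a made-up in-memory stand-in for a ‘Big Data Manager’, not the API of any real product, and the author name, date, tags and body text are invented. Only the headers and the subject ‘I like Plankton’ come from the example above.

```python
import json

class DocumentStore:
    """Hypothetical stand-in for a 'Big Data Manager'; it keeps documents in a list."""

    def __init__(self):
        self._docs = []

    def put(self, doc):
        # Stored as-is. Nothing checks which headers the document has;
        # the JSON round trip only requires that it is valid JSON data.
        self._docs.append(json.loads(json.dumps(doc)))

    def find(self, required_headers, **criteria):
        # The *program* decides what a 'mail' is: a document with these headers.
        hits = []
        for doc in self._docs:
            if not all(header in doc for header in required_headers):
                continue
            if all(doc.get(key) == value for key, value in criteria.items()):
                hits.append(doc)
        return hits

store = DocumentStore()
store.put({
    "Subject": "I like Plankton",
    "Author": "Bob Example",                 # invented value
    "PostedDate": "Mon Apr 1 2013 10:15",    # invented value
    "Tags": ["plankton", "sea"],             # invented values
    "Body": "Some text about plankton.",     # invented value
})
store.put({"Note": "anything goes, the store does not care"})

mail_headers = ["Subject", "Author", "PostedDate", "Tags", "Body"]
mails = store.find(mail_headers, Subject="I like Plankton")
print(mails)
```

Note that the second document is accepted without complaint; it is the reading program, not the store, that filters it out.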
Now we take the same document and deconstruct it into relational tables:
From the document I constructed four tables (column names are removed for visibility, but they are the same as the document headers). What I hope is obvious from the pictures: there is more order and complexity in relational tables compared with Big Data documents. (I simplified the table structure quite a bit and left out a lot of definitions; in real life there is even more order and complexity.) The Tags are moved to a separate table, and the links between the Tags and Mails are moved to a special TagMail table. Authors have got a table of their own; this is to show that if you add more data about the author(s), that data goes into the Authors table and not into the Mails table. Finally, I could not resist the temptation to change PostedDate into a decent format.
Here I cannot store the mail just by throwing the document at the relational database manager. I have to map the mail into the table structure and issue separate SQL update requests against the tables in the right order. This is definitely more complex than just throwing the document as it is at the Big Data Manager. I can then access the mail by joining these tables together with SQL. Whether this is simpler than creating a program in the language of the Big Data manager’s choice is a matter of taste. I happen to think SQL is simpler and better, and it is one unifying language for all relational managers.
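A rough sketch of those steps, again with SQLite through Python's sqlite3 module as a convenient stand-in for a relational database manager. The table names Mails, Authors, Tags and TagMail follow the text; the key columns (MailId, AuthorId, TagId, Name, Tag) and all data values except the subject are my own assumptions for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # let SQLite enforce the references

# The structure must be defined before any data can be stored.
conn.executescript("""
CREATE TABLE Authors (AuthorId INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE Mails (
    MailId     INTEGER PRIMARY KEY,
    Subject    TEXT NOT NULL,
    AuthorId   INTEGER NOT NULL REFERENCES Authors(AuthorId),
    PostedDate TEXT NOT NULL,
    Body       TEXT NOT NULL
);
CREATE TABLE Tags (TagId INTEGER PRIMARY KEY, Tag TEXT NOT NULL UNIQUE);
CREATE TABLE TagMail (
    TagId  INTEGER NOT NULL REFERENCES Tags(TagId),
    MailId INTEGER NOT NULL REFERENCES Mails(MailId),
    PRIMARY KEY (TagId, MailId)
);
""")

# Map the mail into the tables in the right order: author first, then the
# mail that refers to the author, then the tags and the tag-to-mail links.
author_id = conn.execute(
    "INSERT INTO Authors (Name) VALUES (?)", ("Bob Example",)
).lastrowid
mail_id = conn.execute(
    "INSERT INTO Mails (Subject, AuthorId, PostedDate, Body) VALUES (?, ?, ?, ?)",
    ("I like Plankton", author_id, "2013-04-01 10:15:00", "Some text about plankton."),
).lastrowid
for tag in ("plankton", "sea"):
    tag_id = conn.execute("INSERT INTO Tags (Tag) VALUES (?)", (tag,)).lastrowid
    conn.execute("INSERT INTO TagMail (TagId, MailId) VALUES (?, ?)", (tag_id, mail_id))
conn.commit()

# Put the mail back together by joining the tables.
rows = conn.execute("""
    SELECT m.Subject, a.Name, m.PostedDate, t.Tag
    FROM Mails m
    JOIN Authors a  ON a.AuthorId = m.AuthorId
    JOIN TagMail tm ON tm.MailId  = m.MailId
    JOIN Tags t     ON t.TagId    = tm.TagId
    WHERE m.Subject = ?
    ORDER BY t.Tag
""", ("I like Plankton",)).fetchall()
for row in rows:
    print(row)
```

The join returns one row per tag, so the single mail comes back as two rows here. That is more work up front than throwing a document at a store, but the structure is known to the database and not only to the program.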
Relational data is ordered and documented to a very high degree, whereas Big Data is not; there you order and structure your data in the programs that access the data. It is easier to start with a ‘Big Data’ store than a relational one. You don't have to model your data structure and define your database schema before you start using it. But this is like peeing in your pants: warm and cosy at first, but then it’s just wet and cold (and you stink). You must be very careful with your Big Data, otherwise you will lose control over your data.
This was a brief, generalised overview of Big Data & Relational data stores. One question that arises when reading this is, ‘What has this to do with large data volumes?’ Not much; that is another aspect of Big Data and the relational database that I hope to address in another post.
P.S.
Lotus Notes data administration.
‘Where is my important application data, and what is all this crap?’ This is what I hear over and over again from LN administrators. ‘We must enforce strict rules for creating databases; only trusted developers should be able to create databases.’ This is a mantra the LN admins are chanting. They (and even more so top-level IT management) try to fight the lack of control over data with masochistic self-imposed rules, restricting the creation of LN data.
This is not something you find in the relational camp; there order prevails.