The big data technology landscape offers so many choices that the most critical step in implementing a solution is often choosing the right platform: one that addresses the requirements of the problem at hand and is sustainable in the long term.
With so many choices, a feature-by-feature comparison of every individual platform is too complex and time-consuming. It is possible, however, to group these options by the data models they support.
The following technology comparison matrix compares the predominant data models, their capabilities, typical applications, and limitations.
Key-value
Capabilities:
The simplest model, where each object is retrieved with a unique key, with values having no inherent model
Utilize in-memory storage to provide fast access with optional persistence
Other data models are built on top of this model to provide more complex objects
Typical applications:
Applications requiring fast access to a large number of objects, such as caches or queues
Applications that require fast-changing data environments like mobile, gaming, online ads
Limitations:
Cannot update a subset of a value
Does not provide querying
As the number of objects becomes large, generating unique keys could become complex
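The key-value limitations above can be illustrated with a minimal sketch. It assumes a plain Python dictionary as a stand-in for a key-value engine; the names `store`, `put`, and `get` are hypothetical and do not correspond to any particular product's API.

```python
import json

# Hypothetical in-memory key-value store: string keys, opaque byte values.
store = {}

def put(key, value):
    store[key] = value

def get(key):
    return store.get(key)

# The store sees only bytes; the application decides how to encode the value.
put("session:42", json.dumps({"user": "alice", "cart": ["book"]}).encode())

# Changing one field means reading, decoding, modifying, and rewriting the whole value.
session = json.loads(get("session:42"))
session["cart"].append("pen")
put("session:42", json.dumps(session).encode())

# No query by content: finding the sessions for "alice" means scanning every key.
alice_keys = [k for k, v in store.items() if json.loads(v)["user"] == "alice"]
```

Lookups by key stay cheap, but both the partial update and the content search had to be done in application code.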
Document
Capabilities:
Extension of a key-value model, where the value is a structured document
Documents can be highly complex, hierarchical data structures without requiring pre-defined “schema”
Support queries on structured documents
Search platforms are also document-oriented
Typical applications:
Applications that need to manage a large variety of objects that differ in structure
Large product catalogs in e-commerce, customer profiles, content management applications
Limitations:
No standard query syntax
Query performance not linearly scalable
Join queries across collections not efficient
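As a rough illustration of the document model, the sketch below keeps differently shaped documents in one collection and queries a nested field. The `catalog` data and the `find` helper are made up for this example; real document stores each expose their own query syntax.

```python
# Hypothetical product catalog: documents in one collection need not share a structure.
catalog = [
    {"_id": 1, "type": "book", "title": "Example Novel",
     "details": {"pages": 612, "author": "A. Writer"}},
    {"_id": 2, "type": "laptop", "title": "Example Laptop",
     "details": {"ram_gb": 16, "ports": ["usb-c", "hdmi"]}},
    {"_id": 3, "type": "book", "title": "Example Stories",
     "details": {"pages": 120}},
]

def find(collection, path, predicate):
    """Return documents whose nested field at a dotted path satisfies predicate."""
    results = []
    for doc in collection:
        node = doc
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if node is not None and predicate(node):
            results.append(doc)
    return results

# Documents without a "details.pages" field are skipped rather than causing errors.
long_books = find(catalog, "details.pages", lambda pages: pages > 300)
```

This toy query is a linear scan over the collection; production engines rely on indexes to avoid that.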
Column-family
Capabilities:
Extension of the key-value model, where the value is a set of columns (column-family)
A column can have multiple time-stamped versions
Columns can be generated at run-time and not all rows need to have all columns
Typical applications:
Storing a large number of time-stamped data like event logs, sensor data
Analytics that involve querying entire columns of data such as trends or time series analytics
Limitations:
No join queries or sub-queries
Limited support for aggregation
Ordering is done per partition, specified at table creation time
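The column-family layout described above can be sketched as nested maps: row key, then column name, then a list of time-stamped versions. This is a simplified in-memory model with hypothetical names (`table`, `put`, `get_latest`, `scan_column`), not the storage format of any specific engine.

```python
from collections import defaultdict

# row key -> column name -> list of (timestamp, value) versions
table = defaultdict(lambda: defaultdict(list))

def put(row, column, value, ts):
    table[row][column].append((ts, value))

def get_latest(row, column):
    """Return the value with the highest timestamp, or None if the column is absent."""
    versions = table.get(row, {}).get(column, [])
    return max(versions, key=lambda v: v[0])[1] if versions else None

def scan_column(column):
    """Read one column across all rows, e.g. for a trend query."""
    return {row: get_latest(row, column) for row in table if column in table[row]}

put("sensor-1", "temperature", 21.5, ts=100)
put("sensor-1", "temperature", 22.1, ts=200)  # a second version of the same column
put("sensor-1", "humidity", 40, ts=200)
put("sensor-2", "temperature", 19.0, ts=150)  # this row has no humidity column
```

Columns appear only when first written, so `sensor-2` carries no empty placeholder for `humidity`.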
Graph
Capabilities:
Models graphs consisting of nodes and edges with properties (metadata) describing them
Implement very fast graph traversal operations
Also support indexing of metadata to enable graph traversal combined with search queries
Typical applications:
Applications that deal with objects with a large number of inter-relations
Applications like social networking friends-networks, hierarchical role-based permissions, complex decision trees, maps, network topologies
Limitations:
Difficult to scale for large data sets for generic graphs
Giraph uses the Bulk Synchronous Parallel model to overcome some of the scalability limitations
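A small sketch of traversal combined with a metadata filter, assuming an adjacency-list graph held in memory. The `nodes` and `edges` data and the `within_hops` function are invented for illustration; graph databases provide their own traversal languages.

```python
from collections import deque

# Hypothetical friends network: node properties plus an undirected adjacency list.
nodes = {
    "ann": {"city": "Oslo"},
    "bob": {"city": "Rome"},
    "cat": {"city": "Oslo"},
    "dan": {"city": "Oslo"},
}
edges = {"ann": ["bob"], "bob": ["ann", "cat"], "cat": ["bob", "dan"], "dan": ["cat"]}

def within_hops(start, max_hops):
    """Breadth-first traversal returning {node: distance} for nodes up to max_hops away."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if seen[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in edges.get(current, []):
            if neighbor not in seen:
                seen[neighbor] = seen[current] + 1
                queue.append(neighbor)
    del seen[start]
    return seen

# Traversal combined with a property filter: people near "ann" who live in Oslo.
nearby = within_hops("ann", 2)
nearby_in_oslo = [name for name in nearby if nodes[name]["city"] == "Oslo"]
```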
Relational
Capabilities:
Conventional RDBMS structure consisting of a fixed schema with ACID properties
Provides well documented and widely supported SQL syntax
Capable of complex queries including sub-queries and joins
Typical applications:
Transactional data applications like ERP, CRM, Banking etc.
Applications where data volume is limited and schema are by and large fixed
Limitations:
Lacks horizontal scalability and hence limited in handling “big data”
Not efficient at handling complex multi-level nested data
Cannot handle “unstructured” data where the structure is not known at design time
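For contrast, the relational capabilities listed above (fixed schema, joins, sub-queries, atomic transactions) can be shown with Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
""")

with conn:  # commits on success, rolls back if an exception is raised
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 1, 50.0), (2, 1, 150.0), (3, 2, 20.0)])

# Join plus sub-query: customers whose total spend exceeds the average order amount.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
    HAVING SUM(o.amount) > (SELECT AVG(amount) FROM orders)
    ORDER BY total DESC
""").fetchall()

# Atomicity: the second insert violates the primary key, so neither insert persists.
try:
    with conn:
        conn.execute("INSERT INTO orders VALUES (4, 2, 10.0)")
        conn.execute("INSERT INTO orders VALUES (4, 2, 10.0)")
except sqlite3.IntegrityError:
    pass
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```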
Choosing the right platform involves mapping the requirements to a data model and to the type of querying and data access patterns required. The table above gives a brief overview of the common features supported by various engines and can help shorten the list of options. However, making a final choice for any application requires a much finer analysis of the individual engines.
Hadoop may seem like a glaring omission from the above table. Hadoop is so widely known in the context of big data and NoSQL that it is often mistaken for a database.
Hadoop fundamentally consists of:
Hadoop Distributed File System (HDFS): a distributed file storage system with built-in replication and fault tolerance
Hadoop YARN: a framework for job scheduling and cluster resource management
MapReduce: a distributed programming model to process a large number of objects
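The MapReduce programming model can be sketched on a single machine as three steps: map each input to key-value pairs, group the pairs by key, and reduce each group. The function names below are illustrative only and this is not Hadoop's API; in Hadoop these phases run in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

lines = ["big data big choices", "data models"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the work can be split across many machines, which is the property the distributed model builds on.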
There are database engines built on top of Hadoop, such as HBase, Hive, and Giraph, that provide different database models, and these are included in the above table.