Big Data. Seifedine Kadry

Чтение книги онлайн.

Читать онлайн книгу Big Data - Seifedine Kadry страница 19

Big Data - Seifedine  Kadry

Скачать книгу

rel="nofollow" href="#ulink_c11848e8-4428-5f5b-a6d1-8151b8396818">Figure 2.11 illustrates the combination of sharding and replication where the data set is split up into shard A and shard B. Shard A is replicated across node A and node B; similarly shard B is replicated across node C and node D.

image

      A file system is a way of storing and organizing the data on storage devices such as hard drives, DVDs, and so forth, and to keep track of the files stored on them. The file is the smallest unit of storage defined by the file system to pile the data. These file systems store and retrieve data for the application to run effectively and efficiently on the operating systems. A distributed file system stores the files across cluster nodes and allows the clients to access the files from the cluster. Though physically the files are distributed across the nodes, logically it appears to the client as if the files are residing on their local machine. Since a distributed file system provides access to more than one client simultaneously, the server has a mechanism to organize updates for the clients to access the current updated version of the file, and no version conflicts arise. Big data widely adopts a distributed file system known as Hadoop Distributed File System (HDFS).

      The key concept of a distributed file system is the data replication where the copies of data called replicas are distributed on multiple cluster nodes so that there is no single point of failure, which increases the reliability. The client can communicate with any of the closest available nodes to reduce latency and network traffic. Fault tolerance is achieved through data replication as the data will not be lost in case of node failure due to the redundancy in the data across nodes.

       image image

      Relational databases become unsuitable when organizations collect vast amount of customer databases, transactions, and other data, which may not be structured to fit into relational databases. This has led to the evolution of non‐relational databases, which are schema‐less. NoSQL is a non‐relational database and a few frequently used NoSQL databases are Neo4J, Redis, Cassandra, and MongoDb. Let us have a quick look at the properties of RDBMS and NoSQL databases.

      2.4.1 RDBMS Databases

      RDBMS is vertically scalable and exhibits ACID (atomicity, consistency, isolation, durability) properties and support data that adhere to a specific schema. This schema check is made at the time of inserting or updating data, and hence they are not ideal for capturing and storing data arriving at high velocity. The architectural limitation of RDBMS makes it unsuitable for big data solutions as a primary storage device.

      For the past decades, relational database management systems that were running in corporate data centers have stored the bulk of the world’s data. But with the increase in volume of the data, RDBMS can no longer keep pace with the volume, velocity, and variety of data being generated and consumed.

      Big data, which is typically a collection of data with massive volume and variety arriving at a high velocity, cannot be effectively managed with traditional data management tools. While conventional databases are still existing and used in a large number of applications, one of the key advancements in resolving the problems with big data is the emergence of modern alternate database technologies that do not require any fixed schema to store data; rather, the data is distributed across the storage paradigm. The main alternative databases are NoSQL and NewSQL databases.

      2.4.2 NoSQL Databases

      A NoSQL (Not Only SQL) database includes all non‐relational databases. Unlike RDBMS, which exhibits ACID properties, a NoSQL database follows the CAP theorem (consistency, availability, partition tolerance) and exhibits the BASE (basically, available, soft state, eventually consistent) model, where the storage devices do not provide immediate consistency; rather, they provide eventual consistency. Hence, these databases are not appropriate for implementing large transactions.

Key‐value databases Document databases Column databases Graph databases
Redis MongoDB DynamoDB Neo4j
Riak

Скачать книгу