Read the book online - Big Data. Seifedine Kadry. Mathematics. LiveLib

Figure 2.11 illustrates the combination of sharding and replication where the data set is split up into shard A and shard B. Shard A is replicated across node A and node B; similarly shard B is replicated across node C and node D.

Figure 2.10 Peer‐to‐peer model.

Figure 2.11 Combination of sharding and replication.
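The layout described for Figure 2.11 can be sketched in a few lines of Python. This is only an illustration: the names `SHARD_REPLICAS`, `shard_for`, and `nodes_for` are hypothetical, and routing keys to shards by hashing is an assumption, since the text does not say how records are assigned to shard A or shard B.

```python
import hashlib

# Hypothetical layout mirroring the text: each shard is replicated on two nodes.
SHARD_REPLICAS = {
    "shard_A": ["node_A", "node_B"],
    "shard_B": ["node_C", "node_D"],
}


def shard_for(key: str) -> str:
    """Pick a shard by hashing the key (hash partitioning is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    shards = sorted(SHARD_REPLICAS)
    return shards[digest[0] % len(shards)]


def nodes_for(key: str, down: frozenset = frozenset()) -> list:
    """Return the live nodes that hold a replica of the key's shard."""
    return [node for node in SHARD_REPLICAS[shard_for(key)] if node not in down]
```

Sharding decides which part of the data a key belongs to, while replication decides how many nodes can answer for that part. If one node of a shard is passed in `down`, `nodes_for` still returns the other replica.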

2.3 Distributed File System

A file system is a way of storing and organizing data on storage devices such as hard drives and DVDs, and of keeping track of the files stored on them. The file is the smallest unit of storage that the file system defines for holding data. File systems store and retrieve data so that applications can run effectively and efficiently on the operating system. A distributed file system stores files across cluster nodes and allows clients to access them from the cluster. Although the files are physically distributed across the nodes, to the client they appear to reside on the local machine. Because a distributed file system serves more than one client simultaneously, the server has a mechanism to coordinate updates so that clients access the current version of a file and no version conflicts arise. Big data systems widely adopt a distributed file system known as the Hadoop Distributed File System (HDFS).

The key concept of a distributed file system is data replication: copies of the data, called replicas, are distributed across multiple cluster nodes so that there is no single point of failure, which increases reliability. The client can communicate with the closest available node to reduce latency and network traffic. Replication also provides fault tolerance: because the data is stored redundantly across nodes, it is not lost when a node fails.
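A toy model may help make the replication idea concrete. The sketch below is not HDFS code: the `Node` class, the `distance` field standing in for network proximity, and the replication factor of 3 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    distance: int  # hypothetical network distance from the client
    alive: bool = True
    blocks: dict = field(default_factory=dict)


def write_block(nodes, block_id, data, replication=3):
    """Store copies of a block on the first `replication` live nodes."""
    targets = [n for n in nodes if n.alive][:replication]
    for node in targets:
        node.blocks[block_id] = data
    return [n.name for n in targets]


def read_block(nodes, block_id):
    """Read from the closest live node that holds a replica."""
    holders = [n for n in nodes if n.alive and block_id in n.blocks]
    if not holders:
        raise IOError(f"no live replica of {block_id}")
    closest = min(holders, key=lambda n: n.distance)
    return closest.name, closest.blocks[block_id]
```

With replicas on three nodes, a read is served by the nearest one. When that node is marked as not alive, the same call falls back to the next closest replica, which is the fault tolerance the paragraph describes.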

2.4 Relational and Non‐Relational Databases

Relational databases organize data into tables of rows and columns. The rows are called records, and the columns are called attributes or fields. A database with only one table is called a flat database, while a database with two or more tables that are related is called a relational database. Table 2.1 shows a simple table that stores the details of the students registering for the courses offered by an institution.

In Table 2.1, the table holds the details of the students and the CourseId of each course for which they have registered. The table meets the basic need of keeping track of the courses for which each student has registered, but it has serious flaws in efficiency and space utilization. For example, when a student registers for more than one course, the student's details have to be entered again for every course. This can be overcome by dividing the data across multiple related tables. Figure 2.12 shows the data from Table 2.1 divided among multiple related tables with unique primary and foreign keys.

Relational tables have attributes that uniquely identify each row. The attribute, or set of attributes, that uniquely identifies the tuples is called the primary key. StudentId is the primary key, and hence its value must be unique. An attribute in one table that references the primary key of another table is called a foreign key. CourseId in RegisteredCourse is a foreign key that references CourseId in the CoursesOffered table.

Table 2.1 Student course registration database.

Figure 2.12 Data divided across multiple related tables.
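The split into related tables can be tried out with Python's built-in `sqlite3` module. The table names RegisteredCourse and CoursesOffered and the key columns StudentId and CourseId come from the text; the `Student` table name, the `StudentName` and `CourseName` columns, and the sample rows are assumptions, since the contents of Table 2.1 and Figure 2.12 are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Student (
    StudentId   INTEGER PRIMARY KEY,
    StudentName TEXT NOT NULL           -- assumed column
);
CREATE TABLE CoursesOffered (
    CourseId   TEXT PRIMARY KEY,
    CourseName TEXT NOT NULL            -- assumed column
);
CREATE TABLE RegisteredCourse (
    StudentId INTEGER NOT NULL REFERENCES Student(StudentId),
    CourseId  TEXT    NOT NULL REFERENCES CoursesOffered(CourseId),
    PRIMARY KEY (StudentId, CourseId)
);
""")

# The student's details are stored once, however many courses are taken.
conn.execute("INSERT INTO Student VALUES (1, 'Asha')")
conn.executemany("INSERT INTO CoursesOffered VALUES (?, ?)",
                 [("C101", "Databases"), ("C102", "Statistics")])
conn.executemany("INSERT INTO RegisteredCourse VALUES (?, ?)",
                 [(1, "C101"), (1, "C102")])

rows = conn.execute("""
    SELECT s.StudentName, c.CourseName
    FROM RegisteredCourse AS r
    JOIN Student AS s ON s.StudentId = r.StudentId
    JOIN CoursesOffered AS c ON c.CourseId = r.CourseId
    ORDER BY c.CourseId
""").fetchall()

# Registering for a course that is not offered violates the foreign key.
fk_rejected = False
try:
    conn.execute("INSERT INTO RegisteredCourse VALUES (1, 'C999')")
except sqlite3.IntegrityError:
    fk_rejected = True
```

The join reassembles the single wide table on demand, and the foreign key stops a registration from pointing at a course that does not exist.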

Relational databases become unsuitable when organizations collect vast amounts of customer data, transactions, and other data that may not be structured to fit into relational tables. This has led to the evolution of non‐relational databases, which are schema‐less. NoSQL is the general term for non‐relational databases; a few frequently used NoSQL databases are Neo4j, Redis, Cassandra, and MongoDB. Let us have a quick look at the properties of RDBMS and NoSQL databases.

2.4.1 RDBMS Databases

An RDBMS is vertically scalable, exhibits ACID (atomicity, consistency, isolation, durability) properties, and supports data that adheres to a specific schema. The schema check is made when data is inserted or updated, and hence an RDBMS is not ideal for capturing and storing data arriving at high velocity. This architectural limitation makes an RDBMS unsuitable as the primary storage for big data solutions.
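Atomicity and the check made at write time can both be observed in a small `sqlite3` session. The `Account` table, the `transfer` and `balances` helpers, and the amounts are hypothetical and exist only for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Account (
        AccountId INTEGER PRIMARY KEY,
        Balance   INTEGER NOT NULL CHECK (Balance >= 0)
    )
""")
conn.executemany("INSERT INTO Account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE Account SET Balance = Balance + ? WHERE AccountId = ?",
                (amount, dst))
            conn.execute(
                "UPDATE Account SET Balance = Balance - ? WHERE AccountId = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {AccountId: Balance} for all accounts."""
    return dict(conn.execute(
        "SELECT AccountId, Balance FROM Account ORDER BY AccountId"))


ok = transfer(conn, 1, 2, 30)       # both updates are applied
failed = transfer(conn, 1, 2, 500)  # debit violates the CHECK; credit is undone
```

In the second transfer the credit to account 2 runs first and succeeds, but the debit fails the constraint check, so the whole transaction is rolled back and neither balance changes. Every write pays for this validation, which is part of why the text calls the model a poor fit for high-velocity ingestion.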

For the past decades, relational database management systems running in corporate data centers have stored the bulk of the world’s data. But RDBMS can no longer keep pace with the volume, velocity, and variety of data being generated and consumed.

Big data, which is typically a collection of data with massive volume and variety arriving at a high velocity, cannot be effectively managed with traditional data management tools. While conventional databases still exist and are used in a large number of applications, one of the key advancements in resolving the problems with big data is the emergence of modern alternative database technologies that do not require any fixed schema to store data; rather, the data is distributed across the storage paradigm. The main alternative databases are NoSQL and NewSQL databases.

2.4.2 NoSQL Databases

NoSQL (Not Only SQL) covers all non‐relational databases. Unlike an RDBMS, which exhibits ACID properties, a NoSQL database follows the CAP theorem (consistency, availability, partition tolerance) and exhibits the BASE (basically available, soft state, eventual consistency) model, where the storage devices do not provide immediate consistency; rather, they provide eventual consistency. Hence, these databases are not appropriate for implementing large transactions.
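Eventual consistency can be illustrated with a toy set of replicas that accept writes independently and reconcile later. This is a simplified sketch, not how any particular NoSQL product works: the `Replica` class, the `anti_entropy` function, and the use of integer versions with a last-write-wins rule are all assumptions.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (version, value)

    def write(self, key, value, version):
        """Accept the write only if it is newer than what is stored."""
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Propagate the newest version of every key to all replicas."""
    for source in replicas:
        for key, (version, value) in list(source.store.items()):
            for target in replicas:
                target.write(key, value, version)


a, b, c = Replica("a"), Replica("b"), Replica("c")
b.write("cart", ["book"], version=1)

stale_read = a.read("cart")   # replica a has not seen the write yet
fresh_read = b.read("cart")   # replica b accepted the write

anti_entropy([a, b, c])       # background synchronization
converged = [r.read("cart") for r in (a, b, c)]
```

Between the write and the synchronization pass, a client reading from replica `a` sees no value while a client reading from replica `b` sees the new one. After the pass all replicas agree. This window of disagreement is what the BASE model accepts in exchange for availability.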

The various types of NoSQL databases, namely key‐value databases, document databases, column‐oriented databases, and graph databases, were discussed in detail in Section 2.3. Table 2.2 shows examples of various types of NoSQL databases.

Table 2.2 Popular NoSQL databases.

Key‐value databases	Document databases	Column databases	Graph databases
Redis	MongoDB	DynamoDB	Neo4j
Riak
