Official Google Cloud Certified Professional Data Engineer Study Guide. Dan Sullivan

Чтение книги онлайн.

Читать онлайн книгу Official Google Cloud Certified Professional Data Engineer Study Guide - Dan Sullivan страница 15

Official Google Cloud Certified Professional Data Engineer Study Guide - Dan  Sullivan

Скачать книгу

791 37 61 28.6 1571760690 789 40 35 28.2 1571760698 790 42.5 50 29.1

      Unstructured Data

      The distinguishing characteristic of unstructured data is that it does not have a defined schema or data model. Structured data, like relational database tables, has a fixed data model that is defined before data is added to the table. Semi-structured databases include a schema with each row or document in the database. Examples of unstructured data include the following:

       Text files of natural language content

       Audio files

       Video files

       Binary large objects (BLOBs)

      Google’s Storage Decision Tree

      Schema Design Considerations

      Structured and semi-structured data has a schema associated with it. Structured data is usually stored in relational databases whereas semi-structured data is often stored in NoSQL databases. The schema influences how data is stored and accessed, so once you have determined which kind of storage technology to use, you may then need to design a schema that will support optimal storage and retrieval.

      

The distinction between relational and NoSQL databases is becoming less pronounced as each type adopts features of the other. Some relational databases support storing and querying JavaScript Object Notation (JSON) structures, similar to the way that document databases do. Similarly, some NoSQL databases now support ACID (atomicity, consistency, isolation, durability) transactions, which are a staple feature of relational databases.

      Relational Database Design

      OLTP

      Online transaction processing (OLTP) databases are designed for transaction processing and typically follow data normalization rules. There are currently 10 recognized forms of normalization, but most transaction processing systems follow no more than three of those forms:

       The first form of normalization requires that each column in the table have an atomic value, no repeating groups, and a primary key, which is one or more ordered columns that uniquely identify a row.

       The second form of normalization includes the first form and creates separate tables for values that apply to multiple rows and links them using foreign keys. A foreign key is one or more ordered columns that correspond to a primary key in another table.

       The third form of normalization, which includes the second form, eliminates any columns from a table that does not depend on the key.

      These rules of normalization are designed to reduce the risk of data anomalies and to avoid the storage of redundant data. Although they serve those purposes well, they can lead to high levels of I/O operations when joining tables or updating a large number of indexes. Using an OLTP data model requires a balance between following the rules of normalization to avoid anomalies and designing for performance.

      Denormalization—that is, intentionally violating one of the rules of normalization—is often used to improve query performance. For example, repeating customer names in both the customer table and an order table could avoid having to join the two tables when printing invoices. By denormalizing, you can reduce the need to join tables since the data that would have been in another table is stored along with other data in the row of one table.

      OLAP

      Online analytical processing (OLAP) data models are often used for data warehouse and data mart applications. OLAP models are also called dimensional models because data is organized around several dimensions. OLAP models are designed to facilitate the following:

       Rolling up and aggregating data

       Drilling down from summary data to detailed data

       Pivoting and looking at data from different dimensions—sometimes called slicing and dicing

      OLAP can be implemented in relational database or in specialized multidimensional data stores.

      SQL Crash Course

      SQL has three types of statements that developers use:

       Data definition language (DDL) statements, which are used to create and modify database schemas

       Data manipulation language (DML) statements, which are used to insert, update, delete, and query data

       Data query language (DQL) statements, which is a single statement: SELECT

Скачать книгу