Official Google Cloud Certified Professional Data Engineer Study Guide. Dan Sullivan

Чтение книги онлайн.

Читать онлайн книгу Official Google Cloud Certified Professional Data Engineer Study Guide - Dan Sullivan страница 13

Official Google Cloud Certified Professional Data Engineer Study Guide - Dan  Sullivan

Скачать книгу

      At the other end of the velocity spectrum are low-velocity migrations or archiving operations. For example, an organization that uses the Transfer Appliance for large-scale migration may wait days before the data is available in Cloud Storage.

      Variation in Structure

      Many business applications that use relational databases also have limited variation in data structure. For example, all customers have most attributes in common, such as name and address, but other business applications may have name suffixes, such as M.D. and Ph.D., stored in an additional field. In those cases, it is common to allow NULL values for attributes that may not be needed.

First_name Last_name Street_Address City Postal_Code
Michael Johnson 334 Bay Rd Santa Fe 87501
Wang Li 74 Alder St Boise 83701
Sandra Connor 123 Main St Los Angeles 90014

      The data in the first row would be represented in a document database using a structure something like the following:

      { ’first_name’: ’Michael’, ’last_name’: ’Johnson’. ’street’_address’: ’334 Bay Rd’, ’city’: ’Santa Fe’, ’postal_code’: ’87501’ }

      { {’id’: ’123456’, ’product_type’: ’dishwasher’, ’length’: ’24 in’, ’width’: ’34 in’, ’weight’: ’175 lbs’, ’power’: ’1800 watts’ } {’id’:’987654’, ’product_type’: ’chair’, ’weight’: ’15 kg’, ’style’: ’modern’, ’color’: ’brown’ } }

      In addition to document databases, wide-column databases, such as Bigtable and Cassandra, are also used with datasets with varying attributes.

      Data Access Patterns

      Data is accessed in different ways for different use cases. Some time-series data points may be read immediately after they are written, but they are not likely to be read once they are more than a day old. Customer order data may be read repeatedly as an order is processed. Archived data may be accessed less than once a year. Four metrics to consider about data access are as follows:

       How much data is retrieved in a read operation?

       How much data is written in an insert operation?

       How often is data written?

       How often is data read?

      Some read and write operations apply to small amounts of data. Reading or writing a single piece of telemetry data is an example. Writing an e-commerce transaction may also entail a small amount of data. A database storing telemetry data from thousands of sensors that push data every five seconds will be writing large volumes, whereas an online transaction processing database for a small online retailer will also write small individual units of data but at a much smaller rate. These will require different kinds of databases. The telemetry data, for example, is better suited to Bigtable, with its low-latency writes, and the retailer transaction data is a good use case for Cloud SQL, with support for sufficient I/O operations to handle relational database loads.

      It is common to read large volumes of data in BigQuery as well; however, in that case we often read a small number of columns across a large number of rows. BigQuery optimizes for these kinds of reads by using a columnar storage format known as Capacitor. Capacitor is designed to store semi-structured data with nested and repeated fields.

      Data access patterns can help identify the best storage technology for a use case by highlighting key features needed to support those access patterns.

      Security Requirements

      Different storage systems will have different levels of access controls. Cloud Storage, for example, can have access controls at the bucket and the object level. If someone has access to a file in Cloud Storage, they will have access to all the data in that file. If some users have access only to a subset of a dataset, then the data could be stored in a relational database and a view could be created that includes only the data that the user is allowed to access.

      Encrypting data at rest is an important requirement for many use cases; fortunately, all Google Cloud storage services encrypt data at rest.

      When choosing a storage technology, the ability to control access to data is a key consideration.

      Types of Structure: Structured, Semi-Structured, and Unstructured

      For the purposes of choosing a storage technology, it is helpful to consider how data is structured. There are three widely recognized categories:

       Structured

       Semi-structured

       Unstructured

      These categories are particularly

Скачать книгу