Official Google Cloud Certified Professional Data Engineer Study Guide. Dan Sullivan
At the other end of the velocity spectrum are low-velocity migrations or archiving operations. For example, an organization that uses the Transfer Appliance for large-scale migration may wait days before the data is available in Cloud Storage.
Variation in Structure
Another key attribute to consider when choosing a storage technology is the amount of variation that you expect in the data structure. Some data structures have low variance. For example, a weather sensor that sends temperature, humidity, and pressure readings at regular time intervals has virtually no variation in the data structure. All data sent to the storage system will have those three measures unless there is an error, such as a lost network packet or corrupted data.
Many business applications that use relational databases also have limited variation in data structure. For example, all customers have most attributes in common, such as name and address, but some customers may have additional attributes, such as the name suffixes M.D. and Ph.D., stored in an additional field. In those cases, it is common to allow NULL values for attributes that may not be needed.
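The nullable-attribute approach can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a recommended schema: the table name, column names, and sample rows are all made up for the example.

```python
import sqlite3

# Hypothetical schema for illustration; names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        first_name  TEXT NOT NULL,
        last_name   TEXT NOT NULL,
        name_suffix TEXT            -- nullable: most customers have no suffix
    )
    """
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Alex", "Rivera", None),     # no suffix, stored as NULL
        ("Jordan", "Kim", "M.D."),    # suffix present
    ],
)

# NULL values come back from sqlite3 as Python None.
rows = conn.execute(
    "SELECT first_name, last_name, name_suffix FROM customers ORDER BY last_name"
).fetchall()

for first, last, suffix in rows:
    print(first, last, suffix if suffix is not None else "(no suffix)")
```

Every row keeps the same three columns, so the structure stays fixed even though one attribute is rarely populated.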
Not all business applications fit well into the rigid structure of strictly relational databases. Document databases, such as MongoDB, CouchDB, and OrientDB, are one kind of NoSQL database designed for this situation. These databases use sets of key-value pairs to represent varying attributes. Instead of having a fixed set of attributes, like a relational database table, they store the attribute name along with the attribute value in the database. For comparison, Table 1.1 shows structured, relational data.
Table 1.1 Example of structured, relational data
First_name | Last_name | Street_Address | City | Postal_Code |
Michael | Johnson | 334 Bay Rd | Santa Fe | 87501 |
Wang | Li | 74 Alder St | Boise | 83701 |
Sandra | Connor | 123 Main St | Los Angeles | 90014 |
The data in the first row would be represented in a document database using a structure something like the following:
{ "first_name": "Michael", "last_name": "Johnson", "street_address": "334 Bay Rd", "city": "Santa Fe", "postal_code": "87501" }
Since most rows in a table of names and addresses will have the same attributes, a document structure is not necessary for this kind of data. Now consider the case of a product catalog that lists both appliances and furniture, where the attributes differ by product type. Here is an example of how a dishwasher and a chair might be represented:
[ { "id": "123456", "product_type": "dishwasher", "length": "24 in", "width": "34 in", "weight": "175 lbs", "power": "1800 watts" }, { "id": "987654", "product_type": "chair", "weight": "15 kg", "style": "modern", "color": "brown" } ]
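To make the cost of variation concrete, the following sketch loads the two catalog entries above as Python dictionaries and counts how many NULL cells a single relational table would need if every attribute became a column. The helper name nulls_if_relational is hypothetical and exists only for this illustration.

```python
# The two catalog documents from the example above.
catalog = [
    {"id": "123456", "product_type": "dishwasher", "length": "24 in",
     "width": "34 in", "weight": "175 lbs", "power": "1800 watts"},
    {"id": "987654", "product_type": "chair", "weight": "15 kg",
     "style": "modern", "color": "brown"},
]

# Attributes shared by every document versus attributes found in only some.
key_sets = [set(doc) for doc in catalog]
common = set.intersection(*key_sets)
all_keys = set.union(*key_sets)
sparse = all_keys - common


def nulls_if_relational(docs, columns):
    """Count NULL cells if every attribute became a column in one table."""
    return sum(1 for doc in docs for col in columns if col not in doc)


null_cells = nulls_if_relational(catalog, all_keys)
print(sorted(common), sorted(sparse), null_cells)
```

Only id, product_type, and weight are common to both products. The other five attributes apply to one product type each, so a single wide table would carry a NULL for each of them in the row where it does not apply, and the count grows as more product types are added.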
In addition to document databases, wide-column databases, such as Bigtable and Cassandra, are also used with datasets with varying attributes.
Data Access Patterns
Data is accessed in different ways for different use cases. Some time-series data points may be read immediately after they are written, but they are not likely to be read once they are more than a day old. Customer order data may be read repeatedly as an order is processed. Archived data may be accessed less than once a year. Four metrics to consider about data access are as follows:
How much data is retrieved in a read operation?
How much data is written in an insert operation?
How often is data written?
How often is data read?
Some read and write operations apply to small amounts of data. Reading or writing a single piece of telemetry data is an example. Writing an e-commerce transaction may also entail a small amount of data. A database storing telemetry data from thousands of sensors that push data every five seconds will write a large volume of data in aggregate, whereas an online transaction processing database for a small online retailer will also write small individual units of data but at a much lower rate. These workloads require different kinds of databases. The telemetry data, for example, is better suited to Bigtable, with its low-latency writes, and the retailer transaction data is a good use case for Cloud SQL, with support for sufficient I/O operations to handle relational database loads.
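A rough calculation shows how far apart these two write rates can be. The figures below are assumptions chosen only for illustration: the text says "thousands of sensors," so 5,000 sensors is a guess, as is 100 orders per hour for the small retailer.

```python
def writes_per_second(sources: int, interval_seconds: float) -> float:
    """Aggregate write rate when each source writes once per interval."""
    return sources / interval_seconds


# Assumed: 5,000 sensors, each pushing one reading every five seconds.
telemetry_rate = writes_per_second(sources=5_000, interval_seconds=5)

# Assumed: a small retailer recording 100 orders per hour.
retail_rate = 100 / 3600

# Daily totals, using 86,400 seconds per day.
telemetry_per_day = telemetry_rate * 86_400
retail_per_day = retail_rate * 86_400

ratio = telemetry_rate / retail_rate
print(f"telemetry: {telemetry_rate:.0f} writes/s, retail: {retail_rate:.3f} writes/s")
```

Under these assumptions, both workloads write small records, but the telemetry system sustains a write rate several orders of magnitude higher, which is the difference that drives the choice of storage technology.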
Cloud Storage supports ingesting large volumes of data in bulk using tools such as the Storage Transfer Service and Transfer Appliance. (Cloud Storage also supports streaming transfers, but bulk reads and writes are more common.) Data in Cloud Storage is read at the object or the file level. You typically don’t, for example, seek a particular block within a file as you can when storing a file on a filesystem.
It is common to read large volumes of data in BigQuery as well; however, in that case we often read a small number of columns across a large number of rows. BigQuery optimizes for these kinds of reads by using a columnar storage format known as Capacitor. Capacitor is designed to store semi-structured data with nested and repeated fields.
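The benefit of a columnar layout for this access pattern can be illustrated with a toy example. This sketch only contrasts row-oriented and column-oriented layouts in plain Python. It does not model Capacitor's actual format, and the sample records are invented.

```python
# Row-oriented layout: all fields of a record are stored together.
rows = [
    {"order_id": 1, "region": "west", "amount": 20.0, "note": "gift"},
    {"order_id": 2, "region": "east", "amount": 35.5, "note": ""},
    {"order_id": 3, "region": "west", "amount": 12.5, "note": "rush"},
]

# Column-oriented layout: all values of a column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column touches only that column's values...
total = sum(columns["amount"])
values_read_columnar = len(columns["amount"])

# ...whereas a row store must scan every field of every row to reach `amount`.
values_read_row_store = sum(len(row) for row in rows)

print(total, values_read_columnar, values_read_row_store)
```

With four fields per record, the columnar layout reads a quarter of the values the row layout does for this query, and the gap widens as tables gain more columns.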
Data access patterns can help identify the best storage technology for a use case by highlighting key features needed to support those access patterns.
Security Requirements
Different storage systems will have different levels of access controls. Cloud Storage, for example, can have access controls at the bucket and the object level. If someone has access to a file in Cloud Storage, they will have access to all the data in that file. If some users have access only to a subset of a dataset, then the data could be stored in a relational database and a view could be created that includes only the data that the user is allowed to access.
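The view-based approach can be sketched with Python's built-in sqlite3 module. The table, view, and column names are hypothetical, as are the sample rows. SQLite has no user accounts, so this shows only how a view narrows what is visible. In a server database, you would additionally grant users SELECT on the view and not on the underlying table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical base table holding data that not every user should see.
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER, region TEXT, customer_email TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "west", "a@example.com", 20.0),
        (2, "east", "b@example.com", 35.5),
        (3, "west", "c@example.com", 12.5),
    ],
)

# The view exposes only western-region rows and omits the email column.
conn.execute(
    "CREATE VIEW west_orders AS "
    "SELECT order_id, amount FROM orders WHERE region = 'west'"
)

cursor = conn.execute("SELECT * FROM west_orders ORDER BY order_id")
view_columns = [col[0] for col in cursor.description]
visible = cursor.fetchall()
print(view_columns, visible)
```

The view restricts both rows and columns, which is a finer level of control than object-level permissions on a file.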
Encrypting data at rest is an important requirement for many use cases; fortunately, all Google Cloud storage services encrypt data at rest.
When choosing a storage technology, the ability to control access to data is a key consideration.
Types of Structure: Structured, Semi-Structured, and Unstructured
For the purposes of choosing a storage technology, it is helpful to consider how data is structured. There are three widely recognized categories:
Structured
Semi-structured
Unstructured
These categories are particularly