Official Google Cloud Certified Professional Data Engineer Study Guide. Dan Sullivan


      Structured Data Structured data has a fixed set of attributes that can be modeled in a table of rows and columns.

      Unstructured Data Unstructured data does not fit into a tabular structure. Images and audio files are good examples of unstructured data. In between these two extremes lies semi-structured data, which has characteristics of both structured and unstructured.

      Structured: Transactional vs. Analytical

      Structured data can be represented in tables of columns and rows, where columns are attributes and rows are records or entities. Table 1.1 showed an example of structured data. Structured data may be oriented to transactional processes or analytical use cases.

      Transactional structured data is often operated on one row at a time. For example, a business application may look up a customer’s account information from a customer table when displaying data about the customer’s shipping address. Multiple columns from a single row will be used, so it is efficient to store all row attributes together in a data block. Retrieving a single data block will retrieve all the needed data. This is a common pattern in transactional databases such as Cloud SQL and Cloud Spanner, which use row-oriented storage.

      Now consider a data warehousing example. A business analyst is working with a sales data mart and wants to understand how sales last month compare to the same period last year. The data mart has one row for each product on each date, and each row includes the following attributes in addition to product and date: number of units sold, total revenue for units sold, average unit price, average marginal revenue, and total marginal revenue. The analyst is only interested in the monthly sums of total revenue for units sold for each product. In this case, the analyst would query many rows and only three columns. Instead of retrieving the full row for all rows selected, it is more efficient to retrieve only the date, product, and total revenue for units sold columns. This is a common pattern in analytical applications and the reason why BigQuery uses a column-oriented storage mechanism.
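      The difference between the two layouts can be sketched in a few lines of Python. This is only an in-memory illustration, not how Cloud SQL, Cloud Spanner, or BigQuery store data internally. The sample rows, the attribute names, and the function `monthly_revenue_by_product` are made up for the example.

```python
# Made-up sales data mart rows: one row per product per date.
rows = [
    {"date": "2019-10-01", "product": "A", "units_sold": 3, "total_revenue": 30.0, "avg_unit_price": 10.0},
    {"date": "2019-10-02", "product": "A", "units_sold": 2, "total_revenue": 20.0, "avg_unit_price": 10.0},
    {"date": "2019-10-01", "product": "B", "units_sold": 1, "total_revenue": 5.0, "avg_unit_price": 5.0},
]

# Row-oriented layout: all attributes of one record are stored together.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented layout: all values of one attribute are stored together.
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def monthly_revenue_by_product(columns):
    """Sum total_revenue per (month, product), reading only three columns."""
    totals = {}
    for date, product, revenue in zip(
        columns["date"], columns["product"], columns["total_revenue"]
    ):
        key = (date[:7], product)  # "YYYY-MM" prefix identifies the month
        totals[key] = totals.get(key, 0.0) + revenue
    return totals


totals = monthly_revenue_by_product(column_store)
```

      The analytical query never reads `units_sold` or `avg_unit_price`, whereas a lookup of one full record in `row_store` gets every attribute in a single read.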

      Semi-Structured: Fully Indexed vs. Row Key Access

      Semi-structured data, as noted earlier, does not follow a fixed tabular format and instead stores schema attributes along with the data. In the case of document databases, this allows developers to add attributes as needed without making changes to a fixed database schema. Two ways of storing semi-structured data are as documents or as wide columns. An important distinction between the two is how data is retrieved from them.

      Fully Indexed, Semi-Structured Data

      {
        {'id': '123456', 'product_type': 'dishwasher', 'length': '24 in', 'width': '34 in', 'weight': '175 lbs', 'power': '1800 watts'}
        {'id': '987654', 'product_type': 'chair', 'weight': '15 kg', 'style': 'modern', 'color': 'brown'}
      }

      To search efficiently by attributes, document databases allow for indexes. If you use Cloud Datastore, for example, you could create indexes on each of the attributes as well as a combination of attributes. Indexes should be designed to support the way that data is queried. If you expect users to search for chairs by specifying style and color together, then you should create a style and color index. If you expect customers to search for appliances by their power consumption, then you should create an index on power.
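      As a rough sketch of the idea, an index can be thought of as a map from attribute values to the documents that contain them. The code below is an in-memory illustration only and does not use the Cloud Datastore API. The helper `build_index` is hypothetical, and the documents are the two catalog entries shown earlier.

```python
from collections import defaultdict

documents = [
    {"id": "123456", "product_type": "dishwasher", "length": "24 in",
     "width": "34 in", "weight": "175 lbs", "power": "1800 watts"},
    {"id": "987654", "product_type": "chair", "weight": "15 kg",
     "style": "modern", "color": "brown"},
]


def build_index(docs, attributes):
    """Map each combination of attribute values to the ids of matching documents.

    Documents that lack any of the attributes are left out of the index.
    """
    index = defaultdict(list)
    for doc in docs:
        if all(attr in doc for attr in attributes):
            key = tuple(doc[attr] for attr in attributes)
            index[key].append(doc["id"])
    return index


# A composite index for searches on style and color together,
# and a single-attribute index for searches on power.
style_color_index = build_index(documents, ("style", "color"))
power_index = build_index(documents, ("power",))

modern_brown = style_color_index.get(("modern", "brown"), [])
by_power = power_index.get(("1800 watts",), [])
```

      Each lookup goes straight to the matching ids without scanning every document, which is why the indexes should mirror the queries you expect.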

      Creating a large number of indexes can significantly increase the amount of storage used. In fact, it is not unusual for total index storage to exceed the storage used for the documents themselves. Also, additional indexes can negatively impact performance for insert, update, and delete operations, because every affected index must be revised to reflect those operations.
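      The write cost can be made concrete by counting how many index entries a single document write has to maintain. The index definitions and the helper `index_entries` below are hypothetical and serve only to illustrate the point.

```python
def index_entries(doc, indexes):
    """List the (index, key) entries that a write of doc must maintain."""
    entries = []
    for attrs in indexes:
        # An index only holds documents that have all of its attributes.
        if all(a in doc for a in attrs):
            entries.append((attrs, tuple(doc[a] for a in attrs)))
    return entries


# Hypothetical index definitions: three single-attribute, one composite.
indexes = [("style",), ("color",), ("style", "color"), ("power",)]

chair = {"id": "987654", "product_type": "chair", "weight": "15 kg",
         "style": "modern", "color": "brown"}

entries = index_entries(chair, indexes)
writes_per_document = len(entries)
```

      Here one insert of the chair document results in three index entries in addition to the document itself, and an update or delete would have to revise the same three.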

      Row Key Access

      Wide-column databases usually take a different approach to querying. Rather than using indexes to allow efficient lookup of rows with needed data, wide-column databases organize data so that rows with similar row keys are close together. Queries use a row key, which is analogous to a primary key in relational databases, to retrieve data. This has two implications.

Sensor ID   Timestamp    Temperature   Relative humidity   Pressure
789         1571760690   40            35                  28.2
790         1571760698   42.5          50                  29.1
791         1571760676   37            61                  28.6
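      Row key access over the sensor rows above can be sketched as follows. This assumes that the sensor ID is the row key and that keys are compared as strings. It is an in-memory illustration, not the API of any particular wide-column database, and the functions `get` and `scan` are hypothetical.

```python
import bisect

# Rows from the sensor table, kept sorted by row key (the sensor ID here).
table = sorted([
    ("789", {"timestamp": 1571760690, "temperature": 40,
             "relative_humidity": 35, "pressure": 28.2}),
    ("790", {"timestamp": 1571760698, "temperature": 42.5,
             "relative_humidity": 50, "pressure": 29.1}),
    ("791", {"timestamp": 1571760676, "temperature": 37,
             "relative_humidity": 61, "pressure": 28.6}),
], key=lambda item: item[0])
keys = [k for k, _ in table]


def get(row_key):
    """Look up a single row by its key with a binary search."""
    i = bisect.bisect_left(keys, row_key)
    if i < len(keys) and keys[i] == row_key:
        return table[i][1]
    return None


def scan(start_key, end_key):
    """Return rows with start_key <= key < end_key.

    Rows with similar keys sit next to each other, so a range is one slice.
    """
    lo = bisect.bisect_left(keys, start_key)
    hi = bisect.bisect_left(keys, end_key)
    return table[lo:hi]


row_790 = get("790")
first_two = [k for k, _ in scan("789", "791")]
```

      Both operations depend only on the row key. A query on temperature or pressure would have no ordering to exploit and would need to read every row.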


Timestamp Sensor ID Temperature Relative humidity Pressure