Data Lakes For Dummies. Alan R. Simon
Чтение книги онлайн.
Читать онлайн книгу Data Lakes For Dummies - Alan R. Simon страница 12
One or (usually) more mechanisms to move data from one part of the data lake to another.
A catalog or directory that helps keep track of what data is where, as well as the associated rules that apply to different groups of data; this is known as metadata.
Capabilities that help unify meanings and business rules for key data subjects that may come into the data lake from different applications and systems; this is known as master data management.
Monitoring services to track data quality and accuracy, response time when users access data, billing services to charge different organizations for their usage of the data lake, and plenty more.
Different types of data
If your data lake had a motto, it might be “All data are created equal.”
In a data lake, data is data is data. In other words, you don’t need to make special accommodations for more complex types of data than you would for simpler forms of data.
Your data lake will contain structured data, unstructured data, and semi-structured data (see Figure 1-3). The following sections cover these types of data in more detail.
Structured data: Staying in your own lane
You’re probably most familiar with structured data, which is made up of numbers, shorter-length character strings, and dates. Traditionally, most of the applications you’ve worked with have been based on structured data. Structured data is commonly stored in a relational database such as Microsoft SQL Server, MySQL, or Oracle Database.
FIGURE 1-3: Different types of data in your data lake.
In a database, you define columns (basically, fields) for each of your pieces of structured data, and each column is rigidly and precisely defined with the following:
A data type, such as INTEGER, DECIMAL, CHARACTER, DATE, DATETIME, or something similar
The size of the field, either explicitly declared (for example, how many characters a CHARACTER column will contain) or implicitly declared (the system-defined maximum number for an INTEGER or how a DATE column is structured)
Any specific rules that apply to a data column or field, such as the permissible range of values (for example, a customer’s age must be between 18 and 130) or a list of allowable values (for example, an employee’s current status can only be FULL-TIME, PART-TIME, TERMINATED, or RETIRED)
Any additional constraints, such as primary and foreign key designations, or referential integrity (rules that specify consistency for certain columns across multiple database tables)
Unstructured data: A picture may be worth ten million words
Unstructured data is, by definition, data that lacks a formally defined structure. Images (such as JPEGs), audio (such as MP3s), and videos (such as MP4s or MOVs) are common forms of unstructured data.
Semi-structured data: Stuck in the middle of the lake
Semi-structured data sort of falls in between structured and unstructured data. Examples include a blog post, a social media post, text messages, an email message, or a message from Slack or Microsoft Teams. Leaving aside any embedded or attached images or videos for a moment, all these examples consist of a long string of letters, numbers, and special characters. However, there’s no particular structure assigned to most of these text strings other than perhaps a couple of lines of heading information. The body of an email may be very short — only a line or two — while another email can go on for many long paragraphs.
In your data lake, you need to have all these types of data sitting side by side. Why? Because you’ll be running analytics against the data lake that may need more than one form of data. For example, you receive and then analyze a detailed report of sales by department in a large department store during the past month.
Then, after noticing a few anomalies in the sales numbers, you pull up in-store surveillance video to analyze traffic versus sales to better understand how many customers may be looking at merchandise but deciding not to make a purchase. You can even combine structured data from scanners with your unstructured video data as part of your analysis.
If you had to go to different data storage environments for your sales results (structured data) and then the video surveillance (unstructured data), your overall analysis is dramatically slowed down, especially if you need to integrate and cross-reference different types of data. With a data lake, all this data is sitting side by side, ready to be delivered for analysis and decision-making.
In their earliest days, relational databases only stored structured data. Later, they were extended with capabilities to store structured and unstructured data. Binary large objects (BLOBs) were a common way to store images and even video in a relational database. However, even an object-extended relational database doesn’t make a good platform for a data lake when compared with modern data services such as Amazon S3 or Microsoft ADLS.
Different water, different data
A common misconception is that you store “all your data” in your data lake. Actually, you store all or most of your analytic data in a data lake. Analytic data is, as you may suspect from the name, data that you’re using for analytics. In contrast, you use operational data to run your business.
What’s the difference? From one perspective, operational and analytic data are one and the same. Suppose you work for a large retailer. A customer comes into one of your stores and makes some purchases. Another customer goes onto your company’s website and buys some items there. The records of those sales — which customers made the purchases, which products they bought, how many of each product, the dates of the sales, whether the sales were online or in a store, and so on — are all stored away as official records of those transactions, which are necessary for running your company’s operations.
But you also want to analyze that data, right? You want to understand which products are selling the best and where. You want to understand which customers are spending the most. You have dozens or even hundreds of questions you want to ask about your customers and their purchasing activity.
Here’s the catch: You need to make copies of your operational data for the deep analysis that you need to undertake; and the copies of that operational data are what goes into the data lake (see Figure 1-4). FIGURE 1-4: Source applications feeding