The original three Vs of big data came from a Gartner Group analyst named Doug Laney, way back in 2001. Volume, variety, and velocity were primarily aspirational characteristics of data environments, describing next-generation capabilities beyond what the relational databases of the time were capable of supporting.
Over the years, other industry analysts, bloggers, consultants, and product vendors added to the list with their own Vs. The difference between the original three Vs and those that followed, though, is that value, veracity, visualization, and others all apply to tried-and-true relational technology just as much as to big data.
Don’t get confused trying to decide how many Vs apply to big data and to data lakes. Just focus on the original three — volume, variety, and velocity — as the must-have characteristics of your data lake.
You’ll find varying perspectives on the relationship between big data and data lakes, which certainly confuses the issue. Some technologists reverse that relationship; they consider a data lake to be the core technology and big data to be the overall environment. So, if you run across a blog post or another description that differs from the one I use, don’t worry. As with almost everything about data lakes and much of the technology world, you’ll find all sorts of opinions and perspectives, especially when you don’t have any official standards to govern a discipline.
The Hadoop open source environment, particularly the Hadoop Distributed File System (HDFS), is one of the first and most popular examples of big data. Some of the earliest data lakes were built, or at least begun, using HDFS as the foundation.
For purposes of establishing a data lake foundation, Amazon’s S3 and Microsoft’s ADLS both qualify as big data. Why? Both S3 and ADLS support the three Vs of big data, which are as follows:
Storing extremely large volumes of data
Supporting a variety of data, including structured, unstructured, and semi-structured data
Allowing data to flow into the data lake at very high velocity, rather than requiring (or at least encouraging) periodic batches of data
Think of big data as a core technology foundation that supports the three Vs of next-generation data management. Big data by itself, however, is just a platform. It’s the natural body of water — the lake itself — at a popular lakeside resort. When you divide your big data into multiple zones, add capabilities to transmit data across those zones, and then govern the whole environment, you’ve built a data lake surrounding that big data foundation. You’ve done the analytical data equivalent of building the docks, the restaurants, and the boat slips surrounding the lake itself.
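If the zone idea feels abstract, the following minimal sketch shows it in Python with the boto3 library, assuming an S3-based lake. The bucket name, zone prefixes, and object keys are hypothetical placeholders, not anything prescribed by AWS or by this book; a real data lake layers naming standards, security, and governance on top of conventions like these.

```python
# Minimal sketch: zones in an S3-based data lake are just agreed-upon
# key prefixes within a bucket. Bucket name and prefixes are hypothetical.
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "example-corp-data-lake"  # hypothetical bucket name

ZONES = {
    "raw": "raw/",              # data lands here exactly as it arrives
    "curated": "curated/",      # cleansed, standardized data
    "analytics": "analytics/",  # data shaped for reporting and analysis
}

# Variety: structured (CSV), semi-structured (JSON), and unstructured
# (image) data can all land side by side in the raw zone.
s3.put_object(
    Bucket=BUCKET,
    Key=ZONES["raw"] + "sales/2024-06-01/orders.csv",
    Body=b"order_id,amount\n1001,25.00\n",
)
s3.put_object(
    Bucket=BUCKET,
    Key=ZONES["raw"] + "clickstream/2024-06-01/events.json",
    Body=json.dumps({"user": "u42", "action": "view", "page": "/home"}).encode(),
)
s3.upload_file(
    "store_photo.jpg",  # an unstructured image file on local disk
    BUCKET,
    ZONES["raw"] + "images/store_photo.jpg",
)

# Moving data "across zones" can be as simple as copying an object from
# the raw prefix to the curated prefix once it has been cleansed.
s3.copy_object(
    Bucket=BUCKET,
    CopySource={"Bucket": BUCKET, "Key": ZONES["raw"] + "sales/2024-06-01/orders.csv"},
    Key=ZONES["curated"] + "sales/2024-06-01/orders.csv",
)
```

Conceptually, ADLS works much the same way: the zones are naming and governance conventions layered on top of object storage, not separate physical systems.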
The Data Lake Water Gets Murky
In addition to data lakes, you may come across references to data ponds, data puddles, data rivers, data oceans, and data hot tubs. (Just kidding about the last one.) What’s going on here?
Your job when planning, architecting, building, and using a data lake is complicated by the fact that you don’t have an official definition published by some sort of standards body, such as the American National Standards Institute (ANSI) or the International Organization for Standardization (ISO). That means that you or anyone else can define, use, and even publish your own terminology. You can call a smaller portion of a data lake a “data pond” if you want, or refer to a collection of data lakes as a “data ocean.”
Don’t panic! Of all the “data plus a body of water” terms you’ll run across, data lake is by far the most commonly used. All the characteristics of a data lake — solid architecture, support for multiple forms of data, a support ecosystem surrounding the data — apply to what you can call a data pond or any other term.
If William Shakespeare were still around and plied his trade as an enterprise data architect rather than as a writer, he would put it this way: “A data lake by any other name would still be worth the time and effort to build.”
BACK TO THE FUTURE WITH NAME CHANGES
In the early 1990s, data warehousing was the newest and most popular game in town for analytical data management. By the mid-’90s, the concept of a data warehouse was adapted to a data mart — essentially, a smaller-scale data warehouse. The original idea behind a data mart called for the data warehouse feeding a subset of its data into one or more data marts — sort of a “wholesaler-retailer” model.
The first generation of data warehouse projects, especially very large ones, was hallmarked by a high failure rate. By the late ’90s, data warehouses were viewed as large, complex, and expensive efforts that were also very risky. A data mart, on the other hand, was smaller, less complex, and less expensive, and, thus, considered to be less risky.
The need for integrated analytical data was stronger than ever by the end of the ’90s. But just try to get funding for a data warehousing project! Good luck!
Time for plan B.
Data warehouses went out of style for a while. Instead, data marts became the go-to solution for analytic data. No matter how big and complex an environment was, chances are, you’d refer to it as a data mart rather than a data warehouse. In fact, the idea of an independent data mart sprung up, and the original architecture for a data mart — receiving data from a data warehouse rather than directly from source systems — became known as a dependent data mart.
Fast-forward a couple of decades, and it’s back to the future. First, big data sort of evolved into data lakes. Now you have analysts, consultants, and vendors complicating the picture with their own terminology. This won’t be the last time you’ll see shifting names and terminology in the world of analytic data, so stay tuned!
Chapter 2
Planning Your Day (and the Next Decade) at the Data Lake
IN THIS CHAPTER
Taking advantage of big data
Broadening your data type horizons
Implementing a built-to-last analytical data environment
Reeling in existing stand-alone data marts