Data Lakes For Dummies. Alan R. Simon


Why did it happen? Diagnostic analytics
What’s happening right now? Descriptive analytics
What’s likely to happen? Predictive analytics
What’s something interesting and important out of this mountain of data? Discovery analytics
What are our options? Prescriptive analytics
What should we do? Prescriptive analytics

      Mapping your analytics needs to your data lake road map

      Jan, your CPO, is thrilled with the work that Raul and his team have done compiling the HR analytics continuum. They’ve produced an exhaustive list of more than 500 analytical functions that will be supported by the data lake, covering the broad continuum from simple “What happened?” descriptive analytics through more than a dozen complex prescriptive analytics scenarios.

      Now what?

      As you might guess, that 500-plus master list of HR analytics isn’t going to be available the first day your data lake goes operational. A data lake is built in a phased, incremental manner, probably over several years.

      But where to start?

      In Chapter 17, I show you how to build the road map that takes you from your first ideas about your data lake all the way through multiple phases of implementation.

      

Your data lake road map should be driven by your organization’s analytical needs rather than by available data. You should address your highest-impact, highest-value analytics needs first, for two reasons:

       You need the initial operating capability (IOC) of your data lake to come with some “oomph.” In other words, you want people across your organization to sit up and take notice that the data lake is, from its first days, providing some really great analytics.

       You want to build your data lake using a “pipeline” approach that not only loads your data lake with lots of data but carries that data all the way through to critical business insights.
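The prioritization described above can be sketched in a few lines of Python. This is only an illustration, not a method from the book: the `AnalyticsNeed` class, the 1-to-5 `impact` and `value` scores, and the sample HR analytics are all hypothetical, and a real road map would weigh many more factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyticsNeed:
    name: str
    impact: int  # hypothetical 1-5 score agreed on with the business
    value: int   # hypothetical 1-5 score agreed on with the business


def plan_phases(needs, per_phase):
    """Rank needs by combined score, highest first, then cut into phases."""
    ranked = sorted(needs, key=lambda n: n.impact + n.value, reverse=True)
    return [ranked[i:i + per_phase] for i in range(0, len(ranked), per_phase)]


# Made-up sample entries standing in for a much longer master list.
needs = [
    AnalyticsNeed("Headcount by department", impact=2, value=2),
    AnalyticsNeed("Attrition risk prediction", impact=5, value=5),
    AnalyticsNeed("Time-to-hire trend", impact=3, value=4),
    AnalyticsNeed("Retention offer recommendations", impact=5, value=4),
]

phases = plan_phases(needs, per_phase=2)
for number, phase in enumerate(phases, start=1):
    print(f"Phase {number}: {[need.name for need in phase]}")
```

The first phase holds the highest-scoring analytics, which is the point of the approach: the data you load first is whatever those analytics require, not whatever happens to be easiest to get.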

      Building the best data pipelines inside your data lake

      FIGURE 2-7: A data pipeline into, through, and then out of the data lake.

You can think of a data pipeline in the same context that you may think of shopping. Suppliers sell and ship their products to distributors, who then resell and ship some of those products to a wholesaler. The wholesaler then resells and ships the products yet again to a retailer, which is where you come to buy whatever it is that you’re looking for. Figure 2-8 shows how this paradigm can apply to data pipelines within a data lake.

      FIGURE 2-8: An easy way to understand data pipelines and data lakes.
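To make the supply-chain analogy concrete, here is a minimal Python sketch of data passing through a chain of hand-offs, each stage reselling a more finished product to the next. The stage names (`ingest`, `refine`, `curate`) and the sample HR records are assumptions made for illustration; they are not taken from the figure.

```python
from functools import reduce


def ingest(records):
    """Land the raw records unchanged (the supplier's shipment)."""
    return list(records)


def refine(records):
    """Drop incomplete records and normalize the department name."""
    return [
        {**r, "department": r["department"].strip().title()}
        for r in records
        if r.get("employee_id") is not None
    ]


def curate(records):
    """Summarize refined records into an analysis-ready headcount."""
    counts = {}
    for r in records:
        counts[r["department"]] = counts.get(r["department"], 0) + 1
    return counts


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda acc, stage: stage(acc), stages, data)


# Made-up raw data, including one record with no employee ID.
raw = [
    {"employee_id": 1, "department": " sales "},
    {"employee_id": None, "department": "sales"},
    {"employee_id": 2, "department": "engineering"},
    {"employee_id": 3, "department": "Sales"},
]

headcount = run_pipeline(raw, [ingest, refine, curate])
print(headcount)
```

The consumer at the end of the chain only ever sees the curated result, much as a shopper only ever deals with the retailer.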

      Addressing future gaps and shortfalls

      Your road map is only the beginning of your data lake journey. You may think you have a pretty good idea of what your data and analytical needs will be over the next couple of years, and you may do a good job of prioritizing the various phases of how your data lake will be built.

      

The world is constantly changing, though, which means that the farther out your data lake road map stretches, the more likely it is that any given phase will be preempted by changing priorities and new analytical needs.

      As your organization’s analytical needs evolve and — hopefully — become more sophisticated over time, you’ll continually adjust your data lake plans to reflect the real world.

      

Think of a data lake as a living entity that is subject to constant change. Remember that century-long life span of a U.S. Air Force B-52, with changing missions over the years being addressed by constantly incorporating new technology to extend the plane’s value.

      You can stream all kinds of data into your data lake as quickly as that data is created in your source applications. Suppose that you dedicate a portion of your data lake to analyzing your overall computer network traffic and server performance to help you detect possible security threats, network bottlenecks, and database performance slowdowns.

      You’ll be streaming tons of log data from your routers, gateways, firewalls, servers, databases — pretty much any piece of hardware in your enterprise — into your data lake as quickly as you can, while traffic flows across your network and transactions hit your databases. Then, just as quickly, you and your coworkers can analyze the rapidly incoming data and take necessary actions to keep everything running smoothly.

      At the same time, not everything needs to zoom into your data lake at lightning-fast speed. Think about a lake that not only has speedboats zipping all over but also has much larger ferry-type vessels that take hundreds of passengers at a time all around the lake. Some of those ferries also offer evening gourmet dinner cruises in addition to their daytime excursions.

      

You should think of your data lake as a variable-speed transportation engine for your enterprise data. If you need certain data blasted into your data lake as quickly as possible because you need to do immediate analysis, no problem! On the other hand, other data can be batched up and periodically brought into the data lake in bulk, on sort of a time-delayed basis, because you don’t need to do real-time analysis.
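The two speeds can be sketched as a small Python class that routes each incoming record by its source. Everything here is hypothetical: the `VariableSpeedLoader` name, the source labels, and the use of a simple record count in place of a real schedule for the batch path. A production data lake would rely on dedicated streaming and batch-loading services rather than in-memory lists.

```python
class VariableSpeedLoader:
    """Route records into the lake immediately or in periodic batches."""

    def __init__(self, streaming_sources, batch_size):
        self.streaming_sources = set(streaming_sources)
        self.batch_size = batch_size
        self.lake = []     # stands in for data lake storage
        self.pending = []  # batch records waiting for the next bulk load

    def receive(self, source, record):
        if source in self.streaming_sources:
            # Speedboat: the record is available for analysis right away.
            self.lake.append((source, record))
        else:
            # Ferry: wait until there are enough passengers, then sail.
            self.pending.append((source, record))
            if len(self.pending) >= self.batch_size:
                self.flush()

    def flush(self):
        """Bulk-load everything that has been waiting."""
        self.lake.extend(self.pending)
        self.pending.clear()


loader = VariableSpeedLoader(streaming_sources={"firewall"}, batch_size=2)

loader.receive("firewall", "blocked connection attempt")
after_stream = len(loader.lake)            # streamed record lands at once

loader.receive("payroll", "row 1")
after_one_batch_record = len(loader.lake)  # still waiting for the batch

loader.receive("payroll", "row 2")         # batch is full, so it loads
print(after_stream, after_one_batch_record, len(loader.lake))
```

Which sources go on which path is a business decision: it depends on how soon someone needs to analyze the data, not on how fast the source can produce it.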
