Today, however, the von Neumann model isn’t sufficient anymore: we can’t treat a multi-core processor or a large cluster as an agglomeration of many von Neumann machine instances communicating over some interconnect. Such a view places too much burden on the software developer to effectively take advantage of available computational resources—it simply is the wrong level of abstraction. MapReduce can be viewed as the first breakthrough in the quest for new abstractions that allow us to organize computations, not over individual machines, but over entire clusters. As Barroso puts it, the datacenter is the computer [18; 119].
To be fair, MapReduce is certainly not the first model of parallel computation that has been proposed. The most prevalent model in theoretical computer science, which dates back several decades, is the PRAM [60; 77].15 In the model, an arbitrary number of processors, sharing an unboundedly large memory, operate synchronously on a shared input to produce some output. Other models include LogP [43] and BSP [148]. For reasons that are beyond the scope of this book, none of these previous models have enjoyed the success that MapReduce has in terms of adoption and in terms of impact on the daily lives of millions of users.16
MapReduce is the most successful abstraction over large-scale computational resources we have seen to date. However, as anyone who has taken an introductory computer science course knows, abstractions manage complexity by hiding details and presenting well-defined behaviors to users of those abstractions. They, inevitably, are imperfect—making certain tasks easier but others more difficult, and sometimes, impossible (in the case where the detail suppressed by the abstraction is exactly what the user cares about). This critique applies to MapReduce: it makes certain large-data problems easier, but suffers from limitations as well. This means that MapReduce is not the final word, but rather the first in a new class of programming models that will allow us to more effectively organize computations at a massive scale.
So if MapReduce is only the beginning, what’s next beyond MapReduce? We’re getting ahead of ourselves, as we can’t meaningfully answer this question before thoroughly understanding what MapReduce can and cannot do well. This is exactly the purpose of this book: let us now begin our exploration.
1.4 WHAT THIS BOOK IS NOT
Actually, not quite yet… A final word before we get started. This book is about MapReduce algorithm design, particularly for text processing (and related) applications. Although our presentation most closely follows the Hadoop open-source implementation of MapReduce, this book is explicitly not about Hadoop programming. We don’t, for example, discuss APIs, command-line invocations for running jobs, etc. For those aspects, we refer the reader to Tom White’s excellent book, “Hadoop: The Definitive Guide” [154].
1 http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
2 http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/
3 http://public.web.cern.ch/public/en/LHC/Computing-en.html
4 As an aside, it is interesting to observe the evolving definition of large over the years. Banko and Brill’s paper in 2001 was titled Scaling to Very Very Large Corpora for Natural Language Disambiguation, and dealt with a corpus containing a billion words.
5 As in, so stupid it couldn’t possibly work.
6 This title was inspired by a classic article titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences [155]. This is somewhat ironic in that the original article lauded the beauty and elegance of mathematical models in capturing natural phenomena, which is the exact opposite of the data-driven approach.
7 On Exactitude in Science [23]. A similar exchange appears in Chapter XI of Sylvie and Bruno Concluded by Lewis Carroll (1893).
8 What is the difference between cloud computing and grid computing? Although both tackle the fundamental problem of how best to bring computational resources to bear on large and difficult problems, they start with different assumptions. Whereas clouds are assumed to be relatively homogeneous servers that reside in a datacenter or are distributed across a relatively small number of datacenters controlled by a single organization, grids are assumed to be a less tightly-coupled federation of heterogeneous resources under the control of distinct but cooperative organizations. As a result, grid computing tends to deal with tasks that are coarser-grained, and must deal with the practicalities of a federated environment, e.g., verifying credentials across multiple administrative domains. Grid computing has adopted a middleware-based approach for tackling many of these challenges.
9 The first example is Facebook, a well-known user of Hadoop, in exactly the manner as described [68]. The second is, of course, Google, which uses MapReduce to continuously improve existing algorithms and to devise new algorithms for ad selection and placement.
10 Adapted from a post by Ted Dunning on the Hadoop mailing list.
11 For more detail, Jacobs [76] provides real-world benchmarks in his discussion of large-data problems.
12 See also DeWitt and Gray [50] for slightly different definitions in terms of speedup and scaleup.
13 Note that this idea meshes well with utility computing, where a 100-machine cluster running for one hour would cost the same as a 10-machine cluster running for ten hours.
14 Guess when this was written? You may be surprised.
15 More than a theoretical model, the PRAM has been recently prototyped in hardware [153].
16 Nevertheless, it is important to understand the relationship between MapReduce and existing models so that we can bring to bear accumulated knowledge about parallel algorithms; for example, Karloff et al. [82] demonstrated that a large class of PRAM algorithms can be efficiently simulated via MapReduce.
CHAPTER 2
MapReduce Basics
The only feasible approach to tackling large-data problems today is to divide and conquer, a fundamental concept in computer science that is introduced very early in typical undergraduate curricula. The basic idea is to partition a large problem into smaller sub-problems. To the extent that the sub-problems are independent [5], they can be tackled in parallel by different workers—threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster. Intermediate results from each individual worker are then combined to yield the final output.1
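To make the pattern concrete, consider the following sketch, a minimal, hypothetical Java illustration (not drawn from the book or from Hadoop): it sums a large array by recursively partitioning it into independent sub-ranges, computing each sub-range in parallel using the fork/join framework, and combining the partial sums into the final result.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Divide-and-conquer sketch: sum a large array by splitting it into
// independent sub-ranges, summing each in parallel, and combining results.
public class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;   // below this size, solve directly
    private final long[] data;
    private final int lo, hi;                     // half-open range [lo, hi)

    public ParallelSum(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {               // small enough: solve sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (lo + hi) / 2;                  // partition into two sub-problems
        ParallelSum left = new ParallelSum(data, lo, mid);
        ParallelSum right = new ParallelSum(data, mid, hi);
        left.fork();                              // run the left half asynchronously
        long rightSum = right.compute();          // compute the right half in this thread
        long leftSum = left.join();               // wait for the left half to finish
        return leftSum + rightSum;                // combine intermediate results
    }

    public static void main(String[] args) {
        long[] data = new long[10_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 100;
        }
        long total = ForkJoinPool.commonPool()
                                 .invoke(new ParallelSum(data, 0, data.length));
        System.out.println("total = " + total);
    }
}

MapReduce applies this same pattern at datacenter scale: the framework partitions the input across machines, runs workers on each partition in parallel, and combines their intermediate results, leaving the developer to specify only the per-partition computation and the combination step.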
The