Data Science For Dummies. Lillian Pierson
FIGURE 2-1: Popular sources of big data.
Grasping the Differences among Data Approaches
Data science, machine learning engineering, and data engineering cover different functions within the big data paradigm — an approach wherein huge velocities, varieties, and volumes of structured, unstructured, and semistructured data are being captured, processed, stored, and analyzed using a set of techniques and technologies that are completely novel compared to those that were used in decades past.
All these functions are useful for deriving knowledge and actionable insights from raw data. All are essential elements of any comprehensive decision-support system, and all are extremely helpful when formulating robust strategies for future business growth. Although the terms data science and data engineering are often used interchangeably, they’re distinct domains of expertise. Over the past five years, the role of machine learning engineer has emerged to bridge the gap between data science and data engineering. In the following sections, I introduce concepts that are fundamental to data science and data engineering, as well as the hybrid machine learning engineering role, and then I show you the differences in how these roles function in an organization’s data team.
Defining data science
If science is a systematic method by which people study and explain domain-specific phenomena that occur in the natural world, you can think of data science as the scientific domain that’s dedicated to knowledge discovery via data analysis.
With respect to data science, the term domain-specific refers to the industry sector or subject matter domain that data science methods are being used to explore.
Data scientists use mathematical techniques and algorithmic approaches to derive solutions to complex business and scientific problems. Data science practitioners use its predictive methods to derive insights that are otherwise unattainable. In business and in science, data science methods can provide more robust decision-making capabilities:
In business, the purpose of data science is to empower businesses and organizations with the data insights they need in order to optimize organizational processes for maximum efficiency and revenue generation.
In science, data science methods are used to derive results and develop protocols for achieving the specific scientific goal at hand.
Data science is a vast and multidisciplinary field. To call yourself a true data scientist, you need to have expertise in math and statistics, computer programming, and your own domain-specific subject matter.
Using data science skills, you can do cool things like the following:
Use machine learning to optimize energy usage and lower corporate carbon footprints.
Optimize tactical strategies to achieve goals in business and science.
Predict unknown contaminant levels from sparse environmental datasets.
Design automated theft- and fraud-prevention systems to detect anomalies and trigger alarms based on algorithmic results.
Craft site-recommendation engines for use in land acquisitions and real estate development.
Implement and interpret predictive analytics and forecasting techniques for net increases in business value.
Data scientists must have extensive and diverse quantitative expertise to be able to solve these types of problems.
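To make the fraud-prevention item in the preceding list a bit more concrete, here is a minimal sketch of anomaly detection in Python. It is only an illustration under stated assumptions: the transaction amounts are made up, the function name flag_anomalies is hypothetical, and the rule it applies (a robust "modified z-score" built on the median, with a cutoff of 3.5) is one common textbook choice rather than a method prescribed in this chapter.

```python
from statistics import median

def flag_anomalies(amounts, threshold=3.5):
    """Return the positions of values that sit unusually far from the median.

    Uses the modified z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. The 3.5 cutoff is an assumed, conventional default.
    """
    center = median(amounts)
    mad = median(abs(x - center) for x in amounts)
    if mad == 0:
        # All values (or most of them) are identical; nothing to compare against.
        return []
    return [
        i for i, x in enumerate(amounts)
        if 0.6745 * abs(x - center) / mad > threshold
    ]

# Hypothetical card transactions, with one amount that looks out of place.
transactions = [25.0, 31.5, 28.0, 22.75, 30.0, 27.5, 26.0, 29.0, 24.5, 950.0]
suspicious = flag_anomalies(transactions)

for i in suspicious:
    # In a real system, this is where an alarm would be triggered.
    print(f"Transaction {i} flagged for review: {transactions[i]}")
```

A production fraud system would likely use far richer features and a trained model, but the shape is the same: score each record, compare the score to a threshold, and trigger an alarm when the threshold is crossed.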
Machine learning is the practice of applying algorithms to learn from, and make automated predictions from, data.
Defining machine learning engineering
A machine learning engineer is essentially a software engineer who is skilled enough in data science to deploy advanced data science models within the applications they build, thus bringing machine learning models into production in a live environment like a Software as a Service (SaaS) product or even just a web page. Contrary to what you may have guessed, the role of machine learning engineer is a hybrid between a data scientist and a software engineer, not a data engineer. A machine learning engineer is, at their core, a well-rounded software engineer who also has a solid foundation in machine learning and artificial intelligence. This person doesn’t need to know as much data science as a data scientist but should know much more about computer science and software development than a typical data scientist.
Software as a Service (SaaS) is a term that describes cloud-hosted software services that are made available to users via the Internet. Examples of popular SaaS companies include Salesforce, Slack, HubSpot, and so many more.
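The handoff described in this section, where a model built with data science methods ends up inside a running application, can be sketched in a few lines of Python. Treat this as a toy illustration, not a real deployment recipe: the functions fit_line and handle_request, the JSON "artifact," and the data are all hypothetical, and the "model" is just a straight line fitted by ordinary least squares.

```python
import json

def fit_line(xs, ys):
    """Data science side: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return {"slope": slope, "intercept": intercept}

# Step 1: train the model and save its parameters as a portable artifact.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
artifact = json.dumps(model)

def handle_request(payload, saved_model):
    """Engineering side: load the saved model and serve a prediction.

    `payload` stands in for the body of a web request (an assumed format).
    """
    params = json.loads(saved_model)
    x = payload["x"]
    return {"prediction": params["slope"] * x + params["intercept"]}

# Step 2: the application answers a (simulated) incoming request.
response = handle_request({"x": 10}, artifact)
print(response)
```

The point of the split is that the code in handle_request never needs to know how the model was trained. It only needs the saved parameters, which is roughly the boundary a machine learning engineer manages in a real product.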
Defining data engineering
If engineering is the practice of using science and technology to design and build systems that solve problems, you can think of data engineering as the engineering domain that’s dedicated to building and maintaining data systems for overcoming data processing bottlenecks and data handling problems that arise from handling the high volume, velocity, and variety of big data.
Data engineers use skills in computer science and software engineering to design systems for, and solve problems with, handling and manipulating big datasets. Data engineers often have experience working with (and designing) real-time processing frameworks and massively parallel processing (MPP) platforms (discussed later in this chapter), as well as with RDBMSs. They generally code in Java, C++, Scala, or Python. They know how to deploy Hadoop MapReduce or Spark to handle, process, and refine big data into datasets with more manageable sizes. Simply put, with respect to data science, the purpose of data engineering is to engineer large-scale data solutions by building coherent, modular, and scalable data processing platforms from which data scientists can subsequently derive insights.
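To give a feel for how a MapReduce-style job refines raw records into a smaller, more manageable dataset, here is a minimal sketch in plain Python. It imitates the map, shuffle, and reduce steps on a handful of made-up sales records. It does not use the Hadoop or Spark APIs, and every name in it (raw_lines, map_phase, shuffle, reduce_phase) is hypothetical.

```python
from collections import defaultdict

# Hypothetical raw input: one "store,amount" record per line.
raw_lines = [
    "store_a,120.0",
    "store_b,75.5",
    "store_a,30.0",
    "store_c,12.25",
    "store_b,24.5",
]

def map_phase(line):
    """Map: turn one raw line into a (key, value) pair."""
    store, amount = line.split(",")
    return store, float(amount)

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Reduce: collapse each group of values into a single summary number."""
    return sum(values)

mapped = [map_phase(line) for line in raw_lines]
grouped = shuffle(mapped)
totals = {store: reduce_phase(values) for store, values in grouped.items()}

print(totals)
```

Five input records become three summary rows. On a real cluster, the map and reduce steps run in parallel across many machines, which is what lets the same pattern scale to much larger datasets.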
Most engineered systems are built systems: they are constructed or manufactured in the physical world. Data engineering is different, though. It involves designing, building, and implementing software solutions to problems in the data world, a world that can seem abstract when compared to the physical reality of the Golden Gate Bridge or the Aswan Dam.
Using data engineering skills, you can, for example:
Integrate data pipelines with the natural language processing (NLP) services that were built by data scientists at your company.
Build mission-critical data platforms capable of processing more than 10 billion transactions per day.
Tear down data silos by finally migrating your company’s data from a more traditional