Data Science For Dummies. Lillian Pierson
No matter how the data is combined or where it’s stored, if you’re a data scientist, you almost always have to query data — write commands to extract relevant datasets from data storage systems, in other words. Most of the time, you use Structured Query Language (SQL) to query data. (Chapter 7 is all about SQL, so if the acronym scares you, jump ahead to that chapter now.)
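As a rough illustration of what querying looks like in practice, here is a minimal sketch that uses Python's built-in sqlite3 module. The sales table, its columns, and the sample rows are hypothetical and exist only for this example; a real project would connect to an existing data storage system instead of an in-memory database.

```python
import sqlite3

# Hypothetical in-memory database standing in for a real data storage system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 45.5)],
)

# The query itself: extract only the rows relevant to the analysis.
rows = conn.execute(
    "SELECT region, amount FROM sales WHERE region = ? ORDER BY amount",
    ("north",),
).fetchall()
conn.close()

print(rows)
```

The WHERE clause is what narrows the stored data to the relevant subset; the rest is setup so that the sketch runs on its own.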
Whether you’re using a third-party application or doing custom analyses by using a programming language such as R or Python, you can choose from a number of universally accepted file formats:
Comma-separated values (CSV): Almost every brand of desktop and web-based analysis application accepts this file type, as do commonly used scripting languages such as Python and R.
Script: Most data scientists know how to use either the Python or R programming language to analyze and visualize data. These script files end with the extension .py or .ipynb (Python) or .r (R).
Application: Excel is useful for quick-and-easy, spot-check analyses on small- to medium-size datasets. These application files have the .xls or .xlsx extension.
Web programming: If you're building custom, web-based data visualizations, you may be working in D3.js — or data-driven documents, a JavaScript library for data visualization. When you work in D3.js, you use data to manipulate web-based documents using .html, .svg, and .css files.
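To make the CSV format from the list above concrete, the following sketch reads comma-separated values with Python's standard csv module. The column names and values are made up for illustration, and the text is held in a string here only so that the example is self-contained; in practice the data would usually come from a .csv file on disk.

```python
import csv
import io

# Hypothetical CSV content; normally this would be read from a file.
raw = "city,population\nLima,10\nOslo,1\n"

# DictReader uses the first row as column names for every later row.
reader = csv.DictReader(io.StringIO(raw))
records = [
    {"city": row["city"], "population": int(row["population"])}
    for row in reader
]

# A simple spot-check calculation on the loaded records.
total = sum(record["population"] for record in records)
print(records, total)
```

Every value in a CSV file arrives as text, which is why the population column is converted with int() before it is summed.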
Applying mathematical modeling to data science tasks
Data science relies heavily on a practitioner's math skills (and statistics skills, as described in the following section) precisely because these are the skills needed to understand your data and its significance. These skills are also valuable in data science because you can use them to carry out predictive forecasting, decision modeling, and hypothesis testing.
Mathematics uses deterministic methods to form a quantitative (or numerical) description of the world; statistics is a form of science that’s derived from mathematics, but it focuses on using a stochastic (probability-based) approach and inferential methods to form a quantitative description of the world. Data scientists use mathematical methods to build decision models, generate approximations, and make predictions about the future. Chapter 4 tells you more about both disciplines and presents many mathematical approaches that are useful when working in data science.
In this book, I assume that you have a fairly solid skill set in basic math — you will benefit if you’ve taken college-level calculus or even linear algebra. I try hard, however, to meet readers where they are. I realize that you may be working based on a limited mathematical knowledge (advanced algebra or maybe business calculus), so I convey advanced mathematical concepts using a plain-language approach that’s easy for everyone to understand.
Deriving insights from statistical methods
In data science, statistical methods are useful for better understanding your data’s significance, for validating hypotheses, for simulating scenarios, and for making predictive forecasts of future events. Advanced statistical skills are somewhat rare, even among quantitative analysts, engineers, and scientists. If you want to go places in data science, though, take some time to get up to speed in a few basic statistical methods, like linear and logistic regression, naïve Bayes classification, and time series analysis. These methods are covered in Chapter 4.
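For a sense of what one of these basic methods involves, here is a minimal sketch of simple linear regression fitted by ordinary least squares in plain Python. The fit_line function name and the sample data are hypothetical, chosen only for illustration; real analyses typically rely on an established statistics library rather than hand-written formulas.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because the sample points fall exactly on a line, the fitted slope and intercept recover that line; with noisy real-world data, the same formulas give the best-fitting line in the least-squares sense.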
Coding, coding, coding — it’s just part of the game
Coding is unavoidable when you’re working in data science. You need to be able to write code so that you can instruct the computer in how to manipulate, analyze, and visualize your data. Programming languages such as Python and R are important for writing scripts for data manipulation, analysis, and visualization. SQL, on the other hand, is useful for data querying. Finally, the JavaScript library D3.js is often required for making cool, custom, and interactive web-based data visualizations.
Although coding is a requirement for data science, it doesn’t have to be this big, scary thing that people make it out to be. Your coding can be as fancy and complex as you want it to be, but you can also take a rather simple approach. Although these skills are paramount to success, you can pretty easily learn enough coding to practice high-level data science. I’ve dedicated Chapters 6 and 7 to helping you get to know the basics of what’s involved in getting started in Python and R, and querying in SQL (respectively).
Applying data science to a subject area
Statisticians once exhibited some measure of obstinacy in accepting the significance of data science. Many statisticians have cried out, “Data science is nothing new — it’s just another name for what we’ve been doing all along!” Although I can sympathize with their perspective, I’m forced to stand with the camp of data scientists who markedly declare that data science is separate, and definitely distinct, from the statistical approaches that comprise it.
My position on the unique nature of data science is based to some extent on the fact that data scientists often use computer languages not used in traditional statistics and take approaches derived from the field of mathematics. But the main point of distinction between statistics and data science is the need for subject matter expertise.
Because statisticians usually have only a limited amount of expertise in fields outside of statistics, they’re almost always forced to consult with a subject matter expert (SME) to verify exactly what their findings mean and to determine the best direction in which to proceed. Data scientists, on the other hand, should have strong subject matter expertise in the area in which they’re working. Data scientists generate deep insights and then use their domain-specific expertise to understand exactly what those insights mean with respect to the area in which they’re working.
The following list describes a few ways in which today’s knowledge workers are coupling data science skills with their respective areas of expertise in order to amplify the results they generate.
Clinical informatics scientists combine their healthcare expertise with data science skills to produce personalized healthcare treatment plans. They use healthcare informatics to predict and preempt future health problems in at-risk patients.
Marketing data scientists combine data science with marketing expertise to predict and preempt customer churn (the loss of customers from a product or service to that of a competitor, in other words). They also optimize marketing strategies, build recommendation engines, and fine-tune marketing mix models. I tell you more about using data science to increase marketing ROI in Chapter 11.
Data