The Big R-Book. Philippe J. S. De Brouwer
The role of the data analyst in any company cannot be overestimated. It is on the shoulders of the reader of this book that the task rests, not only to read those patterns from the data but also to convince decision makers to act on this fact-based insight.
Because the role of data and analytics is so important, it is essential to follow scientific rigour. This means in the first place following the scientific method for data analysis. An interpretation of the scientific method for data-science is in Figure 2.2 on page 10.
Until now we have discussed the role of data scientists and the actions that they would take. But how does it look from the point of view of the data itself?
Using that scientific method for data-science, the most important thing is probably to make sure that one understands the data very well. Data in itself is meaningless. For example, 930 is just a number. It could be anything: from the age of Adam in Genesis, to the price of a chair or the code to unlock your bike-chain. It could be a time, and 930 could mean “9:30” (assume “am” if your time-zone habits require so). Knowing that interpretation, the number becomes information, but we cannot understand this information till we know what it means (it could be the time I woke up after a long party, the time of a plane to catch, a meeting at work, etc.). We can only understand the data if we know, for example, that it is a bus schedule of the bus “843-my-route-to-work”. This understanding, together with the insight that this bus always runs 15 minutes late and my will to catch the bus, can lead to action: to go out and wait for that bus and get on it.
data → information → insight → action
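The path from data to action in the bus example can be sketched in a few lines of R. This is only an illustration: the variable names are made up for this sketch, and reading the number as a clock time in "hhmm" form is the assumption taken from the example, not a general rule.

```r
# Data: a bare number without any meaning attached
raw <- 930L

# Information: assume the number encodes a clock time as hhmm
hours     <- raw %/% 100L   # integer division gives the hour
minutes   <- raw %% 100L    # the remainder gives the minutes
scheduled <- sprintf("%02d:%02d", hours, minutes)

# Insight: the bus is assumed to run 15 minutes late
delay    <- 15L
total    <- hours * 60L + minutes + delay
expected <- sprintf("%02d:%02d", total %/% 60L, total %% 60L)

# Action: decide when to be at the bus stop
cat("Scheduled:", scheduled, "- expected departure:", expected, "\n")
```

The number itself never changes; only the context added at each step turns it into something one can act on.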
This simple example shows us how the data cycle in any company or within any discipline should work. We first have a question, for example “to which customers can we lend money without pushing them into a debt-spiral?” Then one will collect data (from one's own systems or from a credit bureau). This data can then be used to create a model that allows us to reduce the complexity of all observations to the explanatory variables only: a projection into a space of lower dimension. That model helps us to get the insight from the data and, once put in production, allows us to decide on the right action for each credit application.
This institution will end up with a better credit approval process, in which fewer loss events occur. That is the role of data-science: to drive companies towards the creation of more sustainable wealth, in a future where all have a place and plenty.
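As a rough sketch of that cycle, the R code below simulates a small loan book, fits a logistic regression and scores one new application. Everything in it is an assumption made for illustration: the data is simulated (not real customer data), the variable names income and debt_ratio are hypothetical, and the 0.5 cut-off is arbitrary. A logistic regression is only one of many possible model choices.

```r
set.seed(1)  # make the simulated data reproducible
n <- 500

# Hypothetical explanatory variables (simulated, not real customer data)
income     <- rnorm(n, mean = 3000, sd = 800)
debt_ratio <- runif(n, min = 0, max = 1)

# Simulated outcome: default is more likely with high debt and low income
p_default <- plogis(-1 + 3 * debt_ratio - 0.001 * (income - 3000))
default   <- rbinom(n, size = 1, prob = p_default)
loans     <- data.frame(default, income, debt_ratio)

# Model: reduce each observation to a few explanatory variables
model <- glm(default ~ income + debt_ratio, data = loans, family = binomial)

# Insight -> action: score a new application against a hypothetical cut-off
applicant <- data.frame(income = 2500, debt_ratio = 0.7)
p_hat     <- predict(model, newdata = applicant, type = "response")
decision  <- if (p_hat < 0.5) "approve" else "reject"
cat("Estimated default probability:", round(p_hat, 2), "->", decision, "\n")
```

In practice the cut-off would follow from the business context (cost of a default versus a missed loan), which is exactly where contact with the business comes in.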
This cycle, visualized in Figure 2.2 on page 10, highlights the importance of data-science. Data science is a way to bring the scientific method into a private company, so that decisions do not have to be based on gut-feeling alone. It is the role of the data scientist to take data, transform that data into information, and create understanding from that information that can lead to actionable insight. It is then up to the management of the business to decide on the actions and follow them through. The importance of staying connected to reality via contact with the business cannot be overstated. In each and every step, mathematics will serve as a set of tools, like screwdrivers and hammers. However, the choice of which one to use depends on a good understanding of what we are working with and what we are trying to achieve.
Figure 2.2: The role of data-science in a company is to take data and turn it into actionable insight. At every step, apart from the technical issues that will be discussed in this book, it is of utmost importance to understand the context and limitations of data, business, regulations and customers. To be effective in every step, one needs to pay attention to communication: permanent contact with all stakeholders and the environment is key.
Note
1 The term “singularity” refers to the point in time where an intelligent system would be able to produce an even more intelligent system, which in turn can create another system that is a certain percentage smarter in a time that is a certain percentage shorter. This inevitably leads to the exponentially accelerating creation of better systems. This time series converges to one point in time, where the “intelligence” of the machine would hit its absolute limits. The first record of the subject is by Stanislaw Ulam, in a discussion with John von Neumann in the 1950s, and an early and convincing publication is Good (1966). It is also elaborately explored in Kurzweil (2010).
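The convergence mentioned in this note is that of a geometric series, and can be checked numerically. The sketch below uses hypothetical numbers that are not from the note: a first generation that takes 10 years, and each following generation needing 80% of the time of its predecessor.

```r
# Hypothetical assumptions, for illustration only
first_generation <- 10    # years needed to build the first better system
speedup          <- 0.8   # each generation needs 80% of the previous time

# Time needed for each of the first 50 generations
generation <- 0:49
time_each  <- first_generation * speedup^generation

# Point in time at which each generation is finished
finished_at <- cumsum(time_each)

# Limit of the geometric series: t0 / (1 - q)
limit <- first_generation / (1 - speedup)

cat("After 50 generations:", round(finished_at[50], 4), "years\n")
cat("Limit of the series: ", limit, "years\n")
```

However many generations are added, the total time never exceeds the limit; that finite point in time is what the note calls the singularity.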
♣3♣ Conventions
This book is formatted with LaTeX. People who know this markup language will have high expectations for the consistency and format of this book. As you can expect, there is:
1 a table of contents at the start;
2 an index at the end, from page 1103;
3 a bibliography on page 1088;
4 as well as a list of all short-hands and symbols used on page 1117.
This is a book with a programming language as leitmotif and hence you might expect to find a lot of chunks of code. R is an interpreted language and it is usually accessed by opening the software R (simply type R on the command prompt and press enter).1
# This is code
1+pi
## [1] 4.141593
Sys.getenv(c("EDITOR","USER","SHELL", "LC_NUMERIC"))
##        EDITOR          USER         SHELL    LC_NUMERIC
##          "vi"        "root"   "/bin/bash" "pl_PL.UTF-8"
As you can see, the code is highlighted: not all things have the same colour, which makes it easier to read and understand what is going on. The first line is a “comment”, which means that R will not do anything with it; it is for human use only. The next line is a simple sum. In your R terminal, this is what you will type or copy after the > prompt. It will rather look like this:
> # This is code
> 1+pi
[1] 4.141593
> Sys.getenv(c("EDITOR","USER","SHELL","LC_NUMERIC"))
       EDITOR          USER         SHELL    LC_NUMERIC
         "vi"    "philippe"   "/bin/bash" "pl_PL.UTF-8"
>
In this book, there is nothing in front of a command and the reply of R is preceded by two pound signs: “##”.2 The pound sign (#) is also the symbol used by R to precede a comment, hence R will ignore this line if fed into the command prompt. This allows you to copy and paste lines or whole chunks if you are working from an electronic version of the book. If the > sign preceded the command, then R would not understand it; and if you accidentally copy the output from the book, nothing will happen because the #-sign tells R to ignore the rest of the line (this is a comment for humans, not for the machine).
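This convention can be verified in R itself. The sketch below is an illustration, not part of the book's workflow: it feeds a chunk in the book's format to the parser with parse() and eval(), and then tries the same command with a console prompt in front. The variable names are made up for this example.

```r
# A chunk as printed in this book: a comment, a command, and R's reply after "##"
book_chunk <- c("# This is code", "1+pi", "## [1] 4.141593")

# Comment lines (including the "##" output line) are skipped by the parser,
# so only the command 1+pi is evaluated
result <- eval(parse(text = book_chunk))
print(result)

# The same command copied from a console session, including the ">" prompt
console_line <- "> 1+pi"

# parse() raises a syntax error here; catch it and report FALSE instead
parses <- tryCatch({
  parse(text = console_line)
  TRUE
}, error = function(e) FALSE)
print(parses)
```

The first print() shows the value of 1+pi, while the second shows FALSE: the pasted output line is harmless, but a leading > prompt makes the line invalid R code.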
The function Sys.getenv() returns all environment variables if no parameter is given. If