Business Experiments with R. B. D. McCullough
1.1.3 Give two situations where experiments can't be conducted.
1.2 Case: Credit Card Defaults
You work for a credit card company, and you want to figure out which customers might default. The credit.csv dataset contains 30 000 observations on six variables: credit limit (how much can be charged on the credit card), sex of the cardholder, education level of the cardholder (high school, undergrad, grad, other), marital status of the cardholder (single, married, other), the age of the cardholder in years, and whether or not the cardholder defaulted (1 = default, 0 = non‐default).
In this problem we are confronted with the ultimate questions confronting all credit issuers: whether to grant credit to each potential customer and, if so, how much? Generally, we don't want to give credit to people who are likely to default, and if we do give credit, we don't want to give more than the person can repay.
Table 1.1 Credit default rates for men and women.
Default | Female | Male
---|---|---
0 | 14 349 (79%) | 9 015 (76%)
1 | 3 763 (21%) | 2 873 (24%)
Total | 18 112 | 11 888
A simple crosstab of the data (Table 1.1) shows that men are more likely to default than women (24% vs. 21%). Another crosstab (Table 1.2) shows that divorced/widowed (other) persons are the most likely to default (24%), followed closely by married persons (23%) and then single persons (21%).
Table 1.2 Credit default rates by marital status.
Default | Married | Single | Other
---|---|---|---
0 | 10 453 (77%) | 12 623 (79%) | 288 (76%)
1 | 3 206 (23%) | 3 341 (21%) | 89 (24%)
Total | 13 659 | 15 964 | 377
Try it!
We encourage you to replicate the analysis in this chapter using the data in the file credit.csv. Computing crosstabs can be done in a spreadsheet using pivot tables. Most statistical tools also have a cross‐tabulation function.
    df <- read.csv("credit.csv", header = TRUE)
    # Table 1.1
    table1 <- table(df$default, df$sex)  # to get the counts
    table1                               # to print out the table
    prop.table(table1, 2)                # to get column proportions
    prop.table(table1, 1)                # to get row proportions
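If the data file is not at hand, the percentages in Table 1.1 can still be checked from the printed counts. The sketch below types the four counts in by hand (the object names `table1_counts` and `col_props` are ours, not the book's) and applies the same `prop.table` call with `margin = 2` to get column proportions.

```r
# Counts copied from Table 1.1 (rows: default status, columns: sex).
# matrix() fills column by column: Female first, then Male.
table1_counts <- matrix(
  c(14349, 3763,   # Female: non-default, default
    9015, 2873),   # Male:   non-default, default
  nrow = 2,
  dimnames = list(default = c("0", "1"), sex = c("Female", "Male"))
)

# Column proportions: share of defaulters within each sex
col_props <- prop.table(table1_counts, margin = 2)

# Rounded to whole percentages, as printed in Table 1.1
round(100 * col_props)
```

The rounded result should agree with the percentages shown in parentheses in Table 1.1; the same approach works for Table 1.2 with a 2 x 3 matrix.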
In addition to the categorical variables in our data set like sex and marital status, we also have continuous variables like age. Perusing the boxplots in Figure 1.2, it appears that persons who do not default have higher credit limits than persons who default, while age appears to have no association with default status.
Figure 1.2 Boxplot of default vs. non‐default for credit limit and ages.
If it is really the case that persons with higher credit limits are less likely to default, can we decrease the default rate simply by giving everybody a higher credit limit?
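The boxplot alone cannot answer this question. As a purely illustrative sketch (a simulation, not the credit.csv data), suppose a hypothetical unobserved trait, here called `reliability`, drives both the limit the issuer grants and the chance of default, while the limit itself has no effect on default. Every variable name and number below is an assumption chosen for illustration.

```r
set.seed(1)
n <- 10000

# Hypothetical unobserved trait; NOT a column in credit.csv
reliability <- rnorm(n)

# Assumed issuer behaviour: more reliable customers get higher limits
limit <- 100000 + 50000 * reliability + rnorm(n, sd = 20000)

# Assumed default mechanism: depends on reliability only, not on limit
p_default <- plogis(-1.3 - 1.0 * reliability)
default <- rbinom(n, size = 1, prob = p_default)

# Observed association: average limit by default status
mean_limit <- tapply(limit, default, mean)
mean_limit
```

In this simulated world defaulters have lower limits on average, yet raising everyone's limit would leave the default rate unchanged, because the limit never enters `p_default`. An association like the one in Figure 1.2 is therefore compatible with the limit having no causal effect at all.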
Software Details
To reproduce Figure 1.2, load the data file credit.csv
…
    boxplot(limit ~ default, xlab = "default", ylab = "credit limit", data = df)
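Figure 1.2 also has a panel for age. A minimal sketch of drawing both panels side by side follows; it uses a small invented data frame so that it runs without credit.csv, and it assumes the columns are named `limit`, `age`, and `default` (only `limit` and `default` are confirmed by the call above).

```r
# Toy stand-in for credit.csv: values are invented for illustration only,
# and the column name `age` is an assumption.
df_toy <- data.frame(
  limit   = c(20000, 50000, 80000, 120000, 200000, 30000, 60000, 150000),
  age     = c(25, 31, 42, 38, 55, 29, 47, 36),
  default = c(1, 1, 0, 0, 0, 1, 0, 0)
)

# Two panels in one row, restoring the previous layout afterwards
op <- par(mfrow = c(1, 2))
b_limit <- boxplot(limit ~ default, data = df_toy,
                   xlab = "default", ylab = "credit limit")
b_age <- boxplot(age ~ default, data = df_toy,
                 xlab = "default", ylab = "age")
par(op)
```

With the real data, replacing `df_toy` by the data frame read from credit.csv should give the two panels, provided the column names match.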
We have thus far looked at how the four variables are associated with default, individually. How might we examine the effects of all the variables at one time in order to answer the two fundamental questions?
The answer, of course, is to use regression to relate default to all four variables at once. Since default is a categorical variable with two levels, linear regression is not appropriate. We would have to use logistic regression instead. As for the independent variables, credit limit and age are continuous and require no special treatment before being included in the regression (though it may be advantageous to turn each into a categorical variable with, say, categories “low,” “medium,” and “high”). Sex and marital status are categorical variables and will have to be included as dummy variables. If you are unfamiliar with the creation of dummy variables, sex can be represented by a single dummy variable, say,
Marital status (married, single, or divorced/widowed) will be represented by two dummy variables,
For a married person,