Fundamentals of Programming in SAS. James Blum
The default summary statistics for PROC MEANS can be changed by including statistic keywords as options in the PROC MEANS statement. The full set of available keywords is listed in the SAS Documentation, and any subset of them may be requested. The requested statistics replace the default set, and their columns appear in the table in the order in which the keywords are listed. One common set of statistics is the five-number summary (minimum, first quartile, median, third quartile, and maximum), and Program 2.4.3 provides a way to generate these statistics for the four variables summarized in the previous example.
Program 2.4.3: Setting the Statistics to the Five-Number Summary in MEANS
proc means data=BookData.IPUMS2005Basic min q1 median q3 max;
var Citypop MortgagePayment HHIncome HomeValue;
run;
Output 2.4.3: Setting the Statistics to the Five-Number Summary in MEANS
Variable | Minimum | Lower Quartile | Median | Upper Quartile | Maximum |
CITYPOP | 0 | 0 | 0 | 0 | 79561.00 |
MortgagePayment | 0 | 0 | 0 | 830.0000000 | 7900.00 |
HHIncome | -29997.00 | 24000.00 | 47200.00 | 80900.00 | 1739770.00 |
HomeValue | 5000.00 | 112500.00 | 225000.00 | 9999999.00 | 9999999.00 |
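As a further illustration of how the keyword list drives the table, the following is a minimal sketch that requests a different subset of statistics for two of the same variables. N, NMISS, MEAN, and STDDEV are standard PROC MEANS statistic keywords; the choice of this particular subset and order is an assumption made for illustration and is not taken from the book's programs.

```sas
/* Sketch: request a custom statistic set.                      */
/* The statistic columns should appear in keyword order:        */
/* N, N Miss, Mean, Std Dev.                                    */
proc means data=BookData.IPUMS2005Basic n nmiss mean stddev;
  var HHIncome HomeValue;
run;
```

Reordering the keywords (for example, placing MEAN first) reorders the columns without changing the values reported.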
Confidence limits for the mean are included in the keyword set, both as a pair with the CLM keyword and separately with LCLM and UCLM. The default confidence level is 95%, but it can be changed by setting the error rate with the ALPHA= option; the confidence level is 100(1 - ALPHA)%, so ALPHA=0.01 corresponds to 99% limits. Consider Program 2.4.4, which constructs 99% confidence intervals for the means and places the estimated mean between the lower and upper limits.
Program 2.4.4: Using the ALPHA= Option to Modify Confidence Levels
proc means data=BookData.IPUMS2005Basic lclm mean uclm alpha=0.01;
var Citypop MortgagePayment HHIncome HomeValue;
run;
Output 2.4.4: Using the ALPHA= Option to Modify Confidence Levels
Variable | Lower 99% CL for Mean | Mean | Upper 99% CL for Mean |
CITYPOP | 2887.19 | 2916.66 | 2946.12 |
MortgagePayment | 498.4385749 | 500.2042634 | 501.9699520 |
HHIncome | 63521.22 | 63679.84 | 63838.46 |
HomeValue | 2783250.94 | 2793526.49 | 2803802.04 |
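For comparison, the CLM keyword requests both limits as a pair rather than individually. The sketch below is an illustrative variation, not one of the book's numbered programs; the value ALPHA=0.10 (90% limits) and the restriction to a single variable are assumptions chosen only to show the syntax.

```sas
/* Sketch: CLM requests the lower and upper limits together.    */
/* ALPHA=0.10 sets the error rate, giving 90% confidence limits. */
proc means data=BookData.IPUMS2005Basic mean clm alpha=0.10;
  var HHIncome;
run;
```

Because CLM always produces the two limits as adjacent columns, LCLM and UCLM are the better choice when the mean should sit between them, as in Program 2.4.4.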
There are also options for controlling the column display; rounding can be controlled by the MAXDEC= option (maximum number of decimal places). Program 2.4.5 modifies the previous example to report the statistics to a single decimal place.
Program 2.4.5: Using MAXDEC= to Control Precision of Results
proc means data=BookData.IPUMS2005Basic lclm mean uclm alpha=0.01 maxdec=1;
var Citypop MortgagePayment HHIncome HomeValue;
run;
Output 2.4.5: Using MAXDEC= to Control Precision of Results
Variable | Lower 99% CL for Mean | Mean | Upper 99% CL for Mean |
CITYPOP | 2887.2 | 2916.7 | 2946.1 |
MortgagePayment | 498.4 | 500.2 | 502.0 |
HHIncome | 63521.2 | 63679.8 | 63838.5 |
HomeValue | 2783250.9 | 2793526.5 | 2803802.0 |
MAXDEC= is limited in that it sets the precision for all columns. Also, no direct formatting of the statistics is available. The REPORT procedure, introduced in Chapter 4 and discussed in detail in Chapters 6 and 7, provides much more control over the displayed table at the cost of increased complexity of the syntax.
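The all-columns behavior of MAXDEC= can be seen with a sketch such as the following, which mixes several statistics for two variables measured on very different scales. The statistic set and the value MAXDEC=0 are assumptions chosen for illustration.

```sas
/* Sketch: MAXDEC=0 applies one precision to every statistic    */
/* column and every analysis variable; it cannot be set         */
/* separately for, say, the mean and the standard deviation.    */
proc means data=BookData.IPUMS2005Basic mean std min max maxdec=0;
  var MortgagePayment HomeValue;
run;
```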
2.4.2 Using the CLASS Statement in PROC MEANS
In several instances, it is desirable to split an analysis across a set of categories and, if those categories are defined by a variable in the data set, PROC MEANS can separate those analyses using a CLASS statement. The CLASS statement accepts either numeric or character variables; however, the role assigned to class variables by SAS is special. Any variable included in the CLASS statement (regardless of type) is taken as categorical, which results in each distinct value of the variable corresponding to a unique category. Therefore, variables used in the CLASS statement should provide useful groupings or, as shown in Section 2.5, be formatted into a set of desired groups. Two examples follow, the first (Program 2.4.6) providing an illustration of a reasonable class variable, the second (Program 2.4.7) showing a poor choice.
Program 2.4.6: Setting a Class Variable in PROC MEANS
proc means data=BookData.IPUMS2005Basic;
class MortgageStatus;
var HHIncome;
run;
Output 2.4.6: Setting a Class Variable in PROC MEANS
Analysis Variable : HHIncome
MortgageStatus | N Obs | N | Mean | Std Dev | Minimum | Maximum |
N/A | 303342 | 303342 | 37180.59 | 39475.13 | -19998.00 | 1070000.00 |
No, owned free and clear | 300349 | 300349 | 53569.08 | 63690.40 | -22298.00 | 1739770.00 |
Yes, contract to purchase | 9756 | 9756 | 51068.50 | 46069.11 | -7599.00 | 834000.00 |
Yes, mortgaged/ deed of trust or similar debt | 545615 | 545615 | 84203.70 | 72997.92 | -29997.00 | 1407000.00 |
In this data, MortgageStatus provides a clear set of distinct categories and is potentially useful for subsetting the summarization of the data. In Program 2.4.7, Serial is used as an extreme example of a poor choice since Serial is unique to each household.
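The CLASS statement can be combined with the statistic keywords and the MAXDEC= option described in the previous section. The following is a minimal sketch under that assumption; the particular statistics and the addition of MortgagePayment as a second analysis variable are illustrative choices, not taken from the book's programs.

```sas
/* Sketch: statistic keywords and MAXDEC= apply within each     */
/* level of the class variable.                                 */
proc means data=BookData.IPUMS2005Basic n mean median maxdec=0;
  class MortgageStatus;
  var HHIncome MortgagePayment;
run;
```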
Program 2.4.7: A Poor Choice for a Class Variable
proc means data=BookData.IPUMS2005Basic;
class Serial;
var HHIncome;
run;
Output 2.4.7: A Poor Choice for a Class Variable (Partial Table Shown)
Analysis Variable : HHIncome
SERIAL | N Obs | N | Mean | Std Dev | Minimum | Maximum |
2 | 1 | 1 | 12000.00 | . | 12000.00 | 12000.00 |
3 | 1 | 1 | 17800.00 | . | 17800.00 | 17800.00 |
4 | 1 | 1 | 185000.00 | . | 185000.00 | 185000.00 |
5 | 1 | 1 | 2000.00 | . | 2000.00 | 2000.00 |
Choosing Serial as a class variable results in each class being a single observation, making the mean, minimum, and maximum the same value and creating a situation where the standard deviation is undefined. Again, this would be an extreme case; however, class variables are best when structured to produce relatively few classes that represent a useful stratification of the data.
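One way to turn a variable with many distinct values into a small number of classes is to apply a format, the approach deferred to Section 2.5. The sketch below is only a preview under stated assumptions: the format name IncomeGrp, its cut points, and its labels are hypothetical and do not come from the book. It relies on PROC MEANS grouping class variables by their formatted values.

```sas
/* Sketch: define a hypothetical format that collapses income   */
/* into three groups. The name, cut points, and labels are      */
/* assumptions made for illustration.                           */
proc format;
  value IncomeGrp
    low -< 25000   = 'Under 25,000'
    25000 -< 75000 = '25,000 to under 75,000'
    75000 - high   = '75,000 and over';
run;

/* With the format applied, each formatted value of HHIncome    */
/* defines one class instead of each distinct income value.     */
proc means data=BookData.IPUMS2005Basic;
  class HHIncome;
  format HHIncome IncomeGrp.;
  var MortgagePayment;
run;
```

Without the FORMAT statement, HHIncome would be nearly as poor a class variable as Serial, since most distinct income values would form their own class.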
Of course, more than one variable can be used in a CLASS statement; the categories are then defined as all combinations of the categories from the individual variables. The order of the variables listed in the CLASS statement only alters the nesting order of the levels; therefore, the same information is produced in a different row order in the table. Consider the two MEANS procedures in Program 2.4.8.
Program 2.4.8: Using Multiple Class Variables and Effects of Order
proc means data=BookData.IPUMS2005Basic nonobs n mean std;
class