Fundamentals of Programming in SAS. James Blum


      The default summary statistics for PROC MEANS can be modified by including statistic keywords as options in the PROC MEANS statement. Many statistics are available (the full set is listed in the SAS Documentation), and any subset of them may be requested. The requested keywords replace the default statistic set, and the order in which they are listed determines the order of the statistic columns in the table. One common set of statistics is the five-number summary (minimum, first quartile, median, third quartile, and maximum), and Program 2.4.3 provides a way to generate these statistics for the four variables summarized in the previous example.

      Program 2.4.3: Setting the Statistics to the Five-Number Summary in MEANS

      proc means data=BookData.IPUMS2005Basic min q1 median q3 max;
        var Citypop MortgagePayment HHIncome HomeValue;
      run;

      Output 2.4.3: Setting the Statistics to the Five-Number Summary in MEANS

Variable            Minimum   Lower Quartile      Median   Upper Quartile      Maximum
CITYPOP                   0                0           0                0     79561.00
MortgagePayment           0                0           0      830.0000000      7900.00
HHIncome          -29997.00         24000.00    47200.00         80900.00   1739770.00
HomeValue           5000.00        112500.00   225000.00       9999999.00   9999999.00
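
      To illustrate how the keyword list drives the column order, the following is a minimal sketch rather than one of the book's numbered programs. It assumes the same BookData library and data set as Program 2.4.3; the keyword selection (N, MEAN, MEDIAN, STD) and the two analysis variables are chosen here only for illustration.

```sas
/* Illustrative sketch: statistic columns should appear in the order the keywords are listed */
proc means data=BookData.IPUMS2005Basic n mean median std;
  var HHIncome HomeValue;
run;
```

      Listing MEDIAN ahead of MEAN in the statement would be expected to swap those two columns while leaving the computed values unchanged.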

      Confidence limits for the mean are included in the keyword set, both as a pair with the CLM keyword and separately with LCLM and UCLM. The default confidence level is 95%, but it can be changed by setting the error rate with the ALPHA= option; for example, ALPHA=0.01 yields 99% limits. Program 2.4.4 constructs 99% confidence intervals for the means, listing the estimated mean between the lower and upper limits.

      Program 2.4.4: Using the ALPHA= Option to Modify Confidence Levels

      proc means data=BookData.IPUMS2005Basic lclm mean uclm alpha=0.01;
        var Citypop MortgagePayment HHIncome HomeValue;
      run;

      Output 2.4.4: Using the ALPHA= Option to Modify Confidence Levels

Variable          Lower 99% CL for Mean          Mean   Upper 99% CL for Mean
CITYPOP                         2887.19       2916.66                 2946.12
MortgagePayment             498.4385749   500.2042634             501.9699520
HHIncome                       63521.22      63679.84                63838.46
HomeValue                    2783250.94    2793526.49              2803802.04
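
      The limits in Output 2.4.4 are consistent with the usual t-based interval for a mean. As a sketch, assuming no WEIGHT statement and the default variance divisor, and using notation introduced here rather than taken from the text (x-bar for the sample mean, s for the sample standard deviation, n for the number of nonmissing values, and alpha for the ALPHA= value), the interval can be written as:

```latex
\[
  \bar{x} \;\pm\; t_{1-\alpha/2,\; n-1}\,\frac{s}{\sqrt{n}}
\]
```

      With ALPHA=0.01, the multiplier is the 0.995 quantile of the t distribution with n - 1 degrees of freedom. The interval is symmetric about the mean, which the HHIncome row reflects: 63679.84 - 63521.22 = 158.62 and 63838.46 - 63679.84 = 158.62.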

      There are also options for controlling the column display; rounding can be controlled by the MAXDEC= option (maximum number of decimal places). Program 2.4.5 modifies the previous example to report the statistics to a single decimal place.

      Program 2.4.5: Using MAXDEC= to Control Precision of Results

      proc means data=BookData.IPUMS2005Basic lclm mean uclm alpha=0.01 maxdec=1;
        var Citypop MortgagePayment HHIncome HomeValue;
      run;

      Output 2.4.5: Using MAXDEC= to Control Precision of Results

Variable          Lower 99% CL for Mean        Mean   Upper 99% CL for Mean
CITYPOP                          2887.2      2916.7                  2946.1
MortgagePayment                   498.4       500.2                   502.0
HHIncome                        63521.2     63679.8                 63838.5
HomeValue                     2783250.9   2793526.5               2803802.0
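
      The CLM keyword mentioned earlier requests both limits as a pair. The following is a minimal sketch, assuming the same data set as Program 2.4.5. Listing MEAN before CLM is an illustrative choice made here; it places the mean column ahead of the two limit columns rather than between them.

```sas
/* Illustrative sketch: CLM requests the lower and upper confidence limits together */
proc means data=BookData.IPUMS2005Basic mean clm alpha=0.01 maxdec=1;
  var Citypop MortgagePayment HHIncome HomeValue;
run;
```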

      MAXDEC= is limited in that it sets the precision for all columns. Also, no direct formatting of the statistics is available. The REPORT procedure, introduced in Chapter 4 and discussed in detail in Chapters 6 and 7, provides much more control over the displayed table at the cost of increased complexity of the syntax.

      In several instances, it is desirable to split an analysis across a set of categories and, if those categories are defined by a variable in the data set, PROC MEANS can separate those analyses using a CLASS statement. The CLASS statement accepts either numeric or character variables; however, the role assigned to class variables by SAS is special. Any variable included in the CLASS statement (regardless of type) is taken as categorical, which results in each distinct value of the variable corresponding to a unique category. Therefore, variables used in the CLASS statement should provide useful groupings or, as shown in Section 2.5, be formatted into a set of desired groups. Two examples follow, the first (Program 2.4.6) providing an illustration of a reasonable class variable, the second (Program 2.4.7) showing a poor choice.

      Program 2.4.6: Setting a Class Variable in PROC MEANS

      proc means data=BookData.IPUMS2005Basic;
        class MortgageStatus;
        var HHIncome;
      run;

      Output 2.4.6: Setting a Class Variable in PROC MEANS

Analysis Variable : HHIncome
MortgageStatus                                   N Obs        N       Mean    Std Dev     Minimum      Maximum
N/A                                             303342   303342   37180.59   39475.13   -19998.00   1070000.00
No, owned free and clear                        300349   300349   53569.08   63690.40   -22298.00   1739770.00
Yes, contract to purchase                         9756     9756   51068.50   46069.11    -7599.00    834000.00
Yes, mortgaged/ deed of trust or similar debt   545615   545615   84203.70   72997.92   -29997.00   1407000.00

      In this data, MortgageStatus provides a clear set of distinct categories and is potentially useful for subsetting the summarization of the data. In Program 2.4.7, Serial is used as an extreme example of a poor choice since Serial is unique to each household.

      Program 2.4.7: A Poor Choice for a Class Variable

      proc means data=BookData.IPUMS2005Basic;
        class Serial;
        var HHIncome;
      run;

      Output 2.4.7: A Poor Choice for a Class Variable (Partial Table Shown)

Analysis Variable : HHIncome
SERIAL   N Obs   N        Mean   Std Dev     Minimum     Maximum
2            1   1    12000.00         .    12000.00    12000.00
3            1   1    17800.00         .    17800.00    17800.00
4            1   1   185000.00         .   185000.00   185000.00
5            1   1     2000.00         .     2000.00     2000.00

      Choosing Serial as a class variable results in each class being a single observation, making the mean, minimum, and maximum the same value and creating a situation where the standard deviation is undefined. Again, this would be an extreme case; however, class variables are best when structured to produce relatively few classes that represent a useful stratification of the data.
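
      One way to turn a variable with many distinct values into a few classes is to apply a format, the approach the text defers to Section 2.5. The following is only a sketch of that idea, not the book's program: the format name PayGrp, its cut point of 1000, and the labels are hypothetical choices made here for illustration.

```sas
/* Hypothetical format: collapses mortgage payments into three groups */
proc format;
  value PayGrp
    0          = 'None'
    0<-1000    = '1,000 or less'
    1000<-high = 'More than 1,000';
run;

/* CLASS levels are formed from the formatted values, so at most three classes are expected */
proc means data=BookData.IPUMS2005Basic;
  class MortgagePayment;
  var HHIncome;
  format MortgagePayment PayGrp.;
run;
```

      Without the FORMAT statement, each distinct payment amount would form its own class, which is closer to the Serial example above than to a useful stratification.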

      Of course, more than one variable can be used in a CLASS statement; the categories are then defined as all combinations of the categories from the individual variables. The order of the variables listed in the CLASS statement only alters the nesting order of the levels; therefore, the same information is produced in a different row order in the table. Consider the two MEANS procedures in Program 2.4.8.

      Program 2.4.8: Using Multiple Class Variables and Effects of Order

      proc means data=BookData.IPUMS2005Basic nonobs n mean std;

      class
