Fundamentals of Programming in SAS. James Blum

that include the tab, each must use a hexadecimal representation. For example, DLM='2C09'x selects commas and tabs as delimiters, since 2C is the hexadecimal value for a comma and 09 is the hexadecimal value for a tab. For records with different delimiters within the same DATA step, see Chapter Note 7 in Section 2.12.
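
      As a minimal sketch of the hexadecimal form, the following steps write two records that mix commas and tabs to a temporary file and then read them back with DLM='2C09'x. The fileref Demo, the data set name, and the data values are hypothetical and serve only as an illustration. The hexadecimal codes assume an ASCII-based system.

```sas
/* Hypothetical temporary file used only for this illustration */
filename demo temp;

/* Write two records that use both a comma and a tab as delimiters */
data _null_;
  file demo;
  put 'Alabama,73' '09'x '4';
  put 'Alaska,20' '09'x '1';
run;

/* '2C'x is the comma and '09'x is the tab (ASCII), so either one separates values */
data work.mixedDelims;
  infile demo dlm = '2C09'x;
  input state $ countyFips metro;
run;

proc print data = work.mixedDelims;
run;
```

      Each record should produce three values: a character value for State and numeric values for CountyFips and Metro.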

      While delimited data takes advantage of delimiting characters in the data, other files depend on the starting and stopping position of the values being read. These types of files are referred to by several names: fixed-width, fixed-position, and fixed-field, among others. The first five records from a fixed-position file (IPUMS2005Basic.dat) are shown in Input Data 2.8.8. As with Input Data 2.8.4, truncation of this display occurs due to the length of the record—now occurring in each of the five records.

      Input Data 2.8.8: Excerpt from a Fixed-Position Data File

----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+
       2 Alabama              Not in identifiable city (or size group)     0   4  73
       3 Alabama              Not in identifiable city (or size group)     0   1   0
       4 Alabama              Not in identifiable city (or size group)     0   4  73
       5 Alabama              Not in identifiable city (or size group)     0   1   0
       6 Alabama              Not in identifiable city (or size group)     0   3  97

      Since fixed-position files do not use delimiters, reading a fixed-position file requires knowledge of the starting position of each data value. In addition, either the length or the stopping position of the data value must be known. Using the ruler, the first displayed field, Serial, appears to begin and end in column 8. However, inspection of the complete raw file reveals that this is only the case for the single-digit values of Serial. The longest value is eight digits wide, so the variable Serial truly starts in column 1 and ends in column 8. Similarly, the next field, State, begins in column 10 and ends in column 29. Some text editors, such as Notepad++ and Visual Studio Code, show the column number in a status bar as the cursor moves across lines in the file.
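
      Column positions can also be inspected from within SAS. The sketch below uses the LIST statement, which writes each input record to the log beneath a column ruler. The two records are hypothetical stand-ins for the raw file, chosen to show a one-digit and an eight-digit value of Serial.

```sas
/* Hypothetical fixed-position records; the values only illustrate the layout */
data _null_;
  input;   /* load each record into the input buffer without creating variables */
  list;    /* write the record to the log beneath a column ruler */
  datalines;
       2 Alabama
12345678 Alabama
;
```

      Comparing the records against the ruler in the log shows that the shorter value occupies only column 8 while the longer value fills columns 1 through 8.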

      The DATA step for reading fixed-position data looks similar to the DATA step for reading delimited data, but there are several important modifications. For fixed-position files, the syntax of the INPUT statement provides information about column positions of the variable values in the raw file, as it cannot rely on delimiters for separating values. Therefore, delimiter-modifying INFILE options such as DSD and DLM= have no utility with fixed-position data. Two different forms of input are commonly used for fixed-position data: column input and formatted input. This section focuses on column input while Chapter 4 discusses formatted input.

      Column Input

      Column input takes advantage of the fixed positions in which variable values are found by directly placing the starting and ending column positions into the INPUT statement. Program 2.8.8 shows how to use column input to read the IPUMS CPS 2005 basic data. The results of Program 2.8.8 are identical to Output 2.8.5.

      Program 2.8.8: Reading Data Using Column Input

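      data work.ipums2005basicFPa;
        infile RawData('ipums2005basic.dat');
        /* Column input: each variable is followed by its start-end columns. */
        /* Character lengths come from the number of columns read, so State  */
        /* (10-29) gets 20 bytes. Numeric variables keep the 8-byte default. */
        input serial 1-8 state $ 10-29 city $ 31-70 cityPop 72-76
              metro 78-80 countyFips 82-84 ownership $ 86-91
              mortgageStatus $ 93-137 mortgagePayment 139-142
              HHIncome 144-150 homeValue 152-158;
      run;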

       The LENGTH statement is no longer needed—when using column input, SAS assigns the length based on the number of columns read if the length attribute is not previously defined. Here, SAS assigns State a length of 20 bytes, just as was done in the LENGTH statement in Program 2.8.5.

       The first value indicates the column position—31—from which SAS should start reading for the current variable, City. The second number—70—indicates the last column SAS reads to determine the value of City.

       The default length of eight bytes is still used for numeric variables, regardless of the number of columns.
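
      These length rules can be checked with PROC CONTENTS. The following is a minimal sketch that assumes a hypothetical shortened layout (Serial in columns 1-8, State in columns 10-29, Metro in columns 31-33) with in-stream records rather than the full raw file.

```sas
/* Hypothetical records laid out like the first columns of the raw file */
data work.lengthDemo;
  input serial 1-8 state $ 10-29 metro 31-33;
  datalines;
       2 Alabama                4
       3 Alabama                1
;

/* The variable list should report State as character with length 20 and
   both numeric variables with length 8 */
proc contents data = work.lengthDemo;
run;
```

      Metro is read from only three columns, yet it is still stored in eight bytes because the default numeric length does not depend on the columns read.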

      Beyond the differences between column input and list input shown in Program 2.8.8, since column input uses the column positions, the INPUT statement can read variables in any order, and can even reread columns if necessary. Furthermore, the INPUT statement can skip unwanted variables. Program 2.8.9 reads Input Data 2.8.8 and demonstrates the ability to reorder and reread columns.

      Program 2.8.9: Reading the Input Variables Differently than Column Order

      data work.ipums2005basicFPb;
        infile RawData('ipums2005basic.dat');
        /* Variables are read in a user-specified order, not column order. */
        input serial 1-8 hhIncome 144-150 homeValue 152-158
              ownership $ 86-91 ownershipCoded $ 86   /* column 86 is read twice */
              state $ 10-29 city $ 31-70 cityPop 72-76
              metro 78-80 countyFips 82-84
              mortgageStatus $ 93-137 mortgagePayment 139-142;
      run;

      proc print data = work.ipums2005basicFPb(obs = 5);
        var serial -- state;   /* all variables from Serial through State, inclusive */
      run;

       Output 2.8.9 shows that HHIncome and HomeValue are now earlier in the data set. Column input allows for reading variables in a user-specified order.

       Column 86 is read twice: first as part of a full value for Ownership, and second as a simplified version using only the first character as the value of a new variable, OwnershipCoded.

       As discussed in Chapter Note 3 in Section 1.7, the double-dash selects all variables between Serial and State, inclusive.

      Output 2.8.9: Reading the Input Variables Differently than Column Order

Obs    serial    hhIncome    homeValue    ownership    ownershipCoded    state
  1         2       12000      9999999    Rented       R                 Alabama
  2         3       17800      9999999    Rented       R                 Alabama
  3         4      185000       137500    Owned        O                 Alabama
  4         5        2000      9999999    Rented       R                 Alabama
  5         6       72600        95000    Owned        O                 Alabama
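
      Program 2.8.9 reorders and rereads columns but does not skip any. A minimal sketch of skipping follows, again assuming a hypothetical shortened layout (Serial in columns 1-8, State in columns 10-29, Metro in columns 31-33) and in-stream records.

```sas
/* Hypothetical shortened layout: Serial 1-8, State 10-29, Metro 31-33 */
data work.skipDemo;
  /* State (columns 10-29) is never named, so it is not read */
  input serial 1-8 metro 31-33;
  datalines;
       2 Alabama                4
       3 Alabama                1
;

proc print data = work.skipDemo;
run;
```

      The resulting data set should contain only Serial and Metro, since column input reads just the column ranges named in the INPUT statement.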

      Mixed Input

      Programs 2.8.1 through 2.8.7 use simple list input for every variable, and Programs 2.8.8 and 2.8.9 use column input for every variable. However, the choice is not always one or the other. If a file contains some delimited fields while other fields have fixed positions, it is necessary to use multiple input styles simultaneously. This process, called mixed input, requires mastery of two other input methods covered in Chapter 3, modified list input and formatted input, along with a substantial understanding of how the DATA step processes raw data. For a discussion of the fifth and final input style, named input, see the SAS Documentation.
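
      As a limited illustration of the idea, the sketch below combines only the two styles shown so far, column input and simple list input, rather than the modified list and formatted styles covered later. The layout and values are hypothetical: State occupies fixed columns 1-14 and may contain embedded blanks, while the two numeric fields that follow are separated by blanks.

```sas
/* Hypothetical records: State is in fixed columns 1-14 and may contain
   embedded blanks; the two numbers after it are separated by blanks */
data work.mixedDemo;
  /* Column input for State, then simple list input for the rest */
  input state $ 1-14 metro countyFips;
  datalines;
New York       4 61
Alabama        1 0
North Carolina 3 119
;

proc print data = work.mixedDemo;
run;
```

      After State is read from columns 1-14, list input continues from column 15 and scans for the next nonblank values, so the embedded blank in a value such as New York does not split the field.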

      This section provides further details about how the DATA step functions. While this material can initially be considered optional for many readers, understanding it makes writing high-quality code easier by providing a foundation for how certain coding decisions lead to particular outcomes. This material is also essential for successful completion of
