Fundamentals of Programming in SAS. James Blum

data sources begins by exploring how to read raw data files.

      Raw data refers to certain files that contain unprocessed data that is not in a SAS data set. (Certain other structures also qualify as raw data. See Chapter Note 6 in Section 2.12 for additional details.) Generally, these are plain-text files and some common file types are:

       tab-delimited text (.txt or .tsv)

       comma-separated values (.csv)

       fixed-position files (.dat)

      Choices for file extensions are not fixed, so the extension does not dictate the type of data the file contains, and many other file types exist. Therefore, it is always important to explore the raw data before importing it to SAS. While SAS provides multiple ways to read raw data, this chapter focuses on using the DATA step due to its flexibility and ubiquity; understanding the DATA step is a necessity for a successful SAS programmer.

      To help locate column numbers, the book displays a ruler as the first line of each raw data listing; the ruler is not present in the actual data file. Input Data 2.8.1 provides an example of such a ruler. Each dash in the ruler represents one column, a plus sign marks a column that is a multiple of five but not of ten, and a digit marks a multiple of ten. For example, the 1 in the ruler represents column 10 in the raw file and the plus sign between the 1 and the 2 represents column 15 in the raw file.

      Input Data 2.8.1: Space Delimited Raw File (Partial Listing)

----+----1----+----2----+
1 1800 9998 9998 9998
2 480 1440 9998 9998
3 2040 360 100 9998
4 3000 9998 360 9998
5 840 1320 90 9998
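      A comparable ruler can be produced in the SAS log when exploring a raw file before reading it. The following is a minimal sketch, assuming the same placeholder path used in this chapter. The OBS=5 limit on the INFILE statement is an assumption added here only to keep the log short.

```sas
/* Sketch: echo the first few raw records to the log with a column ruler. */
/* The path is a placeholder and must be completed before running.        */
data _null_;                        /* _NULL_ : no data set is created     */
  infile "--insert path here--\Utility 2001.prn" obs=5;
  input;                            /* load the record into the input buffer only */
  list;                             /* write the buffered record to the log;      */
                                    /* SAS prints a ruler line before the records */
run;
```

      Because no variables are named in the INPUT statement, this step only displays the records; it does not depend on knowing the file layout in advance.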

      Delimiters, often used in raw files, are a single character such as a tab, space, comma, or pipe (vertical bar) used to indicate the break between one value and the next in a single record. Input Data 2.8.1 includes a partial representation of the first five records from a space-delimited file (Utility 2001.prn). Reading in this file, or any raw file, requires determining whether the file is delimited and, if so, what delimiters are present. If a file is delimited, it is important to note whether the delimiters also appear as part of the values for one or more variables. The data presented in Input Data 2.8.1 follows a basic structure and uses spaces to separate each record into five distinct values or fields. SAS can read this file correctly using simple list input without the need for additional options or statements using the following rules:

      1. At least one blank/space must separate the input values and SAS treats multiple, sequential blanks as a single blank.

      2. Character values cannot contain embedded blanks.

      3. Character variables are given a length of eight bytes by default.

      4. Data must be in standard numeric or character format. Standard numeric values must only contain digits, decimal point, +/-, and E for scientific notation.
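      Rules 1 and 3 can be seen in a small sketch that uses in-stream data rather than an external file. The data set name, variable names, and values below are hypothetical and chosen only for illustration.

```sas
/* Sketch: simple list input with in-stream (hypothetical) data. */
data ListInputDemo;
  input Name $ Score;     /* Name is character (default length 8), Score is numeric */
  datalines;
Ann      91
Bartholomew 78
;
run;

proc print data = ListInputDemo;
run;
```

      The run of blanks after Ann is treated as a single delimiter, so Score is read as 91. Because Name has the default length of eight bytes, the second value is stored as Bartholo.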

      Input Data 2.8.1 satisfies these rules using the default delimiter (space). Options and statements are available to help control the behavior associated with rules 1 through 3, which are covered in subsequent sections of this chapter. Violating rule 4 precludes the use of simple list input but is easily addressed with modified list input, as shown in Chapter 3. However, no such options or modifications are required to read Input Data 2.8.1, which is done using Program 2.8.1.

      Program 2.8.1: Reading the Utility 2001 Data

      data Utility2001;

      infile "--insert path here--\Utility 2001.prn";

      input Serial$ Electric Gas Water Fuel;

      run;

      proc print data = Utility2001 (obs=5);

      run;

       The DATA statement begins the DATA step and here names the data set as Utility2001, placing it in the Work library given the single-level naming. Explicit specification of the library is available with two-level naming, for example, Sasuser.Utility2001 or Work.Utility2001—see Program 1.4.3. If no data set name appears, SAS provides the name as DATAn, where n is the smallest whole number (1, 2, 3, …) that makes the data set name unique.

       The INFILE statement specifies the location of the file via a full path specification to the file—this path must be completed to reflect the location of the raw file for the code to execute successfully.

       The INPUT statement names each variable read from the raw file specified in the INFILE statement; the names must follow the conventions outlined in Section 1.6.2. By default, SAS assumes the incoming variables are numeric. One way to indicate character data is shown here: place a dollar sign after each character variable.

       Good programming practice dictates that all steps end with an explicit step boundary, including the DATA step.

       The OBS= option selects the last observation for processing. Because procedures start with the first observation by default, this step uses the first five observations from the Utility2001 data set, as shown in Output 2.8.1.

      Output 2.8.1: Reading the Utility 2001 Data (Partial Listing)

Obs  Serial  Electric   Gas  Water  Fuel
  1  1           1800  9998   9998  9998
  2  2            480  1440   9998  9998
  3  3           2040   360    100  9998
  4  4           3000  9998    360  9998
  5  5            840  1320     90  9998
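      The two-level naming and the OBS= behavior described above can be sketched as follows, assuming Program 2.8.1 has already been run in the current session. The FIRSTOBS= data set option is not used in Program 2.8.1; it is added here only to show that OBS= identifies the last observation processed rather than a count of observations.

```sas
/* Sketch: explicit library reference plus FIRSTOBS= and OBS= together. */
/* Assumes Work.Utility2001 exists from Program 2.8.1.                  */
proc print data = Work.Utility2001 (firstobs=3 obs=5);
run;
```

      This step starts at observation 3 and stops after observation 5, so three observations are printed, not five.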

      In Program 2.8.1, Serial is read as a character variable; however, it contains only digits and therefore can be stored as numeric. The major advantage in storing Serial as character is size—its maximum value is six digits long and therefore requires six bytes of storage as character, while all numeric variables have a default size of eight bytes. The major disadvantage to storing Serial as character is ordering—for example, as a character value, 11 comes before 2. While the other four variables can be read as character as well, it is a very poor choice as no mathematical or statistical operations can be done on those values. For examples in subsequent sections, Serial is read as numeric.
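      Reading Serial as numeric requires only removing the dollar sign from the INPUT statement. The following is a minimal sketch under the same placeholder path; the data set name Utility2001Num is hypothetical and used here so that the result of Program 2.8.1 is not overwritten.

```sas
/* Sketch: same file as Program 2.8.1, with Serial read as numeric. */
data Utility2001Num;                                  /* hypothetical name */
  infile "--insert path here--\Utility 2001.prn";    /* complete the path */
  input Serial Electric Gas Water Fuel;               /* no $ : all five are numeric */
run;

/* Check the type and length SAS assigned to each variable. */
proc contents data = Utility2001Num;
run;
```

      The PROC CONTENTS listing should report all five variables as numeric with the default length of eight bytes.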

      In Program 2.8.1, the INFILE statement specifies the raw data file that the DATA step reads. In general, the INFILE statement may include references to a single file or to multiple files, with each reference provided in one of the following ways:

       A physical path to the files. Physical paths can be either relative or absolute.

       A file reference created via the FILENAME statement.

      Program 2.8.1 is set up to use the first method, with either an absolute or relative path chosen. An absolute path starts with a drive letter or name, while any other specification is a relative path. All relative paths are built from the current working directory. (Refer to Section 1.5 for a discussion of the working directory and setting its value.) It is often more efficient to use a FILENAME statement to build references to external files or folders. Programs 2.8.2 and 2.8.3 demonstrate these uses of the FILENAME statement, producing the same data set as Program 2.8.1.

      Program 2.8.2: Using the FILENAME Statement to Point to an Individual File

      filename Util2001 "--insert path here--\Utility 2001.prn";

      data
