﻿ University of Manitoba - Faculty of Medicine - Community Health Sciences - Manitoba Centre for Health Policy - I. Data Preparation: Types of Data Sets
I. Data Preparation: Types of Data Sets

This section provides both general guidelines and specific details on preparing data sets (both SAS and non-SAS). For the data sets referenced in this manual, detailed instructions are provided in other sections on how to prepare:

Data sets can be thought of as a table having columns and rows, and consisting of three main components:

1. Values - Numbers and/or letters of the alphabet comprising the information in each cell (column/row combination).
2. Variables - Names assigned to columns of information; currently, they can be up to 32 characters long (starting with a letter or underscore).
3. Observations (Records) - Usually one line, or row, of information per person or event; each observation (also known as a record) consists of a set of values.

Illustrated below, on the left, is the raw data set htwt. The viewtable on the right shows how the data become meaningful once the appropriate information on the data has been specified to SAS.

M and F, for example, are the values for the variable SEX; the next column represents the values for the variable AGE. Each observation is now consecutively numbered, in this case, from 1 to 18. The first observation thus has a value of Aubrey for the variable called NAME, a value of M for the variable SEX, a value of 41 for the variable AGE, a value of 74 for the variable HEIGHT, and a value of 170 for the variable WEIGHT.

In other words, the first observation is an individual named Aubrey who is a 41-year-old male who is 6'2" tall and weighs 170 pounds. (Normally a codebook will specify the units in which the values are being measured. In this case, height, for example, is measured in inches and weight is measured in pounds.)

The values for any given variable will have the following characteristics:

• Length. Numeric values are stored in SAS as floating-point, or real binary, numbers. According to the SAS Language Reference (1990:86), floating point representation is "a form of storing in scientific notation" ("in which values are represented as numbers between 0 and 1 times a power of 10") "except that on most operating systems the base is not 10, but is either 2 or 16".

SAS assigns a default length of 8 bytes of space to numeric variables and, where space permits, this need not be reduced. Cody and Pass (1995:276) indicate that "this does not mean 8 significant figures; it means that 8 times 8, or 64 bits (8 bits per byte) are used to store the number". They add that 8 bytes "is equivalent to what used to be called 'double-precision' in other languages. This will vary not only by which computer language you are using, but on which computer and under what operating system you are running".

4 bytes of space is generally sufficient for most numeric variables, but a SAS Institute manual should generally be consulted before changing the lengths of numeric variables because of the potential loss of precision.

• Numeric vs character values. If calculations are not necessary (e.g., values of 1 and 2 for GENDER), it is recommended that such numeric values be assigned as character values (in the INPUT statement when reading a raw data set, or using a PUT statement within a DATA step when accessing a SAS data set). Conversion, where possible, to character decreases space requirements. Values for the variable GENDER, for example, if numeric, can take up to 8 bytes of space, but assigning it as a character value decreases its size to 1 byte. References to all character values in a program must be enclosed in single quotes (e.g., if gender='1';).

• Missing values. Numeric missing values are denoted with a period (.) while character missing values are denoted with a space in quotes ('  ').

Analyses using SAS software requires the data to be in the form of a SAS data set. If the data are in this form, no data preparation is needed; the data set can be easily viewed and explored using a SAS program to create either: a) a permanent or b) a temporary SAS data set. A permanent SAS data set is normally created if it is known which subsets are being used for analysis (e.g., if only bypass procedure is of interest, only the hospital records that actually contain this surgery would be needed). Temporary SAS data sets, on the other hand, last for the duration of the SAS session, and are useful when developing and debugging programs. Non-SAS data sets require additional preparation because they need to be converted to SAS data sets.

A. PREPARING SAS DATA SETS

To create a temporary SAS data set from another temporary SAS data set requires only the DATA and SET statements and the one-word name of the data set. To read or to create a permanent SAS data set requires a third statement - LIBNAME - to tell SAS where to find the data set.

• LIBNAME sasref 'c:\sasdir'; tells SAS that the user has chosen sasref as the name to represent a directory on c: drive called sasdir. That is, LIBNAME tells SAS where the permanent SAS data set is located (or to be located).

• DATA two; (to create a temporary SAS data set) OR
DATA sasref.two; (to create a permanent SAS data set).

The DATA statement is required to tell SAS to create another, new, SAS data set from the the data set specified in the SET statement following.

• SET one; (to read/access a temporary SAS data set) OR
SET sasref.one; (to read/access a permanent SAS data set)

For permanent SAS data sets, the SET statement tells SAS: a) the name assigned to the directory, or location, of the data set (sasref) and b) the name of the permanent SAS data set (one). For temporary SAS data sets only the one-word file name need be specified. (SAS will automatically assign the temporary SAS data set to the WORK library for the duration of the SAS session, but the temporary data set can be referred to without specifying this library.)

The SET statement thus tells SAS to read (or access) the data set, loading the information into memory so that the user can view or manipulate it. Any changes made to the data set specified in the SET statement will be reflected in the data set specified in the DATA statement, NOT in the original data (unless it is being saved with exactly the same name).

These keywords are illustrated in the accompanying SAS program examples.

It can be possible to create a large number of temporary SAS data sets in the course of a SAS session, but generally it is desirable to conserve space. If the reason for creating the new data set is to create more variables, for example, the same data set name can be used (e.g., data one; set one;). This will simply overwrite the previous data set. Subsetting the data (e.g., keeping only age 65+) will also conserve space. Assigning a new name in this case (e.g., data age65p; set one;) will permit the user to access either data set during the SAS session.

Note that two options can be useful when creating permanent SAS data sets; both are placed in the DATA statement; for example: data sasref.new (compress=yes label='Simulated MB Health data');

• compress - removes the extra space used by non-filled or partially filled variables. This option can reduce the size of a data set quite substantially, but note that in certain cases, it can actually increase its size. The log will provide this type of information so that any necessary adjustments can be made.
• label - permits adding a brief description of the data set; this information would then be seen in output generated by PROC CONTENTS.
Additional information on reducing the space taken up by SAS data sets is available from the MCHP document titled "Saving Space in SAS".
B. PREPARING NON-SAS DATA SETS

The non-SAS data set, which can be converted to a temporary or permanent SAS data set, can take one of two forms: I) an ASCII file or II) a file generated by another software package.

I. A file of ASCII (raw) data

This may look something like the following:

`12 38 8 01127522 18 9 00008831 0 4 100`

These numbers cannot be meaningfully manipulated unless the user is given additional information to tell SAS the variable names, their locations, and whether they should be read in as numeric or character. The SAS program must use an INPUT statement to provide SAS with this information, as well as FILENAME and INFILE statements (the Simulated Manitoba Health data set is an example of this type of data set):

• A FILENAME statement is necessary to specify which file contains the raw data. The statement FILENAME rawref 'c:\sasref awdata'; provides both the location (sasref directory) and the file containing the raw data (rawdata).

• An INFILE statement is used to indicate that a raw data set should be read in by SAS, as specified in the FILENAME statement (INFILE is used to read raw data, while SET is used to read in SAS data). INFILE rawref; tells SAS to read the raw data set, name and location as described in the FILENAME statement.

• An INPUT statement is used to provide SAS with variable names, column numbers, and numeric/character information. For example, INPUT age 14-15 gender \$20; indicates that age is two columns, or spaces, in width, starting at column 14 (this would not accommodate ages with 3 digits, i.e., ages>99), and that gender is 1 column wide, located at column 20. The dollar sign denotes that gender is to be read in as a character variable while the absence of a dollar sign indicates that age is to be read in as a numeric variable.

Data values can also be located within a SAS program. In this case, a CARDS keyword is necessary to signal SAS that raw data values are to be read in within the program. The Height/Weight data set and the Clinical data set illustrate this approach.

II. A file generated by another software package

While ASCII files and input statements are the most commonly used method for importing data into SAS, many other programs store information in a format that SAS can read. These programs often keep information on variable type, length, and format. SAS can access information stored in formats saved by other programs in several ways.

1. Using delimiters in a 2-step process

a) Convert the external file to ASCII (text) format.

The external file from the native program is saved in ASCII format, with a special character or delimiter between each field (variable). Quote/comma and tab delimited are the most common types of delimiter. For example, delimited ASCII files can be saved from Excel 2002 by selecting 'Save as' and saving the file with a type of Text (Tab delimited)(*.txt).

b) Import the file into SAS.

• If using a SAS program, the delimiting character is defined in the INFILE statement, e.g.:
```     FILENAME RAWFILE 'C:\SASREF\RAWDATA' ;

DATA NEW ;
INFILE RAWFILE DELIMITER='09'x  ;
** '09'x stands for the tab
character in hexidecimal format,
the delimiter=dsd option can be used
for quote/comma separated files;
INPUT ONE TWO ;
```
• If using a graphical interface, select Import Data from the SAS File menu and select User Defined file format: EFI or select one of the pre-defined files under Standard data source, i.e., Delimited, Comma Separated Values, or Tab Delimited File.

2. Direct Access Using Engines

A format-specific engine on a LIBNAME statement will permit reading some data file formats directly. Generally files must be saved in a general or portable format prior to importing them into SAS. SPSS is used here as an example but other engines are available. Prior to importing an SPSS file it must first be saved as an SPSS Portable format (.por) file from within SPSS (note that the SPSS engine does not work under SAS 6.12 and Solaris (UNIX)). A SAS program could then be used to read in the file, for example:

```    LIBNAME IMPORT SPSS 'c:	emp\spssfile.por' ;
** Note: file must be in SPSS portable format ;

DATA TEST ;
SET IMPORT._FIRST_ ;
** _first_ is the first, and only, data set
in the library, or directory.  ;

RUN ;
```

or, PROC CONVERT can be used with the following syntax:

```     FILENAME IMPTSP 'c:	emp\spssfile.por' ;
PROC CONVERT SPSS=IMPTSP DATA=TEST ;
RUN;
```

3. Importing data into SAS

PROC IMPORT may be used to import a data set, or a data table, from a variety of different sources. In some cases, the specific rows and columns will have to be defined; in other cases whole tables can be imported. PROC IMPORT can be used to import files created from such programs as MS Access (ACCESS), DBase (DBF), Lotus (WK1, WK3, WK4), and Excel (EXCEL, EXCEL4, EXCEL5, EXCEL97, EXCEL 2002). ASCII delimited files can also be imported as delimited files (DLM, CSV, TAB). The Access component of SAS software must be installed and available for each file type.

Data can be imported:

• from an MS Access database:
``` PROC IMPORT OUT= temp
DATATABLE= "Base"
DBMS=ACCESS  REPLACE;
** DBMS defines the database or file type;
DATABASE="C:\database.mdb";
RUN;```
• from an Excel database:

PROC IMPORT OUT= htwt2
DATAFILE= "x:\sasdir\htwt.xls"
DBMS=EXCEL2002 REPLACE;
GETNAMES=YES;
RUN;

• the SAS menu, i.e., this procedure can be used interactively through the File/Import/Standard Data Source option.

 << Previous Index Next >>