IV. Data Manipulation: Basic Techniques

SAS provides for many optional statements and keywords that can be used in SAS programs to facilitate manipulation and display of the data. Statements can often (but not always) be entered in any order within DATA and PROC steps, while options must usually be placed in a specific position within a SAS statement. The data= option, for example, can be added to most procedures to specify the data set on which it should be run, e.g., proc freq data=test;. Unless the data set is specified (in this case, "test"), SAS will automatically go to the most recently created data set.

Two broad categories of statements/keywords are described here - those that can be used to: a) create subgroups of data, and b) customize display of output. A SAS program incorporating the use of these statements/keywords follows.

A. Create Subgroups of Data

Analysis and space requirements will often dictate whether to create separate SAS data sets (temporary or permanent) for analysis or to simply split the output by the desired values of a variable. If all analysis is to be conducted on individuals age 65+, for example, it might be desirable to create a separate, permanent SAS data set, removing all records having an age of less than 65. Not only is the data set being tailored to meet analysis needs, program efficiency is also enhanced by reducing the amount of time and space being used to carry out SAS runs - extremely important if computer resources are limited in terms of both physical and memory space.

Three approaches to creating subgroups of data are described here, specifying 1) data values, 2) variable names, or 3) number of observations to reduce the data set.

  1. Specify data values with the WHERE or IF statements to keep a subset of records, or observations. Although only the observations containing the specified value(s) will be kept, all other variables associated with these observations will also be kept.

    where age>=65; This can be used within a PROC or DATA step to keep only those records having an age of 65 or older within a DATA step.

    if age>=65; This can only be used in a DATA step; however, the advantage of using IF is that a number of conditions can be specified:

    • if age>=65 and gender='F'; This statement tells SAS to keep all females who are 65 years of age and older.
    • if age>=65 or gender='F'; This statement tells SAS to keep all females and all people (both male and female) who are 65 years of age and older.

    IF statements can be used on variables created in the current data step. WHERE statements can only be applied to variables that pre-exist in the data.

  2. Specify variable names using the KEEP keyword/statement (to keep selected variables) or the DROP keyword/statement (to drop selected variables) to create a subset of variables.

    e.g., data new;
           set test (keep=regionre los);
           run;

    This above is an an example of using KEEP as a keyword; it keeps 2 variables from test in the "new" data set as the data set is being read in (the other variables, however, are still accessible for other data steps). (Alternatively, the KEEP statement can be placed at the end of the DATA statement, but this is less efficient because all variables in the test data set will be processed.)

    e.g., data test2;
           set test;
           drop drg drgrgn drgw;
           run;

    The KEEP and DROP keywords can also be used as a statement. In this case, only the DROP keyword is used and 3 variables are dropped from the new test2 data set. This statement is often used when new variables have been created from existing variables, and the existing variables are no longer necessary for analysis.

    e.g., input age 7-9 regionre 51 deathsep 55-58;

    When reading in raw data (e.g., the simulated Manitoba Health data), only the variables necessary to the analysis need to be read in.

    Note that use of the DROP and KEEP keywords only affect variables; they do not affect the number of observations.

  3. Specify number of observations with the (OBS=) option to keep a subset of records.

    e.g., data test;
           set test (obs=10);
           run;

    This option is very useful when testing/debugging SAS programs or portions of code. It is easily removed, and the program can be re-submitted to obtain output for all the observations. The following illustrates how the option can also be used when reading in rawdata.

    e.g., data test;
           infile rawfile (obs=10);


B. Customize Display of Output

Output can be enhanced in a number of ways, only a few of which are presented here: 1) LABEL, 2) FORMAT, 3) TITLE, and 4) FOOTNOTE statements.

  1. Change how variable information is displayed by using a LABEL statement.

    e.g., label height = 'Height in inches'
            weight = 'Weight in inches';

    This code will attach labels to height and weight so that the labels rather than just the variable names will be displayed in any output. Labels currently can be up to 256 characters long, and must be enclosed in single quotes (double quotes if there is an apostrophe in the label). The LABEL statement usually goes in the DATA step, near the end, and is very helpful for explaining what the variables represent, particularly if other users will be accessing the data.

  2. Change how data values are displayed by using the FORMAT statement.

    e.g., format gender $genderl. regionre regionh $regionl.;

    This FORMAT statement assumes that formats called $gender and $regionl have been created (using PROC FORMAT). The variable(s) is specified first, and then the format, which always has a period at the end of it (at least one space between each). The $ denotes a character format; it can only be used with character variables. The FORMAT statement will result, for example, in the values of the variables regionre and regionh being displayed as full region names rather than '1' through '8' although certain SAS procedures may truncate the formatted values from the maximum allowable 32 characters long to 8 or 16 characters in length).

    The FORMAT statement can be used within a PROC or DATA step. If used within a DATA step, the format is applied in all procedures referencing the data set. If used within a PROC step (generally preferred), the format is only applied for that specific procedure. The original, underlying values are NOT permanently changed; only how they appear in output is changed.

    Note that SAS also has its own library of formats that are available throughout any SAS session.

  3. Enhance output by adding a title(s) using the TITLE statement.

    e.g., title1 'An example of a title'; can be used within a PROC or DATA step, or by itself, to place a title on each page of subsequent output. Note that:

    • Titles are enclosed in single quotes; anything within the quotes is not processed by SAS. Double quotes are used if the title contains any apostrophes.
    • Up to 10 titles can be added, using up to 10 TITLE statements. Each additional title is numbered, e.g., title2 'An example of a 2nd title';
    • TITLE statements are global statements, which means that they are displayed on every subsequent page of output in a SAS session. They can be either overwritten (by specifying new TITLE statements) or cleared (by specifying in the program e.g., title1;, to clear a TITLE1 statement, title2; to clear a TITLE2 statement, and so on).

     

  4. Enhance output by adding a footnote(s) using the FOOTNOTE statement.

    e.g., footnote1 'An example of a footnote'; can be used within a PROC or DATA step, or by itself, to place a footnote on each page of subsequent output. The above notes for the TITLE statement also apply to the FOOTNOTE statement.


*******************************************************
*This program creates two data sets "men" and "women" *
*and generates 2 versions of tables showing the       *
*distribution of name separately for males and for    *
*females over age 40. It assumes that the data set    *
*"htwt" has been previously created. It also assumes  *
*that labels have not yet been created for the        *
*variables.                                           *
*******************************************************;


proc format;          /*Create label format for sex*/
  value $sexl 'M'='Male'
              'F'='Female';

/*Create 2 new temporary SAS data sets*/

data men women;

        /*Read in the "htwt" data set, keeping only
           3 of the variables and the first 10 records*/
  set htwt (keep=sex name age obs=10);

       /*For the "men" data set keep only the records
          that have a value of "M" for sex */
  if sex='M' then output men;

     /*For the "women" data set keep only the records
        that have a value of "F" for sex */
      /*(Note that records missing values for sex
            would not go into either data set) */ 
  else if sex='F' then output women;

  /*Create labels for the variables being kept*/
  label name = 'Name of individual'
        age    = 'Age at admission'
        sex = 'Gender of patient';

run;

 proc freq data=men;
   tables name;
   where age>=40;
   format sex $sexl.
  title1 'Example re use of basic techniques';
  title2 'The Student Project';
  footnote1 'Limiting age to 40+';
run;

 proc freq data=women;
   tables name;
   where age>=40;
   format sex $sexl.;
run;

***********************************************;
*Instead of the 2 PROC steps used above, the  *
*same analysis could be done as follows using *
*PROC SORT and BY variable processing.        *
***********************************************;

/*Sort the data by sex prior to creating 
tables that are split by sex */

proc sort data=htwt;
  by sex; 
run;

proc freq data=htwt;

  /* Create a table of distribution of name */
  tables name;  

  /* Do this separately for each value of sex*/
  by sex;    

   /* Do this only for age 40+*/
 where age>=40;  

    /* Display formatted values*/
  format sex $sexl. ;

  title1 'Example re use of basic techniques';
  title2 'The Student Project';
  footnote1 'Limiting age to 40+';

run;

PRACTICE QUESTIONS ON DATA MANIPULATION (BASIC)

These questions assume that a permanent SAS data set has been created from the sample Clinical data set. The format file does not need to be included. Examples are given for how program, log, and output might look.

  1. Create 3 separate temporary SAS data sets for each of the following and carry out any SAS procedures that will indicate only the specified data was kept.

    • only pregnant females.
    • only data on blood pressure and heart rate data, along with the id number (for the whole sample).
    • only the first 15 observations.
  2. Display the distribution of pregnant by vitamins. Re-submit this table after adding the following:

    • A new label for the vitamins variable: Patient on vitamin therapy.
    • New labels for the values of pregnant: label "1" as 3+ mos.pregnant and "0" as LT 3 mos. pregnant
    • A title showing the source of data and a footnote reflecting the type of exercise.

 << Previous

 Index

 Next >>