IV. Data Manipulation: Create New Variables

In this tutorial, new variables are created within a DATA step; they are included in the new data set named in the DATA statement. In the following example, the temporary SAS data set addvar will contain two new variables, newvar and newvar2:

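    data addvar;
       set test;
       /* 1 when regionre equals 'A', otherwise 0 */
       newvar=(regionre='A');
       /* character version of age; the format agefmt. must already be defined */
       newvar2=(put(age,agefmt.));
    run;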
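
The DATA step above relies on a format named agefmt., whose definition is not shown in this section. As a minimal sketch only, the program below defines a hypothetical version of that format and a small hypothetical test data set so that the step can be run end to end; the age groupings, labels, and data values are assumptions made for illustration.

```sas
/* Hypothetical age groupings; the real agefmt. may differ */
proc format;
   value agefmt
      low-17  = 'Child'
      18-64   = 'Adult'
      65-high = 'Senior';
run;

/* Hypothetical test data with one character and one numeric variable */
data test;
   input regionre $ age;
   datalines;
A 12
B 45
A 70
;
run;

data addvar;
   set test;
   newvar=(regionre='A');      /* numeric 1/0 indicator */
   newvar2=put(age,agefmt.);   /* character variable holding the formatted value */
run;

proc print data=addvar;
run;
```

Note that PUT always returns a character value, so newvar2 is a character variable even though age is numeric.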

When creating new variables, several guidelines are important:

  • Numeric vs character. Always determine whether the existing variables are character or numeric, because this affects how their values are referenced: character values must be enclosed in quotes (e.g., regionre='A'), whereas numeric values are not.

  • Naming variables. A recommended practice is to always give new variables new names. This enables others who may be using the data to feel confident that the original names represent the original variables. This convention also provides a way of checking the original variable against newly created ones to ensure their accuracy. 

  • Repetitive tasks. This is not currently a factor in the simulated Manitoba Health data, but alternative approaches are available for repetitive tasks in SAS programming. For example, if the same processing has to be applied to the same kind of variable (e.g., diagnosis) stored in 16 fields, or variables (e.g., DIAG01 to DIAG16), a DO loop combined with an ARRAY statement is most useful.
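
The DO loop and ARRAY approach mentioned in the last guideline might look roughly like the sketch below. The data set names (claims, flagged), the new variable (dx250), and the diagnosis code '250' are hypothetical and chosen only for illustration; only DIAG01 to DIAG16 come from the text above.

```sas
/* Hypothetical data: 16 character diagnosis fields, only two filled in here */
data claims;
   length diag01-diag16 $5;
   input id diag01 $ diag02 $;   /* diag03-diag16 remain blank */
   datalines;
1 250 401
2 410 .
3 . 250
;
run;

data flagged;
   set claims;
   /* Group the 16 fields so they can be processed by position */
   array dx{16} $ diag01-diag16;
   dx250=0;
   do i=1 to 16;
      /* The same check is applied to every diagnosis field */
      if dx{i}='250' then dx250=1;
   end;
   drop i;   /* the loop counter is not needed in the output */
run;
```

Without the array, the same check would have to be written out 16 times, once for each diagnosis variable.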

Two broad categories of statements for creating new variables are illustrated here: 1) IF/THEN statements, and 2) assignment statements. One of the differences between these two categories is where the new variable name is placed. In IF/THEN statements, the new variable is specified at the end of the statements that refer to the existing variable. The new variable name is followed by the equal ("=") sign and the value(s) to be assigned for the new variable. In assignment statements, the new variable is referenced at the beginning of the SAS statement, followed by the "=" sign, and then the existing variable(s).
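
As a small sketch of this difference in placement, the program below creates the same 1/0 indicator twice, once with IF/THEN statements and once with an assignment statement. The data values and the new variable names (region_if, region_as) are hypothetical.

```sas
/* Hypothetical test data */
data test;
   input regionre $ age;
   datalines;
A 12
B 45
;
run;

data compare;
   set test;

   /* IF/THEN: the existing variable comes first, the new variable last */
   if regionre='A' then region_if=1;
   else region_if=0;

   /* Assignment: the new variable comes first, the existing variable last */
   region_as=(regionre='A');
run;
```

Both variables hold the same values here, because a comparison in parentheses evaluates to 1 when true and 0 when false.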

Descriptions and programs are provided for each of the two categories, illustrating their use on the height/weight data set. Program 1 compares and contrasts the use of IF/THEN statements with an assignment statement that uses the PUT function. It also illustrates the use of an assignment statement to create a dichotomous variable. Program 2 illustrates the use of two other types of assignment statements, one using arithmetic operators and another using the SAS substring function, SUBSTR.
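
The two kinds of assignment statement covered by Program 2 can be sketched as follows. This is not Program 2 itself: the data set (htwt), its variables, the units (centimetres and kilograms), and the values are all assumptions made for illustration.

```sas
/* Hypothetical height/weight data: height in cm, weight in kg */
data htwt;
   input name $ height weight;
   datalines;
Alfred 175 70
Barbara 160 55
;
run;

data htwt2;
   set htwt;
   /* Arithmetic operators: convert height to metres, then compute weight / height squared */
   height_m=height/100;
   bmi=weight/(height_m**2);

   /* SUBSTR: extract 1 character of name, starting at position 1 */
   initial=substr(name,1,1);
run;
```

SUBSTR takes the source variable, the starting position, and the number of characters to extract, and it returns a character value.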

PRACTICE QUESTIONS ON DATA MANIPULATION (NEW VARIABLES)

These questions assume that a permanent SAS data set has been created from the sample Clinical data set, including the format file. Examples are given for how program, log, and output might look.

  1. Calculate a new variable (bpratio) that represents the ratio of systolic to diastolic blood pressure. Round it to one decimal place. Produce a frequency distribution of the new variable.

  2. Assuming that the 2-digit diagnosis for the variable prim_dx can be meaningfully collapsed to a 1-digit diagnosis, create a new variable (prim_sub) that contains only the 2nd digit. Check the new variable against the values of the original variable (using PROC FREQ with the LIST and MISSING options).

  3. Create a new blood pressure variable (bpnorm) that simply denotes normal/not normal using a dichotomous assignment statement based on both readings of blood pressure. Consider the norm for diastolic to be 60 to 90 and for systolic to be 100 to 140; the norm must be present for both variables. Check the new variable (which will have 1/0 values) against the values of the two original variables.

  4. Create two new heart rate variables (rateif and rateput), each of which groups the same values of heart rate into 3 categories: low (less than 70), moderate (70-85), and high (86 and over). Use IF/THEN statements to create one variable, and the PUT function to create the other. In addition to creating the grouping format required for the latter, create a labeling format for the 3 different groups. Produce frequency distributions (labeling the new values) for the 2 variables. They should be identical; however, the distributions differ, which illustrates the importance of identifying missing values before creating new variables and deciding how to deal with them.

