How to create macro variable lists using PROC SQL?

 

We’ve created macro variable lists using PROC SQL, both horizontal and vertical lists. Now we’re going to talk about creating those same types of lists using the DATA step. Using the DATA step is a little bit more complex than using PROC SQL, but there are times when it can be advantageous because you have access to the full power and flexibility of the DATA step. First, let’s talk about how to create a horizontal macro variable list in a DATA step. I have this code split across two slides. This is step 1. And this step is optional, but I’m assuming that we want only a list of distinct values.

Since we don’t have the convenience of the DISTINCT keyword like we do in PROC SQL, we can use PROC SORT to get just the unique values of origin. That’s the variable that we’re going to put into our macro variable list in this case. So we do a quick PROC SORT. DATA equal help. cars, OUT equal– and I created a new data set called unique_origins, and I only kept the variable origin. And here’s the key– I used the keyword NODUPKEY which eliminates duplicate key values. In this case, the key is the origin. That’s our BY variable.

This will create a data set that has just the variable origin and just three rows with those three values of origin which are Asia, Europe, and the USA. Once we’ve done this, now we can use a DATA step to build our macro variable list. And we’re simply going to step through this data set and build the list piece by piece. So we start with DATA _null_ because we’re not really interested in creating an output data set here. We have a SET statement, unique_origins– that’s the data set that we just created in PROC SORT– END equals of. love is just a variable name that I made up, but the END equals option on the SET statement will set this variable eof to true only once the last record has been read from unique_origins.

It’ll be false until that point in time. We’ll need that in a little bit. This LENGTH statement simply declares that a new variable called origin list is being created with a length of 200. Now, this is not a macro variable. This is a data set variable– origin_list. It also appears on this RETAIN statement because we’re going to build this origin list across multiple iterations of the DATA step as we read in values from the unique_origins data set. So we need to be able to retain the values across iterations of the DATA step. That’s why we have this RETAIN statement here. Then we have the workhorse of this DATA step– an assignment statement, origin_list equals, and on the right-hand side of the equals we have a call to the CATX function.

Now for those who aren’t familiar with the CATX function, it simply concatenates strings, but it inserts a delimiter. And that’s the first argument of the function delimiter– in this case, tilde. So the CATX function will take these other arguments– origin_list and origin– it will strip them, that is removing trailing and leading blanks, insert this delimiter in between, and concatenate them. Now, the first time through the loop here– the first one through the DATA step when we read in the first record– the origin_underscore list is missing. It’s blank, there’s nothing in it. origin has the first value that we read in from the data set, which was Asia. So when we call CATX, if we give it a missing string it will not in a certain delimiter. It’s smart enough to not do that, which is great. Makes it very convenient.

What we’ll end up with is CATX of missing and Asia will just be Asia. So the first time through, origin_list equals Asia. Then we have this if EOF THEN DO. Well, eof is false because we haven’t read the last record yet. If it’s false, we don’t do any of this stuff in the DO block down. Here and we just we’ve finished our trip through the DATA step, and now we loop back to the top and read another record. We read the second record from unique_origins, which is Europe. And now origin_list equals CATX of origin_list, which has Asia from before– and remember, it’s still there because we use the RETAIN statement– Asia, the tilde gets inserted as a delimiter, and Europe is the new value we just read in.

Now origin_list will have Asia, tilde, Europe. We’re still not at the end of the data set, so this is false. We go back to the top and read the third and final record from unique_origins, which is going to have a value of USA. So now origin_list equals CATX of– and origin_list still has Asia, tilde, Europe in it. And now we got another tilde as a delimiter and then the value USA. So now we’re going to have Asia, tilde, Europe, tilde, USA in a DATA step variable called origin_list. We haven’t created a macro variable yet. This is just a data set variable.

Now eof is true. We are at the end of the file. And since we’re at the end of the file we are going to execute these two states that are in the DO loop– the DO block, I should say. So we have a CALL SYMPUTX– remember, CALL SYMPUTX is what we use to create macro variables during a DATA step execution. We’re going to create a macro variable called origin_list, and what we’re going to put in it is the value of the DATA step variable origin_list. That’s the variable we just created. So this is a little confusing because these have the same name, but these are two different things.

We’re creating a macro variable called origin_list, and we’re putting in it the value of the DATA step variable called origin_list. Then we’re going to create another macro variable called numorigins, and into it, we’re going to put the value _n_. _n_ is an automatic variable that is created for us by the DATA step, and it simply counts the number of iterations that the DATA step is executed, which you could typically think of as a counter of which record we’re on, assuming we’re just reading in one record per loop. So at this point in time _n_ is 3, because we’re on our third trip through the implied loop of the DATA step. So the number 3 is going to get stored in numorigins.

We’ve created two macro variables here– one called origin_list which has Asia, tilde, Europe, tilde, USA, and one called numorigins which has a 3 in it. For exercise 5 we’re going to create a horizontal macro variable list using a DATA step. Once again we’re going to deal with the help.class data set, and we’re going to take the names of all the students separated by tildes and put them into a single macro variable called student_list. And then we’ll also store the number of students in another macro variable called num_students. And we can skip the PROC SORT step for this exercise because we know the student names are unique. There’s only one record per student name in the data set, so we don’t need to worry about removing the duplicates.

Leave a Comment