3

Resampling and permutation tests in SAS

 3 years ago
source link: https://blogs.sas.com/content/iml/2014/11/21/resampling-in-sas.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Resampling and permutation tests in SAS

10

My colleagues at the SAS & R blog recently posted an example of how to program a permutation test in SAS and R. Their SAS implementation used Base SAS and was "relatively cumbersome" (their words) when compared with the R code. In today's post I implement the permutation test in SAS/IML. This provides an apples-to-apples comparison because both SAS/IML and R are matrix-vector languages.

This permutation test is a simple resampling exercise that could be assigned as a homework problem in a classroom. If you are at a college or university, remember that SAS/IML is available for free for all academic users through the SAS University Edition.

Permutation tests in SAS/IML

The analysis was motivated by a talk about using computational methods to illuminate statistical analyses. The data are the number of mosquitoes that were attracted to human volunteers in an experiment after each volunteer had consumed either a liter of beer (n=25) or water (n=18). The following statements assign the experimental data to two SAS/IML vectors and compute the observed difference between the means of the two groups:

proc iml;
G1 = {27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20};
G2 = {21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20};
obsdiff = mean(G1) - mean(G2);
print obsdiff;

The experimenters observed that, on average, people who drank beer attracted 4.4 more mosquitoes than people who drank water. The statistical question is, "What is the probability of observing a difference of this magnitude (or bigger) by chance if the beverages have no effect?" You can answer this question by using a permutation test to perform a nonparametric version of

the t test. The null hypothesis is that there is no difference between the mean number of mosquitoes that were attracted to each experimental group (beer or water).

The permutation test enables you to generate the null distribution. Draw 25 random observations from the data and assign them to Group 1; assign the other 18 observations to Group 2. Compute the difference between the means of each group. Repeat these two steps many times to approximate the null distribution. The following SAS/IML statements use the SAMPLE function in SAS/IML to permute the data. The permutation step is repeated 9,999 times so that (adding in the original data order) there are a total of 10,000 permutations of the data:

call randseed(12345);                             /* set random number seed */
alldata = G1 // G2;                        /* stack data in a single vector */
N1 = nrow(G1);  N = N1 + nrow(G2);
NRepl = 9999;                                     /* number of permutations */
nulldist = j(NRepl,1);                   /* allocate vector to hold results */
do k = 1 to NRepl;
   x = sample(alldata, N, "WOR");                       /* permute the data */
   nulldist[k] = mean(x[1:N1]) - mean(x[(N1+1):N]);  /* difference of means */
end;
 
title "Histogram of Null Distribution";
refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);";
call Histogram(nulldist) other=refline;

The histogram shows the distribution of mean differences that were computed under the assumption of the null hypothesis. The observed difference between the beer and water groups (the vertical red line at 4.38) is way off in the tail. Since the null hypothesis is not a likely explanation for the observed difference, we reject it. We conclude that mosquitoes are attracted differently to the two groups (beer and water).

If you would like to compute the empirical p-value for the null distribution, that is easily accomplished:

pval = (1 + sum(abs(nulldist) >= abs(obsdiff))) / (NRepl+1);
print pval;

t_permutationtest2

Vectorization for permutation tests

Regular readers of my blog know that I advocate vectorizing programs whenever possible. Matrix-vector languages such as SAS/IML, R, and MATLAB work more efficiently when computations inside loops are replaced by vector or matrix computations.

Because of the way that SAS/IML loops are compiled and optimized, using loops in the SAS/IML language is not as detrimental to performance as in some other languages. For example, the previous permutation test code runs in about 0.04 seconds on my PC from 2009. Still, I like to promote vectorization because it can be important to performance.

The following statements eliminate the DO loop and implement the resampling and permutation test in two lines of SAS/IML code. The vectorized computation runs in about one-fourth the time:

x = sample(alldata, N//NRepl, "WOR");               /* create all resamples */
nulldist = x[,1:N1][,:] - x[,(N1+1):N][,:]; /* compute all mean differences */

The vectorized computation uses the colon (:) subscript reduction operator in SAS/IML to compute the mean of the first 25 and the last 18 elements for each set of permuted data.

Additional references for resampling in SAS

To learn more about efficient ways to implement resampling methods such as bootstrapping and permutation tests, consult the following references:

  • For information about bootstrapping in SAS/IML, see pages 14–17 of Wicklin (2008).
  • For another permutation test example, see pages 11–14 of Wicklin (2012).
  • Chapter 15 of my book Simulating Data with SAS describes resampling methods in both Base SAS and SAS/IML. I include useful tips and techniques that make bootstrapping in Base SAS less cumbersome, including a mention of the %BOOT and %BOOTCI macros, which enable you to implement bootstrap methods in Base SAS by using only a few program statements.
  • For an excellent introduction to resampling methods in Base SAS, see Cassell (2007).

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK