36383 - Randomly assign the observations in a data set to two or more groups

source link: https://support.sas.com/kb/36/383.html

SAS® 9.4 TS1M1 or later

Beginning with SAS/STAT® 13.1 in SAS 9.4 TS1M1, the GROUPS= option in the PROC SURVEYSELECT statement randomly assigns observations to groups. If you specify a number of groups, the observations are divided among the groups as equally as possible. You can also specify different group sizes for the random assignments in the GROUPS= option.

For example, suppose you want to divide the ten observations in the following data set into three groups.

      data one;
        do x=1 to 10;
          output;
        end;
        run;

Specifying the GROUPS=3 option in PROC SURVEYSELECT divides the ten observations into three groups as evenly as possible. The results of this example can be reproduced by specifying the same value in the SEED= option.

      proc surveyselect data=one groups=3 seed=49201 out=RandomGroups noprint;
        run;
      proc freq data=RandomGroups;
        tables GroupID;
        run;

Group ID Number

GroupID   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1         3           30.00     3                      30.00
2         3           30.00     6                      60.00
3         4           40.00     10                     100.00
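
The GROUPS= option also accepts a parenthesized list of group sizes in place of a single number. The following sketch assumes that the sizes must sum to the number of observations in the input data set (2 + 3 + 5 = 10 here). The sizes, the seed, and the output data set name UnequalGroups are illustrative choices, not part of the original example.

      /* Sketch: assign the ten observations to groups of sizes 2, 3, and 5. */
      /* The sizes, seed, and data set name UnequalGroups are illustrative.  */
      proc surveyselect data=one groups=(2 3 5) seed=49201 out=UnequalGroups noprint;
        run;
      proc freq data=UnequalGroups;
        tables GroupID;
        run;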

Releases before SAS® 9.4 TS1M1

Prior to SAS/STAT 13.1, you can use PROC SURVEYSELECT to randomly divide a data set into two groups as described in this note. For more than two groups, you can use PROC PLAN to randomly assign each observation to a group such that the groups are of equal size, or as equal as possible when the data set is not evenly divisible by the number of groups.

For example, suppose you want to divide the ten observations in data set ONE (above) into three groups. The following statements create data set A, which consists of four sets of three observations (twelve in all). Each set contains a random arrangement of the values 1, 2, and 3. Because three groups are desired, specify GROUP=3. Three sets would cover only nine observations, so to accommodate ten observations you need four sets; specify SET=4. The results of this example can be reproduced by specifying the same value in the SEED= option.

      proc plan seed=4233;
        factors set=4 group=3 / noprint;
        output out=a;
        run;
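
The number of sets can also be computed rather than counted by hand: it is the number of observations divided by the number of groups, rounded up. The following is a minimal sketch of that calculation. The macro variable name NSETS is an assumption introduced for illustration.

      /* Sketch: derive SET= from the size of data set ONE.                 */
      /* NSETS is a hypothetical macro variable name used for illustration. */
      data _null_;
        if 0 then set one nobs=n;           /* NOBS= is available without reading any rows */
        call symputx('nsets', ceil(n/3));   /* ceil(10/3) = 4 sets of three */
        stop;
        run;
      proc plan seed=4233;
        factors set=&nsets group=3 / noprint;
        output out=a;
        run;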

In the following DATA step, the RANUNI function adds a random number between 0 and 1 to each observation. As before, using the same seed reproduces the results of this example. The IF statement stops the step at the eleventh observation, which removes the two extra observations created by PROC PLAN.

      data a; 
        set a; 
        random=ranuni(2342);
        if _n_>10 then stop;
        run;
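
In current SAS releases, the RAND function with CALL STREAMINIT is generally recommended over the older RANUNI function. The following sketch could be used in place of the DATA step above. It is a substitution, not the method used in this note, and because it uses a different random-number generator, it produces a different ordering from the results shown here.

      /* Sketch: same step using RAND('UNIFORM') instead of RANUNI. */
      data a;
        set a;
        if _n_=1 then call streaminit(2342);   /* set the seed once */
        random=rand('uniform');                /* uniform random number between 0 and 1 */
        if _n_>10 then stop;                   /* drop the two extra observations */
        run;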

Sorting by the random variable randomizes the group numbers across the entire data set.

      proc sort data=a; 
        by random;
        run;

The final data set consisting of the ten observations with assigned group numbers is created by merging the randomized data set of group numbers with the original data set.

      data RandomGroups;
        merge one a;
        run;
      proc print;
        id x;
        var group;
        run;
x group
1 1
2 2
3 2
4 3
5 2
6 3
7 1
8 3
9 2
10 1

PROC FREQ can be used to verify the sizes of the groups.

      proc freq data=RandomGroups;
        tables group;
        run;

group   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1       3           30.00     3                      30.00
2       4           40.00     7                      70.00
3       3           30.00     10                     100.00

Note that groups 1 and 3 each have three observations and group 2 was randomly given a fourth observation. The group assignment for each observation is completely random.

Unknown Number of Observations

Suppose you want one observation from each consecutive set of G observations to be randomly assigned to each group, where G is the number of groups. This is often desired when the total number of observations is not initially known. In this example, if you did not know how many observations you would end up with, you might want to randomly assign the first three observations, one to each group, and continue to do the same for each set of three observations as they become available. Do this by specifying a sufficiently large value for SET in the PLAN step above and omitting the DATA and SORT steps that follow.

Suppose you have twelve subjects and want to assign them to three groups. You expect an unknown number of additional subjects to become available, and they will also need to be randomly assigned. When all observations are collected, you want the group sizes to be as equal as possible. The following statements produce random assignments, in sets of three, for up to 10 × 3 = 30 observations. As subjects become available after the 12th, they can be assigned to groups according to the plan. Within each additional set of three observations, one observation is randomly assigned to each group.

      data one;
        do id=1 to 12;
          output;
        end;
        run;
      proc plan seed=58349;
        factors set=10 group=3 / noprint;
        output out=a;
        run;
      data RandomGroups;
        merge one a;
        run;
      proc print;
        id id;
        var group;
        run;
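
The MERGE step above keeps all 30 rows of the plan, so the rows beyond the twelfth have a missing ID until more subjects are enrolled. If only the subjects currently in data set ONE are wanted, one option is to read the two data sets with separate SET statements, because such a step ends when the shorter data set is exhausted. The data set name Enrolled is an assumption for illustration.

      /* Sketch: keep only the subjects enrolled so far. */
      data Enrolled;
        set one;   /* the step ends when ONE runs out of observations */
        set a;     /* the next row of the plan, read in parallel */
        run;
      proc print data=Enrolled;
        id id;
        var group;
        run;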

If more than 30 observations become available, simply run PROC PLAN again to generate more randomized sets of three.

      proc plan seed=39352;
        factors set=10 group=3 / noprint;
        output out=a;
        run;
      proc print noobs;
        var group;
        run;

Operating System and Release Information

Product Family: SAS System
Product: SAS/STAT
SAS Release (Reported, Fixed*): none listed
System:
z/OS
OpenVMS VAX
Microsoft® Windows® for 64-Bit Itanium-based Systems
Microsoft Windows Server 2003 Datacenter 64-bit Edition
Microsoft Windows Server 2003 Enterprise 64-bit Edition
Microsoft Windows XP 64-bit Edition
Microsoft® Windows® for x64
OS/2
Microsoft Windows 95/98
Microsoft Windows 2000 Advanced Server
Microsoft Windows 2000 Datacenter Server
Microsoft Windows 2000 Server
Microsoft Windows 2000 Professional
Microsoft Windows NT Workstation
Microsoft Windows Server 2003 Datacenter Edition
Microsoft Windows Server 2003 Enterprise Edition
Microsoft Windows Server 2003 Standard Edition
Microsoft Windows Server 2008
Microsoft Windows XP Professional
Windows Millennium Edition (Me)
Windows Vista
64-bit Enabled AIX
64-bit Enabled HP-UX
64-bit Enabled Solaris
ABI+ for Intel Architecture
AIX
HP-UX
HP-UX IPF
IRIX
Linux
Linux for x64
Linux on Itanium
OpenVMS Alpha
OpenVMS on HP Integrity
Solaris
Solaris for x64
Tru64 UNIX

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

