Sample data with implicit stratification — sample

sample_implicit draws a random sample of n units from a data.frame in a way that maximizes variation on variables of interest. For example, it can randomly sample schools so that the sampled schools vary as much as possible on key characteristics, like the percent of students of color or average achievement. Implicit stratification is a common method for sampling units in educational settings: NCES frequently uses this approach when deciding whom to survey or test.

Usage

sample_implicit(data, n, ..., size_var = NULL, random_num = 1)

Arguments

data: is the data.frame on which rows will be sampled
n: is the number of rows to be sampled
...: are the variables on which to implicitly stratify. In effect, these are the variables on which the data is first sorted. The order in which the variables are listed matters: the first variable listed will have the most variability in the sampled data. List the variables in order of decreasing importance, because variables listed near the end have less effect on the stratification.
size_var: is a variable indicating the size of the unit each row represents. This allows you to select a sample that accounts for differences in the size of each unit. For example, if each row represents a school, an appropriate size_var could be the number of students attending the school, so that schools serving more students are more likely to be selected. This is important when you are doing multiple stages of sampling, like first sampling schools and then sampling classrooms within schools. Without setting size_var in this example, each school would be equally likely to be selected, so classrooms in small schools would be more likely to be selected: a small school with only a few classrooms would have the same chance of being selected as a large school with many classrooms. Default is NULL.
random_num: is a random number to control the random sampling process so that results are reproducible. Default is 1.
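
The two-stage argument behind size_var can be checked with a little arithmetic. The sketch below does not call sample_implicit; it uses two made-up schools, treats the number of classrooms as the size measure, and assumes that accounting for size means selecting schools with probability proportional to size. That proportionality is an assumption for illustration, not a statement about how the package computes its probabilities.

```r
# Two hypothetical schools (made-up numbers): one small, one large.
schools <- data.frame(
  school     = c("small", "large"),
  classrooms = c(2, 20)
)

# Stage 2 is the same in both designs: pick 1 classroom at random
# within whichever school was sampled.
p_classroom_within <- 1 / schools$classrooms

# Design A: stage 1 picks 1 school with equal probability.
p_school_equal <- rep(1 / nrow(schools), nrow(schools))

# Design B (assumed): stage 1 picks 1 school with probability
# proportional to its size.
p_school_size <- schools$classrooms / sum(schools$classrooms)

# Overall chance that any one classroom ends up in the sample.
schools$p_equal <- p_school_equal * p_classroom_within
schools$p_size  <- p_school_size * p_classroom_within
schools
```

Under the equal-probability design, a classroom in the small school is ten times as likely to be sampled as one in the large school (0.25 versus 0.025). When schools are selected in proportion to size, every classroom has the same chance (1/22).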

Value

A data.frame with the same number of rows as the original data, but sorted differently and with a new variable called in_sample that is TRUE if the row was selected for the sample and FALSE otherwise.

Details

sample_implicit implicitly samples units by first sorting the data on the key variables indicated. It uses a serpentine sort, which alternates between ascending and descending orders so that any two adjacent rows in the sorted data are as similar as possible. See serpentine for more details about serpentine sorting and vignette("sample_implicit") for a longer discussion of why it's useful. Serpentine sorting is commonly used by NCES to achieve implicit stratification.
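
To make the sorting step concrete, here is a minimal base R sketch of a serpentine sort on two variables, followed by one possible selection step. The helper serpentine_two is hypothetical and is not the package's serpentine function. The selection step (a systematic sample with a random start) is an assumption about how a sample might be drawn from the sorted rows; this page does not state how sample_implicit selects rows after sorting.

```r
# Hypothetical helper: sort by `first` ascending, then alternate the
# direction of `second` from one group of `first` to the next.
serpentine_two <- function(data, first, second) {
  groups <- split(data, data[[first]])   # groups come back ordered by value
  sorted <- lapply(seq_along(groups), function(i) {
    g <- groups[[i]]
    descending <- i %% 2 == 0            # flip direction on even groups
    g[order(g[[second]], decreasing = descending), , drop = FALSE]
  })
  do.call(rbind, sorted)
}

# am = 0 cars run from low to high mpg, then am = 1 cars run from
# high to low mpg, so rows on either side of the am boundary are
# close in mpg.
cars_sorted <- serpentine_two(mtcars, "am", "mpg")
cars_sorted[, c("am", "mpg")]

# Assumed selection step: take every k-th row from a random start, so
# the sample is spread across the whole sorted list.
set.seed(1)
n <- 7
interval <- nrow(cars_sorted) / n
start <- runif(1, min = 0, max = interval)
picks <- floor(start + interval * (0:(n - 1))) + 1
cars_sorted$in_sample <- seq_len(nrow(cars_sorted)) %in% picks
table(cars_sorted$in_sample)
```

Because the picks are evenly spaced through a list in which neighbouring rows are similar, the selected rows cover the range of both sort variables instead of clustering in one part of it.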

Examples

# Sample 7 cars after implicitly stratifying on am and mpg.
sampled_cars <- sample_implicit(data = mtcars, n = 7, am, mpg)
sampled_cars
#> # A tibble: 32 x 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb in_sample
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>    
#>  1  10.4     8  460    215  3     5.42  17.8     0     0     3     4 FALSE    
#>  2  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4 TRUE     
#>  3  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4 FALSE    
#>  4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 FALSE    
#>  5  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4 FALSE    
#>  6  15.2     8  276.   180  3.07  3.78  18       0     0     3     3 TRUE     
#>  7  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2 FALSE    
#>  8  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2 FALSE    
#>  9  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3 FALSE    
#> 10  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3 FALSE    
#> # ... with 22 more rows

# Once the sample is complete, it's easy to compare sampled to non-sampled cars
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
sampled_cars %>%
  group_by(in_sample) %>%
  summarize(mean_mpg = mean(mpg))
#> # A tibble: 2 x 2
#>   in_sample mean_mpg
#>   <lgl>        <dbl>
#> 1 FALSE         20.0
#> 2 TRUE          20.5

# Using implicit stratification gets us more variation on variables of interest than just randomly
# selecting rows. For example, if we chose 3 cars at random, we might not get variability on the
# variables of interest. Here we compare how a simple random sample and an implicit sample of
# 3 cars are distributed across am
set.seed(12)
simplesample <- sample_n(mtcars, 3)
implicitsample <- sample_implicit(data = mtcars, n = 3, am, mpg)
count(simplesample, am)
#>   am n
#> 1  0 1
#> 2  1 2
implicitsample %>%
  filter(in_sample) %>%
  count(am)
#> # A tibble: 2 x 2
#>      am     n
#>   <dbl> <int>
#> 1     0     2
#> 2     1     1

# You'll get different, but reproducible results if you change the random number
sampled_cars1 <- sample_implicit(data = mtcars, n = 5, am, mpg)
sampled_cars2 <- sample_implicit(data = mtcars, n = 5, am, mpg, random_num = 2)
sampled_cars1
#> # A tibble: 32 x 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb in_sample
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>    
#>  1  10.4     8  460    215  3     5.42  17.8     0     0     3     4 FALSE    
#>  2  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4 TRUE     
#>  3  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4 FALSE    
#>  4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 FALSE    
#>  5  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4 FALSE    
#>  6  15.2     8  276.   180  3.07  3.78  18       0     0     3     3 FALSE    
#>  7  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2 FALSE    
#>  8  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2 FALSE    
#>  9  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3 TRUE     
#> 10  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3 FALSE    
#> # ... with 22 more rows
sampled_cars2
#> # A tibble: 32 x 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb in_sample
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>    
#>  1  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4 FALSE    
#>  2  10.4     8  460    215  3     5.42  17.8     0     0     3     4 TRUE     
#>  3  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4 FALSE    
#>  4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 FALSE    
#>  5  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4 FALSE    
#>  6  15.2     8  276.   180  3.07  3.78  18       0     0     3     3 FALSE    
#>  7  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2 FALSE    
#>  8  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2 TRUE     
#>  9  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3 FALSE    
#> 10  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3 FALSE    
#> # ... with 22 more rows

# If you have a variable that represents size, it's easy to account for that when selecting
# the sample
sample_implicit(data = mtcars, n = 5, am, mpg, size_var = hp)
#> # A tibble: 32 x 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb in_sample
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>    
#>  1  10.4     8  460    215  3     5.42  17.8     0     0     3     4 FALSE    
#>  2  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4 TRUE     
#>  3  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4 FALSE    
#>  4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 FALSE    
#>  5  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4 FALSE    
#>  6  15.2     8  276.   180  3.07  3.78  18       0     0     3     3 FALSE    
#>  7  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2 TRUE     
#>  8  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2 FALSE    
#>  9  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3 FALSE    
#> 10  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3 FALSE    
#> # ... with 22 more rows