Sample data with implicit stratification
sample_implicit.Rd
sample_implicit
draws a random sample of n units from a data.frame in a way that maximizes
variation on variables of interest. For example, it can randomly sample schools in a way that
ensures the sampled schools have as much variation as possible on key characteristics, like the
percent of students of color or average achievement. Implicit stratification is a common method
to sample units in an educational setting: the National Center for Education Statistics (NCES)
frequently uses this approach when deciding who to survey or test.
Arguments
- data
is the data.frame on which rows will be sampled
- n
is the number of rows to be sampled
- ...
are the variables on which to implicitly stratify. In effect, these are the variables on which the data are first sorted. The order in which the variables are listed matters: the first variable listed will have the most variability in the sampled data. List the variables in order of decreasing importance, because variables listed near the end have less effect on the stratification.
- size_var
is a variable indicating the size of the row. This allows you to select a sample that accounts for differences in the size of each unit. For example, if each row represents a school, an appropriate size_var could be the number of students attending the school, so that schools serving more students are more likely to be selected. This is important when you are doing multiple stages of sampling, like first sampling schools and then sampling classrooms within schools. Without setting size_var in this example, each school would be equally likely to be selected. Classrooms in small schools would then be more likely to be selected, because a small school with only a few classrooms has the same chance of being selected as a large school with many classrooms. Default is NULL.
- random_num
is a random number to control the random sampling process so that results are reproducible. Default is 1.
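The two-stage argument behind size_var can be checked with a little arithmetic. The sketch below is only an illustration of selection proportional to size in general; it does not call sample_implicit and makes no claim about how the function uses size_var internally. The two schools and their classroom counts are made up, and the design (one school at stage 1, one classroom at stage 2) is an assumption chosen to keep the numbers simple.

```r
# Hypothetical two-school frame; names and counts are invented for illustration.
schools <- data.frame(
  school = c("small", "large"),
  n_classrooms = c(2, 20)
)

# Stage 2: one classroom is drawn at random within the selected school.
p_stage2 <- 1 / schools$n_classrooms

# Stage 1 without a size variable: every school has the same chance.
p_school_equal <- rep(1 / nrow(schools), nrow(schools))

# Stage 1 with a size variable: chances are proportional to size.
p_school_sized <- schools$n_classrooms / sum(schools$n_classrooms)

# Overall chance that one particular classroom ends up in the sample.
schools$p_classroom_equal <- p_school_equal * p_stage2
schools$p_classroom_sized <- p_school_sized * p_stage2
schools
```

Under equal school probabilities, a given classroom in the small school is ten times as likely to be drawn as one in the large school. When schools are drawn in proportion to size, every classroom has the same overall chance.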
Value
A data.frame with the same number of rows as the original data, but sorted differently and with a new
variable called in_sample
that is TRUE if the row was selected for the sample or FALSE
otherwise.
Details
sample_implicit
implicitly samples units by first sorting the data on the key variables
indicated. It uses a serpentine sort, which alternates between ascending and descending orders
so that any two adjacent rows in the sorted data are as similar as possible. See
serpentine
for more details about serpentine sorting and
vignette("sample_implicit")
for a longer discussion of why it's useful. Serpentine
sorting is commonly used by NCES to achieve implicit stratification.
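To make the alternating order concrete, here is a minimal base R sketch of a serpentine sort on two variables. It is not the package's serpentine function: serpentine_sketch is a hypothetical helper written for this illustration, and it handles only one grouping variable and one value variable.

```r
# Illustrative helper, NOT the package's serpentine() implementation.
# Sorts by group_var ascending, then alternates the direction of value_var
# from one group to the next.
serpentine_sketch <- function(df, group_var, value_var) {
  groups <- sort(unique(df[[group_var]]))
  pieces <- lapply(seq_along(groups), function(i) {
    piece <- df[df[[group_var]] == groups[i], , drop = FALSE]
    # Odd-numbered groups ascend, even-numbered groups descend.
    descending <- i %% 2 == 0
    piece[order(piece[[value_var]], decreasing = descending), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

sorted_cars <- serpentine_sketch(mtcars, "am", "mpg")
sorted_cars[, c("am", "mpg")]
```

In the result, mpg rises within the am = 0 rows and falls within the am = 1 rows, so the rows on either side of the group boundary are both high-mpg cars rather than jumping from the highest value back to the lowest.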
Examples
# Sample 7 cars after implicitly stratifying on am and mpg.
sampled_cars <- sample_implicit(data = mtcars, n = 7, am, mpg)
sampled_cars
#> # A tibble: 32 x 12
#> mpg cyl disp hp drat wt qsec vs am gear carb in_sample
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 10.4 8 460 215 3 5.42 17.8 0 0 3 4 FALSE
#> 2 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4 TRUE
#> 3 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4 FALSE
#> 4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 FALSE
#> 5 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4 FALSE
#> 6 15.2 8 276. 180 3.07 3.78 18 0 0 3 3 TRUE
#> 7 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2 FALSE
#> 8 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 FALSE
#> 9 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 FALSE
#> 10 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3 FALSE
#> # ... with 22 more rows
# Once the sample is complete, it's easy to compare sampled to non-sampled cars
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
sampled_cars %>%
group_by(in_sample) %>%
summarize(mean_mpg = mean(mpg))
#> # A tibble: 2 x 2
#> in_sample mean_mpg
#> <lgl> <dbl>
#> 1 FALSE 20.0
#> 2 TRUE 20.5
# Using implicit stratification gets us more variation on variables of interest than just randomly
# selecting rows. For example, if we chose 3 cars, we might not get variability on the variables
# of interest. In this case, sample_implicit got us more variability on mpg than a simple random
# sample
set.seed(12)
simplesample <- sample_n(mtcars, 3)
implicitsample <- sample_implicit(data = mtcars, n = 3, am, mpg)
count(simplesample, am)
#> am n
#> 1 0 1
#> 2 1 2
implicitsample %>%
filter(in_sample) %>%
count(am)
#> # A tibble: 2 x 2
#> am n
#> <dbl> <int>
#> 1 0 2
#> 2 1 1
# You'll get different but reproducible results if you change the random number
sampled_cars1 <- sample_implicit(data = mtcars, n = 5, am, mpg)
sampled_cars2 <- sample_implicit(data = mtcars, n = 5, am, mpg, random_num = 2)
sampled_cars1
#> # A tibble: 32 x 12
#> mpg cyl disp hp drat wt qsec vs am gear carb in_sample
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 10.4 8 460 215 3 5.42 17.8 0 0 3 4 FALSE
#> 2 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4 TRUE
#> 3 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4 FALSE
#> 4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 FALSE
#> 5 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4 FALSE
#> 6 15.2 8 276. 180 3.07 3.78 18 0 0 3 3 FALSE
#> 7 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2 FALSE
#> 8 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 FALSE
#> 9 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 TRUE
#> 10 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3 FALSE
#> # ... with 22 more rows
sampled_cars2
#> # A tibble: 32 x 12
#> mpg cyl disp hp drat wt qsec vs am gear carb in_sample
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4 FALSE
#> 2 10.4 8 460 215 3 5.42 17.8 0 0 3 4 TRUE
#> 3 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4 FALSE
#> 4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 FALSE
#> 5 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4 FALSE
#> 6 15.2 8 276. 180 3.07 3.78 18 0 0 3 3 FALSE
#> 7 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2 FALSE
#> 8 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 TRUE
#> 9 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 FALSE
#> 10 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3 FALSE
#> # ... with 22 more rows
# If you have a variable that represents size, it's easy to account for that when selecting
# the sample
sample_implicit(data = mtcars, n = 5, am, mpg, size_var = hp)
#> # A tibble: 32 x 12
#> mpg cyl disp hp drat wt qsec vs am gear carb in_sample
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 10.4 8 460 215 3 5.42 17.8 0 0 3 4 FALSE
#> 2 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4 TRUE
#> 3 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4 FALSE
#> 4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 FALSE
#> 5 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4 FALSE
#> 6 15.2 8 276. 180 3.07 3.78 18 0 0 3 3 FALSE
#> 7 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2 TRUE
#> 8 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 FALSE
#> 9 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 FALSE
#> 10 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3 FALSE
#> # ... with 22 more rows