Sampling for Common Metrics

In most TNTP projects, we collect data on common metrics to understand the extent to which students, families, teachers, or others in an entire school, district, network, or state have access to a valuable resource. It’s often not practical to collect all the possible data – we can’t collect every assignment students in a school receive over an entire year, we can’t observe every lesson, we can’t survey every teacher – so we gather data on the common metrics for a subset of classrooms, teachers, etc. Which classrooms or people we collect data from is consequential: if we do not choose them randomly, the common metric data might not be representative of the broader population about which we’re hoping to make inferences.

Yet, if we choose participants randomly, we risk getting a random sample of data that does not contain enough variation to make equity or group comparisons. For example, if we randomly choose five schools from a district of 100, there is a chance we randomly choose primary schools only, or get five schools that look similar demographically. It would be better if we could randomly pick schools while at the same time maximizing the chance the schools vary on the characteristics that matter the most to us and the outcome(s) we’re measuring.

This is exactly what sampling with implicit stratification does, and it’s why serious educational research organizations like NCES use it often. The tntpmetrics package contains tools for you to easily draw your own samples with implicit stratification. This will help ensure the common metric data you collect is representative of your target population and will allow you to compare results across equity groups.

Practice data: cms_data

To demonstrate how to apply the implicit stratification sampling functions in tntpmetrics, we will use school-level data on all public schools in the Charlotte-Mecklenburg School district in the 2018-2019 school year. This data, cms_data, has already been cleaned and processed as part of an old Academic Diagnostic contract. This cleaning includes filling in (or imputing) missing values for some newer schools with the district mean so that the data has no missing values.

Included in this data is a categorical variable indicating grade-levels served (Primary, Middle, High, or Other), the proportion of students receiving free or reduced-price lunch, the proportion of students of color, the state school accountability score (spg_score) and letter grade (spg_grade), and the number of enrolled students.

head(cms_data)
#> # A tibble: 6 x 7
#>   school_name        grade_level_cat frl_percent soc_percent spg_score spg_grade
#>   <chr>              <chr>                 <dbl>       <dbl>     <dbl> <chr>    
#> 1 EASTOVER ELEMENTA~ Primary               0.248       0.323        66 C        
#> 2 MCALPINE ELEMENTA~ Primary               0.229       0.348        67 C        
#> 3 BARNETTE ELEMENTA~ Primary               0.239       0.355        75 B        
#> 4 TRILLIUM SPRINGS ~ Primary               0.211       0.629        76 B        
#> 5 TORRENCE CREEK EL~ Primary               0.206       0.339        77 B        
#> 6 BALLANTYNE ELEMEN~ Primary               0.152       0.477        79 B        
#> # ... with 1 more variable: total_students_2018 <dbl>

How does implicit stratification work? It’s mostly just sorting data

To learn more about the specifics of implicit stratification, read the TNTP memo “placeholder_memo_title” (link to Adam’s, potentially revised, memo). In short, we sort the data by the variable(s) on which we want to ensure variation, pick a random starting spot on this sorted list, and then take every kth row to get all of our selected units. The step size k is the number of units in the data divided by the number of units to sample. For example, there are 170 schools in the CMS data set. To sample 10 schools, k is 170 / 10 = 17: pick a random starting row between 1 and 17, and then count every 17th school until the end of the data. Each school picked along the way is part of the sample.

Importantly, because we first sorted the data on the characteristics we care about, we ensured that schools are the most spread out they could be on these variables – i.e., the sort ensures that schools at the top of the data set look different on these variables than schools on the bottom. Thus, when we walk through the data, counting every 17th school, it’s unlikely we get a sample of schools that all look the same on these characteristics. In fact, the only time this could happen is if schools in the district don’t actually vary much on the characteristics we specify.
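
The counting rule described above works cleanly when the number of units is a multiple of the sample size (170 / 10 = 17). When it is not, one common variant uses a fractional step and rounds up. The sketch below illustrates that variant in base R. The helper name systematic_rows is made up for this illustration, and this is not necessarily how tntpmetrics handles the case.

```r
# Hypothetical helper: pick n_sample row numbers out of n_total rows
# using a fractional sampling interval.
systematic_rows <- function(n_total, n_sample) {
  stopifnot(n_sample >= 1, n_sample <= n_total)
  interval <- n_total / n_sample                    # e.g. 170 / 20 = 8.5
  start <- runif(1, min = 0, max = interval)        # random start within the first interval
  points <- start + interval * (0:(n_sample - 1))   # evenly spaced points along the list
  ceiling(points)                                   # round each point up to a row number
}

set.seed(1)
rows <- systematic_rows(n_total = 170, n_sample = 20)
rows
```

Because the interval is at least 1, no row can be picked twice, and consecutive picks here are always 8 or 9 rows apart.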

To be concrete, let’s manually draw an implicitly stratified sample of 10 schools from the cms_data, where we want to ensure schools vary on the percent of students of color.

# Sort on soc_percent
sorted_data <- cms_data %>%
  arrange(soc_percent)

# Pick a random number between 1 and 17, and then count every 17 integers ten times
set.seed(1)
random_start <- sample(1:17, 1)
selected_rows <- seq(from = random_start, by = 17, length.out = 10)

# Keep the rows corresponding to the selected rows
implicit_sampled_data <- sorted_data %>%
  slice(selected_rows)

Let’s see how much variability we have on soc_percent in our sampled data, compared to just randomly selecting 10 schools without implicit stratification.

# Randomly pick 10 schools without stratification
set.seed(1)
nonimplicit_sampled_data <- cms_data %>%
  sample_n(10)

# Use standard deviation to compare variability on soc_percent
sd(implicit_sampled_data$soc_percent)
#> [1] 0.2728191
sd(nonimplicit_sampled_data$soc_percent)
#> [1] 0.2947937

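In this particular draw, the standard deviation of soc_percent in the implicit sample (0.273) is slightly lower than in the simple random sample (0.295). A single simple random draw can happen to be well spread out. The benefit of implicit stratification is not a larger standard deviation in every draw, but that every draw is guaranteed to span the sorted range of soc_percent, whereas a simple random draw can, by chance, cluster at one end.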
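
A single pair of draws is only one comparison. To see how the two approaches behave across many draws, the sketch below uses made-up data (170 random values standing in for soc_percent, not the CMS data) and compares how much the sample mean moves around from one draw to the next under each approach.

```r
set.seed(42)
population <- sort(runif(170))         # made-up values, sorted as in implicit stratification
n_sample <- 10
k <- length(population) %/% n_sample   # sampling interval: 17

# There are only k possible systematic samples, one per starting row
systematic_means <- sapply(seq_len(k), function(start) {
  rows <- seq(from = start, by = k, length.out = n_sample)
  mean(population[rows])
})

# Simple random samples of the same size, repeated many times
srs_means <- replicate(1000, mean(sample(population, n_sample)))

# Compare how much the sample mean varies from draw to draw
sd(systematic_means)
sd(srs_means)
```

A smaller spread in the systematic sample means reflects that each systematic draw takes one unit from every stretch of the sorted list, while simple random draws occasionally cluster at one end.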

Sorting on more than one variable: Serpentine sorting

Drawing an implicit stratified sample is easy with just one characteristic of interest. We just did it manually above! It might seem just as easy with two characteristics of interest, as we could just sort on both variables. But look what happens when we sort the cms_data by grade-level and proportion of students receiving FRL.

cms_data %>%
  select(school_name, grade_level_cat, frl_percent, spg_score) %>%
  arrange(grade_level_cat, frl_percent) %>%
  slice(24:29)
#> # A tibble: 6 x 4
#>   school_name             grade_level_cat frl_percent spg_score
#>   <chr>                   <chr>                 <dbl>     <dbl>
#> 1 WEST MECKLENBURG HIGH   High                 0.998         54
#> 2 HARDING UNIVERSITY HIGH High                 0.998         52
#> 3 GARINGER HIGH           High                 0.999         59
#> 4 JAY M ROBINSON MIDDLE   Middle               0.0994        92
#> 5 SOUTH CHARLOTTE MIDDLE  Middle               0.135         86
#> 6 COMMUNITY HOUSE MIDDLE  Middle               0.166         85

The data is first sorted (alphabetically) on grade-level, and then within each grade-level the data is sorted in ascending order on the proportion of students receiving FRL. The code above shows where in the data it goes from high schools to middle schools. Notice how the proportion of FRL students makes a huge jump from Garinger High to Jay M Robinson Middle. Once the data gets to the next grade-level category, it must start over again from the lowest values of frl_percent. This leads to large differences between consecutive rows at these junctures. Though these schools already differ by grade-level, this disconnect means they’re very different on other key factors, too. Not only do they differ on their FRL proportion, but because this variable is linked to other characteristics, we see that they’re also very different on something like school achievement. These disconnected junctures in the sort are magnified if we sort by more than two variables.

Instead, we want adjacent rows of the data to be as similar as possible on the characteristics of interest. This is what helps us obtain the most variability possible in our sample. To do this, we must use a sorting approach known as serpentine, where instead of always starting over with an ascending sort at each juncture, we alternate between ascending and descending sorts to ensure the juncture keeps the values as similar as possible. tntpmetrics has a built-in function to perform serpentine sorts. Let’s apply it to the CMS data and look at the same juncture as above.

cms_data %>%
  select(school_name, grade_level_cat, frl_percent, spg_score) %>%
  serpentine(grade_level_cat, frl_percent) %>%
  slice(24:29)
#> # A tibble: 6 x 4
#>   school_name             grade_level_cat frl_percent spg_score
#>   <chr>                   <chr>                 <dbl>     <dbl>
#> 1 WEST MECKLENBURG HIGH   High                  0.998        54
#> 2 HARDING UNIVERSITY HIGH High                  0.998        52
#> 3 GARINGER HIGH           High                  0.999        59
#> 4 ALBEMARLE ROAD MIDDLE   Middle                0.997        50
#> 5 RANSON MIDDLE           Middle                0.997        42
#> 6 MCCLINTOCK MIDDLE       Middle                0.997        55

Now the first middle school in the list is the one with the highest proportion of students receiving FRL. The implicit stratification sample function in tntpmetrics uses this serpentine sort, so you do not need to use the function serpentine directly, though it’s available to you as part of the package if you need it for other purposes, or you need to modify the default sampling approach. It works much like dplyr::arrange: simply list the variables you want to serpentine sort.
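
To make the alternating idea concrete, here is a minimal base R sketch of a serpentine sort on two variables. The function name serpentine_two and the toy data are made up for this illustration. It is not the tntpmetrics implementation, and it only handles one grouping variable followed by one numeric variable.

```r
# Hypothetical helper: sort by group_col, then alternate ascending and
# descending order on value_col from one group to the next.
serpentine_two <- function(df, group_col, value_col) {
  groups <- sort(unique(df[[group_col]]))
  pieces <- lapply(seq_along(groups), function(i) {
    piece <- df[df[[group_col]] == groups[i], , drop = FALSE]
    descending <- i %% 2 == 0   # 1st group ascending, 2nd descending, ...
    piece[order(piece[[value_col]], decreasing = descending), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

# Made-up toy data, not the CMS data
toy <- data.frame(
  level = c("High", "Middle", "High", "Middle", "High", "Middle"),
  frl = c(0.9, 0.2, 0.1, 0.8, 0.5, 0.6),
  stringsAsFactors = FALSE
)
sorted_toy <- serpentine_two(toy, "level", "frl")
sorted_toy
```

In the sorted toy data, the High rows rise from 0.1 to 0.9 and the Middle rows then fall from 0.8 to 0.2, so the two rows on either side of the juncture (0.9 and 0.8) are close together.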

The order of sorting variables matters

Even with serpentine sorting, you will get a differently sorted list if you first sort on grade-level and then FRL percent compared to first sorting on FRL percent and then grade-level. The reasons are obvious, but it’s important to point out. In general, it’s best to first sort on the variable that is most important to you, as this will be the variable on which your data is most spread out. Then, add the remaining variables in order from most to least important. If you sort on many variables, the last variables you include will have a smaller effect on the final row order, as much of the order is already “locked in”.

Making sorts categorical

What if we wanted to ensure variability on both percent FRL and SPG Score in a school? The obvious answer is to just include them both in a sort.

cms_data %>%
  select(school_name, grade_level_cat, frl_percent, spg_score) %>%
  serpentine(frl_percent, spg_score) %>%
  slice(1:10)
#> # A tibble: 10 x 4
#>    school_name                  grade_level_cat frl_percent spg_score
#>    <chr>                        <chr>                 <dbl>     <dbl>
#>  1 PROVIDENCE SPRING ELEMENTARY Primary              0.0422        90
#>  2 ELON PARK ELEMENTARY         Primary              0.0624        88
#>  3 POLO RIDGE ELEMENTARY        Primary              0.0655        88
#>  4 ELIZABETH LANE ELEM          Primary              0.094         83
#>  5 JAY M ROBINSON MIDDLE        Middle               0.0994        92
#>  6 ARDREY KELL HIGH             High                 0.103         92
#>  7 PROVIDENCE HIGH              High                 0.104         93
#>  8 HAWK RIDGE ELEMENTARY        Primary              0.112         86
#>  9 PARK ROAD MONTESSORI         Primary              0.112         86
#> 10 SOUTH CHARLOTTE MIDDLE       Middle               0.135         86

Notice how the SPG score barely seems sorted at all. That is because the data is first sorted on FRL percent, and this variable is nearly unique to each school. Once the data is sorted on it, there is little to no room left to sort on anything else.

When you want to include multiple non-categorical variables in your implicit stratification, you have two options:

Don’t do it; just pick one non-categorical variable. This approach might be worthwhile if the two variables of interest are closely related. In these cases, adding the second variable doesn’t add much to the stratification. But if the variables are not closely related and it’s not critically important to keep one of the variables as is, your best option is to:
Make the variables categorical. This is the more common approach as it allows you to keep both variables in the stratification process.

In the CMS example, let’s turn percent FRL and percent SOC into categorical variables.

library(magrittr)  # provides the %<>% assignment pipe, which overwrites cms_data with the result
cms_data %<>%
  mutate(
    frl_cat = case_when(
      frl_percent < 0.25 ~ "< 25%",
      frl_percent < 0.50 ~ "25-50%",
      frl_percent < 0.75 ~ "50-75%",
      frl_percent <= 1.00 ~ "> 75%"
    ),
    frl_cat = ordered(frl_cat, levels = c("< 25%", "25-50%", "50-75%", "> 75%")),
    soc_cat = case_when(
      soc_percent < 0.25 ~ "< 25%",
      soc_percent <= 0.50 ~ "25-50%",
      soc_percent <= 0.75 ~ "50-75%",
      soc_percent <= 1.00 ~ "> 75%"
    ),
    soc_cat = ordered(soc_cat, levels = c("< 25%", "25-50%", "50-75%", "> 75%"))
  )
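
The same kind of binning can also be written with base R’s cut. This is an alternative sketch on a few made-up values, not the code used to build frl_cat above. Setting right = FALSE makes each bin include its lower bound, which matches the < comparisons in the frl_cat definition.

```r
# Made-up proportions chosen to sit on and around the bin boundaries
frl_values <- c(0, 0.249, 0.25, 0.5, 0.75, 1)

frl_bins <- cut(
  frl_values,
  breaks = c(0, 0.25, 0.50, 0.75, 1),
  labels = c("< 25%", "25-50%", "50-75%", "> 75%"),
  right = FALSE,          # bins are [0, 0.25), [0.25, 0.50), ...
  include.lowest = TRUE,  # with right = FALSE, this keeps a value of exactly 1 in the last bin
  ordered_result = TRUE   # return an ordered factor, like ordered() above
)
as.character(frl_bins)
```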

We can then try the sort again.

cms_data %>%
  select(school_name, grade_level_cat, frl_cat, spg_score) %>%
  serpentine(frl_cat, spg_score) %>%
  slice(1:10)
#> # A tibble: 10 x 4
#>    school_name                 grade_level_cat frl_cat spg_score
#>    <chr>                       <chr>           <ord>       <dbl>
#>  1 EASTOVER ELEMENTARY         Primary         < 25%          66
#>  2 MCALPINE ELEMENTARY         Primary         < 25%          67
#>  3 BARNETTE ELEMENTARY         Primary         < 25%          75
#>  4 TRILLIUM SPRINGS MONTESSORI Primary         < 25%          76
#>  5 DAVIDSON ELEMENTARY         Primary         < 25%          76
#>  6 TORRENCE CREEK ELEMENTARY   Primary         < 25%          77
#>  7 SHARON ELEMENTARY           Primary         < 25%          79
#>  8 BALLANTYNE ELEMENTARY       Primary         < 25%          79
#>  9 OLDE PROVIDENCE ELEMENTARY  Primary         < 25%          80
#> 10 J.V. WASHAM ELEMENTARY      Primary         < 25%          81

There, things look better. It’s okay to implicitly stratify by a non-categorical variable, but it usually should be your last variable listed, and in most cases you should only have one of these types of variables.

Drawing an implicit sample

With an understanding of how implicitly stratified sampling works, actually drawing the sample is straightforward. Use the sample_implicit function on your data and indicate how many rows you want sampled. You do not need to sort the data ahead of time with this function, as it contains a place to indicate the variables on which you want to implicitly stratify. After running this function, you will get your same data back (sorted appropriately) with a new variable called in_sample that is TRUE if the row was selected and FALSE if not.

For example, let’s sample 20 CMS schools after implicitly stratifying on grade-level, proportion FRL, proportion SOC, and the SPG score.

cms_sample <- sample_implicit(
  data = cms_data,
  n = 20,
  grade_level_cat, frl_cat, soc_cat, spg_score
)
cms_sample %>%
  select(school_name, in_sample) %>%
  slice(1:10)
#> # A tibble: 10 x 2
#>    school_name                in_sample
#>    <chr>                      <lgl>    
#>  1 WILLIAM AMOS HOUGH HIGH    FALSE    
#>  2 LEVINE MIDDLE COLLEGE HIGH FALSE    
#>  3 PROVIDENCE HIGH            TRUE     
#>  4 ARDREY KELL HIGH           FALSE    
#>  5 OLYMPIC HIGH               FALSE    
#>  6 MALLARD CREEK HIGH         FALSE    
#>  7 CATO MIDDLE COLLEGE HIGH   FALSE    
#>  8 SOUTH MECKLENBURG HIGH     FALSE    
#>  9 BUTLER HIGH                FALSE    
#> 10 HOPEWELL HIGH              FALSE

To get just the 20 sampled schools, you can filter on in_sample.

cms_sample %>%
  filter(in_sample)
#> # A tibble: 20 x 10
#>    school_name       grade_level_cat frl_percent soc_percent spg_score spg_grade
#>    <chr>             <chr>                 <dbl>       <dbl>     <dbl> <chr>    
#>  1 PROVIDENCE HIGH   High                 0.104        0.275        93 A        
#>  2 MYERS PARK HIGH   High                 0.308        0.396        83 B        
#>  3 CHARLOTTE TEACHE~ High                 0.606        0.819        77 B        
#>  4 JAMES MARTIN MID~ Middle               0.987        0.962        37 F        
#>  5 NORTHRIDGE MIDDLE Middle               0.996        0.953        58 C        
#>  6 FRANCIS BRADLEY ~ Middle               0.352        0.471        73 B        
#>  7 BAILEY MIDDLE     Middle               0.189        0.248        87 A        
#>  8 LAWRENCE ORR ELE~ Primary              0.996        0.980        69 C        
#>  9 MERRY OAKS INTER~ Primary              0.973        0.979        62 C        
#> 10 HUNTINGTOWNE FAR~ Primary              0.987        0.914        59 C        
#> 11 HIDDEN VALLEY EL~ Primary              0.993        0.986        55 C        
#> 12 GOVERNORS VILLAG~ Primary              0.996        0.969        52 D        
#> 13 OAKDALE ELEMENTA~ Primary              0.995        0.956        46 D        
#> 14 ALLENBROOK ELEME~ Primary              0.994        0.964        36 F        
#> 15 ELIZABETH TRADIT~ Primary              0.5          0.782        70 B        
#> 16 MATTHEWS ELEMENT~ Primary              0.340        0.381        75 B        
#> 17 MYERS PARK TRADI~ Primary              0.404        0.637        70 B        
#> 18 IRWIN ACADEMIC C~ Primary              0.256        0.763        83 B        
#> 19 BAIN ELEMENTARY   Primary              0.226        0.299        83 B        
#> 20 PROVIDENCE SPRIN~ Primary              0.0422       0.301        90 A        
#> # ... with 4 more variables: total_students_2018 <dbl>, frl_cat <ord>,
#> #   soc_cat <ord>, in_sample <lgl>

This makes comparing sampled schools to non-sampled schools easy.

cms_sample %>%
  group_by(in_sample) %>%
  summarize(mean_frl = mean(frl_percent))
#> # A tibble: 2 x 2
#>   in_sample mean_frl
#>   <lgl>        <dbl>
#> 1 FALSE        0.647
#> 2 TRUE         0.612

Accounting for size differences

The implicit stratification approach outlined above treats each row of data equally. In many cases that makes sense, for example if each row is a teacher or a classroom. However, in some cases each row of data might represent a substantially different number of subunits, like classrooms or students. This is often the case when sampling schools. Treating each school equally means that students in small schools are more likely to be studied than students in large schools: a small school is just as likely to be chosen as a large one, but once a school is chosen, any given classroom in a small school is more likely to be visited because there are fewer classrooms to choose from.

sample_implicit includes an option to specify a variable in your data representing a measure of size. It will then account for this size and choose schools with a probability proportional to their size.
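
One standard way to sample with probability proportional to size is systematic sampling along the cumulative size total. The sketch below illustrates that general technique in base R on made-up enrollment numbers. The function name pps_systematic is hypothetical, and this is not necessarily how sample_implicit implements size_var internally.

```r
# Hypothetical helper: return the positions of n units chosen with
# probability proportional to their sizes.
pps_systematic <- function(sizes, n) {
  interval <- sum(sizes) / n                   # size units per selection
  start <- runif(1, min = 0, max = interval)   # random start in the first interval
  points <- start + interval * (0:(n - 1))     # n evenly spaced selection points
  # Each unit owns a stretch of the cumulative total ending at cumsum(sizes);
  # a unit is selected when a selection point lands in its stretch.
  picked <- findInterval(points, cumsum(sizes), left.open = TRUE) + 1
  pmin(picked, length(sizes))                  # guard against floating-point overshoot
}

enrollment <- c(100, 2000, 150, 1800, 120, 90)   # made-up school sizes
set.seed(1)
draws <- replicate(2000, pps_systematic(enrollment, n = 2))

# Share of the 2000 draws in which each school was selected
selection_rate <- tabulate(draws, nbins = length(enrollment)) / 2000
round(selection_rate, 2)
```

The two large schools are selected in most draws and the small schools only rarely. A unit whose size exceeds the interval could be selected more than once under this scheme, so very large units usually need separate handling.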

In the CMS data, we can use the total number of students in the schools as a measure of size and account for this when we draw the sample.

cms_sample_withsize <- sample_implicit(
  data = cms_data,
  n = 20,
  grade_level_cat, frl_cat, soc_cat, spg_score,
  size_var = total_students_2018
)
cms_sample_withsize %>%
  filter(in_sample)
#> # A tibble: 20 x 10
#>    school_name       grade_level_cat frl_percent soc_percent spg_score spg_grade
#>    <chr>             <chr>                 <dbl>       <dbl>     <dbl> <chr>    
#>  1 LEVINE MIDDLE CO~ High                 0.205        0.426        99 A        
#>  2 SOUTH MECKLENBUR~ High                 0.432        0.615        79 B        
#>  3 PERFORMANCE LEAR~ High                 0.609        0.742        45 D        
#>  4 VANCE HIGH        High                 0.990        0.966        61 C        
#>  5 JAMES MARTIN MID~ Middle               0.987        0.962        37 F        
#>  6 MCCLINTOCK MIDDLE Middle               0.997        0.882        55 C        
#>  7 ALEXANDER GRAHAM~ Middle               0.309        0.404        68 C        
#>  8 SOUTH CHARLOTTE ~ Middle               0.135        0.294        86 A        
#>  9 COCHRANE COLLEGI~ Other                0.996        0.978        38 F        
#> 10 JOSEPH W GRIER A~ Primary              0.999        0.958        64 C        
#> 11 RIVER OAKS ACADE~ Primary              0.995        0.873        59 C        
#> 12 J H GUNN ELEMENT~ Primary              0.996        0.895        56 C        
#> 13 GOVERNORS VILLAG~ Primary              0.996        0.969        52 D        
#> 14 HORNETS NEST ELE~ Primary              0.995        0.970        46 D        
#> 15 VAUGHAN ACADEMY ~ Primary              0.606        0.819        59 C        
#> 16 LONG CREEK ELEME~ Primary              0.530        0.728        62 C        
#> 17 ENDHAVEN ELEMENT~ Primary              0.286        0.503        75 B        
#> 18 MALLARD CREEK EL~ Primary              0.479        0.899        59 C        
#> 19 ELIZABETH LANE E~ Primary              0.094        0.27         83 B        
#> 20 PROVIDENCE SPRIN~ Primary              0.0422       0.301        90 A        
#> # ... with 4 more variables: total_students_2018 <dbl>, frl_cat <ord>,
#> #   soc_cat <ord>, in_sample <lgl>

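Notice how incorporating size increased the number of secondary schools in the sample from seven to eight (three high schools became four) and reduced the number of primary schools from 13 to 11. Because secondary schools typically contain more students, including size increased the chance they’d be chosen.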

Other sampling approaches

Implicitly stratified sampling is not the only sampling technique. Rather than provide functions for every sampling situation, tntpmetrics leaves many other approaches to existing R functions and packages, which can be used on their own or combined with sample_implicit.

Simple random sampling

If you don’t need or want to select a stratified sample, it’s easiest to use already available functions. For a simple random sample – i.e., the equivalent of throwing each row of data into a hat and selecting n of them at random – I recommend sample_n from dplyr (in dplyr 1.0.0 and later, slice_sample(n = 10) is the newer equivalent). This is the approach used earlier in the vignette to choose 10 random schools.

set.seed(123)
cms_data %>%
  sample_n(10)
#> # A tibble: 10 x 9
#>    school_name       grade_level_cat frl_percent soc_percent spg_score spg_grade
#>    <chr>             <chr>                 <dbl>       <dbl>     <dbl> <chr>    
#>  1 MALLARD CREEK HI~ High                  0.455       0.843        75 B        
#>  2 PARK ROAD MONTES~ Primary               0.112       0.319        86 A        
#>  3 REEDY CREEK ELEM~ Primary               0.721       0.910        63 C        
#>  4 ALBEMARLE ROAD M~ Middle                0.997       0.936        50 D        
#>  5 HUNTERSVILLE ELE~ Primary               0.305       0.349        83 B        
#>  6 CHARLOTTE ENGINE~ Other                 0.454       0.784        83 B        
#>  7 COCHRANE COLLEGI~ Other                 0.996       0.978        38 F        
#>  8 PERFORMANCE LEAR~ High                  0.609       0.742        45 D        
#>  9 NATIONS FORD ELE~ Primary               0.996       0.979        59 C        
#> 10 RIVER OAKS ACADE~ Primary               0.995       0.873        59 C        
#> # ... with 3 more variables: total_students_2018 <dbl>, frl_cat <ord>,
#> #   soc_cat <ord>

Explicitly stratifying

Explicitly stratifying a sample typically means selecting a fixed number of units for explicit groups (or strata). For example, if I wanted to ensure I selected 3 schools of each grade-level type, I’d first split the data by grade-type, then randomly select schools in each group. Again, it’s easiest to implement this with already available tools: by combining sample_n and group_by from dplyr.

set.seed(123)
cms_data %>%
  group_by(grade_level_cat) %>%
  sample_n(3)
#> # A tibble: 12 x 9
#> # Groups:   grade_level_cat [4]
#>    school_name       grade_level_cat frl_percent soc_percent spg_score spg_grade
#>    <chr>             <chr>                 <dbl>       <dbl>     <dbl> <chr>    
#>  1 NORTH MECKLENBUR~ High                  0.536       0.821        61 C        
#>  2 MALLARD CREEK HI~ High                  0.455       0.843        75 B        
#>  3 ROCKY RIVER HIGH  High                  0.636       0.923        53 D        
#>  4 EASTWAY MIDDLE    Middle                0.997       0.962        40 D        
#>  5 MCCLINTOCK MIDDLE Middle                0.997       0.882        55 C        
#>  6 SOUTHWEST MIDDLE~ Middle                0.545       0.770        49 D        
#>  7 HAWTHORNE ACADEM~ Other                 0.974       0.926        52 C        
#>  8 CHARLOTTE ENGINE~ Other                 0.454       0.784        83 B        
#>  9 COCHRANE COLLEGI~ Other                 0.996       0.978        38 F        
#> 10 PARK ROAD MONTES~ Primary               0.112       0.319        86 A        
#> 11 MOUNTAIN ISLAND ~ Primary               0.584       0.718        59 C        
#> 12 NATIONS FORD ELE~ Primary               0.996       0.979        59 C        
#> # ... with 3 more variables: total_students_2018 <dbl>, frl_cat <ord>,
#> #   soc_cat <ord>

This approach guarantees a specific number of schools for each stratum. This differs from implicitly stratifying, where one will get a sample that has units (e.g., schools) from the different strata, but there is no preordained or set number of units selected from each stratum. In fact, if some strata contain few units, then there is a chance zero units from those strata will be chosen. Explicitly stratifying the data ensures each stratum is represented. This could be good or bad: if you force a unit to be chosen from each stratum, then the data may no longer be representative if the strata vary wildly in size. One would need to apply weights to the collected data to get back to representative values, which is a process seldom used at TNTP. At the same time, forcing a unit to be chosen from each stratum ensures no group is excluded, even if the data taken together is no longer representative of the overall population. Which approach to use depends on the needs of the project. If you’re looking to get data that is representative of an entire school, district, network, etc., implicitly stratifying the data will likely get you there more easily.
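
To illustrate what applying weights means here, the sketch below uses entirely made-up numbers (not CMS results): four strata of very different sizes, each with an average outcome measured from an equal-sized sample. Weighting each stratum’s average by its share of the population recovers a representative overall value, while the plain average of the four stratum averages does not.

```r
# Made-up number of schools in each stratum of the population
population_counts <- c(High = 20, Middle = 30, Other = 10, Primary = 140)

# Made-up average outcome among the sampled schools in each stratum
stratum_means <- c(High = 0.50, Middle = 0.60, Other = 0.70, Primary = 0.80)

# Treats every stratum as equally large, so small strata count too much
unweighted_mean <- mean(stratum_means)

# Weights each stratum by how many schools it represents
weighted_mean <- weighted.mean(stratum_means, w = population_counts)

c(unweighted = unweighted_mean, weighted = weighted_mean)
```

With these numbers the unweighted value is 0.65 and the weighted value is 0.735, because Primary schools make up 140 of the 200 schools but only a quarter of the sample.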

Combining explicit and implicit stratification using `purrr::map`

Using existing functions, we can also combine explicit and implicit stratification. For example, what if we wanted three schools from each grade-level type, but within each grade-level we did not want to pick schools randomly, and instead wanted to implicitly stratify on the proportion of students receiving FRL and the SPG score? We can combine both stratification approaches with the help of the map function from the purrr package.

To implement, we split the data by grade-level type, then apply the sample_implicit function separately to each grade-type. We’ll use the map_df function to combine the results back into a single data.frame all in one step.

library(purrr)
cms_data %>%
  split(.$grade_level_cat) %>%
  map_df(~ sample_implicit(.x, n = 3, frl_cat, spg_score)) %>%
  filter(in_sample)  %>%
  select(school_name, grade_level_cat)
#> # A tibble: 12 x 2
#>    school_name                              grade_level_cat
#>    <chr>                                    <chr>          
#>  1 PROVIDENCE HIGH                          High           
#>  2 HOPEWELL HIGH                            High           
#>  3 CHARLOTTE TEACHER EARLY COLLEGE- UNCC    High           
#>  4 SOUTH CHARLOTTE MIDDLE                   Middle         
#>  5 J M ALEXANDER MIDDLE                     Middle         
#>  6 NORTHEAST MIDDLE                         Middle         
#>  7 CHARLOTTE ENGINEERING EARLY COLLEGE-UNCC Other          
#>  8 NORTHWEST SCHOOL OF THE ARTS             Other          
#>  9 COCHRANE COLLEGIATE ACADEMY              Other          
#> 10 J.V. WASHAM ELEMENTARY                   Primary        
#> 11 REEDY CREEK ELEMENTARY                   Primary        
#> 12 HIGHLAND RENAISSANCE ACADEMY             Primary

Instead of selecting the same number of schools for each grade-level, we can vary how many schools are chosen by giving the map function additional information. In this case, we’ll use map2_df, which takes a second list and passes its corresponding element as the value of n for each group. Note that map2_df pairs the two lists by position, not by name, so the list of sample sizes must be in the same order as the groups produced by split (here alphabetical: High, Middle, Other, Primary). In the example below, we want to select 2 high schools, 3 middle schools, 1 “other” school, and 4 primary schools.

cms_data %>%
  split(.$grade_level_cat) %>%
  map2_df(
    .x = .,
    .y = list("High" = 2, "Middle" = 3, "Other" = 1, "Primary" = 4),
    .f = ~ sample_implicit(.x, n = .y, frl_cat, spg_score)
  ) %>%
  filter(in_sample) %>%
  select(school_name, grade_level_cat)
#> # A tibble: 10 x 2
#>    school_name                  grade_level_cat
#>    <chr>                        <chr>          
#>  1 LEVINE MIDDLE COLLEGE HIGH   High           
#>  2 INDEPENDENCE HIGH            High           
#>  3 SOUTH CHARLOTTE MIDDLE       Middle         
#>  4 J M ALEXANDER MIDDLE         Middle         
#>  5 NORTHEAST MIDDLE             Middle         
#>  6 NORTHWEST SCHOOL OF THE ARTS Other          
#>  7 BALLANTYNE ELEMENTARY        Primary        
#>  8 PARKSIDE ELEMENTARY          Primary        
#>  9 ALBEMARLE ROAD ELEMENTARY    Primary        
#> 10 DAVID COX ROAD ELEMENTARY    Primary

The above examples assume a familiarity with purrr, and there are other ways to use sample_implicit with existing R functions to meet your needs. The point is simply to highlight how you can build upon sample_implicit to accomplish your sampling goals.
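
As one such alternative, the sketch below loops over group names in base R and looks up each group’s sample size by name, which avoids relying on the order of the groups. It uses made-up toy data and a plain random draw as a stand-in for sample_implicit, so it only illustrates the split-and-look-up pattern.

```r
# Made-up toy data: four Primary schools and four High schools
toy <- data.frame(
  school = paste0("school_", 1:8),
  level = rep(c("Primary", "High"), each = 4),
  stringsAsFactors = FALSE
)

# Sample sizes keyed by group name (deliberately not in alphabetical order)
n_by_level <- c(Primary = 3, High = 1)

set.seed(123)
groups <- split(toy, toy$level)   # list of data frames named "High" and "Primary"
picked <- lapply(names(groups), function(level) {
  group <- groups[[level]]
  # Stand-in for sample_implicit: draw the requested number of rows at random
  group[sample(nrow(group), n_by_level[[level]]), , drop = FALSE]
})
result <- do.call(rbind, picked)
table(result$level)
```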

Questions?

Contact Adam Maier with questions.