Sampling for Common Metrics
sample_implicit.Rmd
In most TNTP projects, we collect data on common metrics to understand the extent to which students, families, teachers, or others in an entire school, district, network, or state have access to a valuable resource. It’s often not practical to collect all the possible data: we can’t collect every assignment students in a school receive over an entire year, we can’t observe every lesson, and we can’t survey every teacher. Instead, we gather data on the common metrics for a subset of classrooms, teachers, etc. Which people or classrooms we collect data from is consequential; if we do not choose them randomly, the common metric data might not be representative of the broader population about which we’re hoping to make inferences.
Yet, if we choose participants randomly, we risk getting a random sample of data that does not contain enough variation to make equity or group comparisons. For example, if we randomly choose 5 schools from a district of 100, there is a chance we randomly choose primary schools only, or get five schools that look similar demographically. It would be better if we could randomly pick schools while at the same time maximizing the chance the schools vary on the characteristics that matter the most to us and the outcome(s) we’re measuring.
This is exactly what sampling with implicit stratification does, and it’s why serious educational research organizations like NCES use it often. The tntpmetrics package contains tools for you to easily draw your own samples with implicit stratification. This will help ensure the common metric data you collect is representative of your target population and will allow you to compare results across equity groups.
Practice data: cms_data
To demonstrate how to apply the implicit stratification sampling functions in tntpmetrics, we will use school-level data on all public schools in the Charlotte-Mecklenburg School district in the 2018-2019 school year. This data, cms_data, has already been cleaned and processed as part of an old Academic Diagnostic contract. This cleaning includes filling in (or imputing) missing values for some newer schools with the district mean so that the data has no missingness.
Included in this data is a categorical variable indicating grade-levels served (Primary, Middle, High, or Other), the proportion of students receiving free or reduced-price lunch, the proportion of students of color, the state school accountability score (spg_score) and letter grade (spg_grade), and the number of enrolled students.
head(cms_data)
#> # A tibble: 6 x 7
#> school_name grade_level_cat frl_percent soc_percent spg_score spg_grade
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 EASTOVER ELEMENTA~ Primary 0.248 0.323 66 C
#> 2 MCALPINE ELEMENTA~ Primary 0.229 0.348 67 C
#> 3 BARNETTE ELEMENTA~ Primary 0.239 0.355 75 B
#> 4 TRILLIUM SPRINGS ~ Primary 0.211 0.629 76 B
#> 5 TORRENCE CREEK EL~ Primary 0.206 0.339 77 B
#> 6 BALLANTYNE ELEMEN~ Primary 0.152 0.477 79 B
#> # ... with 1 more variable: total_students_2018 <dbl>
How does implicit stratification work? It’s mostly just sorting data
To learn more about the specifics of implicit stratification, read the TNTP memo “placeholder_memo_title” (link to Adam’s, potentially revised, memo). In short, we sort the data by the variable(s) on which we want to ensure variation, pick a random starting spot on this sorted list, and then select every kth row after it. The step size k is the number of units in the data divided by the number of units to sample. For example, there are 170 schools in the CMS data set, so to sample 10 schools, k is 170 / 10 = 17: pick a random starting row between 1 and 17, and then take every 17th school until the end of the data. Each school picked along the way is part of the sample.
Importantly, because we first sorted the data on the characteristics we care about, we ensured that schools are the most spread out they could be on these variables – i.e., the sort ensures that schools at the top of the data set look different on these variables than schools on the bottom. Thus, when we walk through the data, counting every 17th school, it’s unlikely we get a sample of schools that all look the same on these characteristics. In fact, the only time this could happen is if schools in the district don’t actually vary much on the characteristics we specify.
To be concrete, let’s manually draw an implicitly stratified sample of 10 schools from cms_data, where we want to ensure schools vary on the percent of students of color.
# Sort on soc_percent
sorted_data <- cms_data %>%
arrange(soc_percent)
# Pick a random number between 1 and 17, and then count every 17 integers ten times
set.seed(1)
random_start <- sample(1:17, 1)
selected_rows <- seq(from = random_start, by = 17, length.out = 10)
# Keep the rows corresponding to the selected rows
implicit_sampled_data <- sorted_data %>%
slice(selected_rows)
Let’s see how much variability we have on soc_percent in our sampled data, compared to just randomly selecting 10 schools without implicit stratification.
# Randomly pick 10 schools without stratification
set.seed(1)
nonimplicit_sampled_data <- cms_data %>%
sample_n(10)
# Use standard deviation to compare variability on soc_percent
sd(implicit_sampled_data$soc_percent)
#> [1] 0.2728191
sd(nonimplicit_sampled_data$soc_percent)
#> [1] 0.2947937
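In this particular draw, the standard deviation of soc_percent in the implicit sample (0.273) is slightly lower than in the simple random sample (0.295). A single simple random draw can happen to be well spread out. The advantage of implicit stratification is that the spread no longer depends on luck: every possible starting row yields a sample that spans the sorted list from top to bottom.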
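A single pair of draws is a noisy basis for comparison. As a rough sketch on simulated data (a sorted vector of random proportions standing in for soc_percent, not cms_data), the code below repeats both procedures 1,000 times and compares how much the sample standard deviation moves from draw to draw. The helper names draw_implicit and draw_random are made up for this illustration.

```r
set.seed(42)

# Simulated stand-in for a school-level proportion (NOT cms_data), already sorted
pct <- sort(runif(170))
n_sample <- 10
k <- length(pct) / n_sample  # sampling interval: 170 / 10 = 17

# Systematic draw from the sorted list: random start, then every kth value
draw_implicit <- function() {
  start <- sample(seq_len(k), 1)
  pct[seq(from = start, by = k, length.out = n_sample)]
}

# Simple random draw of the same size
draw_random <- function() sample(pct, n_sample)

sd_implicit <- replicate(1000, sd(draw_implicit()))
sd_random <- replicate(1000, sd(draw_random()))

# Compare how widely the sample standard deviation ranges under each method
range(sd_implicit)
range(sd_random)
```

Under these assumptions, the implicit draws should give a much narrower range of standard deviations, because there are only 17 possible samples and each one covers the whole sorted list.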
Sorting on more than one variable: Serpentine sorting
Drawing an implicit stratified sample is easy with just one characteristic of interest. We just did it manually above! It might seem just as easy with two characteristics of interest, as we could just sort on both variables. But look what happens when we sort cms_data by grade-level and proportion of students receiving FRL.
cms_data %>%
select(school_name, grade_level_cat, frl_percent, spg_score) %>%
arrange(grade_level_cat, frl_percent) %>%
slice(24:29)
#> # A tibble: 6 x 4
#> school_name grade_level_cat frl_percent spg_score
#> <chr> <chr> <dbl> <dbl>
#> 1 WEST MECKLENBURG HIGH High 0.998 54
#> 2 HARDING UNIVERSITY HIGH High 0.998 52
#> 3 GARINGER HIGH High 0.999 59
#> 4 JAY M ROBINSON MIDDLE Middle 0.0994 92
#> 5 SOUTH CHARLOTTE MIDDLE Middle 0.135 86
#> 6 COMMUNITY HOUSE MIDDLE Middle 0.166 85
The data is first sorted (alphabetically) on grade-level, and then within each grade-level the data is sorted in ascending order on the proportion of students receiving FRL. The code above shows where in the data it goes from high schools to middle schools. Notice how the proportion of FRL students makes a huge jump from Garinger High to Jay M Robinson Middle. Once the data gets to the next grade-level category, it must start over again from the lowest values of frl_percent. This leads to large differences in consecutive rows at these junctures. Though these schools already differ by grade-level, this disconnect means they’re very different on other key factors, too. Not only do they differ on their FRL proportion, but because this variable is linked to other characteristics, we see that they’re also very different on something like school achievement. These disconnected junctures in the sort are magnified if we sort by even more than two variables.
Instead, we want consecutive rows of the data to be as similar as possible on the characteristics of interest. This is what helps us obtain the most variability possible in our sample. To do this, we must use a sorting approach known as serpentine, where instead of always starting over with an ascending sort at each juncture, we alternate between ascending and descending sorts to ensure the juncture keeps the values as similar as possible. tntpmetrics has a built-in function to perform serpentine sorts. Let’s apply it to the CMS data and look at the same juncture as above.
cms_data %>%
select(school_name, grade_level_cat, frl_percent, spg_score) %>%
serpentine(grade_level_cat, frl_percent) %>%
slice(24:29)
#> # A tibble: 6 x 4
#> school_name grade_level_cat frl_percent spg_score
#> <chr> <chr> <dbl> <dbl>
#> 1 WEST MECKLENBURG HIGH High 0.998 54
#> 2 HARDING UNIVERSITY HIGH High 0.998 52
#> 3 GARINGER HIGH High 0.999 59
#> 4 ALBEMARLE ROAD MIDDLE Middle 0.997 50
#> 5 RANSON MIDDLE Middle 0.997 42
#> 6 MCCLINTOCK MIDDLE Middle 0.997 55
Now the first middle school in the list is the one with the highest proportion of students receiving FRL. The implicit stratification sample function in tntpmetrics uses this serpentine sort, so you do not need to use the function serpentine directly, though it’s available to you as part of the package if you need it for other purposes or need to modify the default sampling approach. It works much like dplyr::arrange: simply list the variables you want to serpentine sort.
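To make the alternating pattern concrete, here is a minimal base R sketch of a two-variable serpentine sort on made-up data. The helper serpentine_two is hypothetical and written only for this illustration; it is not the tntpmetrics implementation, which may differ in its details.

```r
# Hypothetical helper: sort by `group`, then alternate the direction of the
# `value` sort from one group to the next (ascending, descending, ascending, ...)
serpentine_two <- function(df, group, value) {
  groups <- sort(unique(df[[group]]))
  pieces <- lapply(seq_along(groups), function(i) {
    piece <- df[df[[group]] == groups[i], , drop = FALSE]
    # Odd-numbered groups ascend; even-numbered groups descend
    piece[order(piece[[value]], decreasing = (i %% 2 == 0)), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

# Made-up data, not cms_data
toy <- data.frame(
  level = c("High", "High", "High", "Middle", "Middle", "Middle"),
  frl = c(0.9, 0.2, 0.5, 0.1, 0.95, 0.6)
)

sorted_toy <- serpentine_two(toy, "level", "frl")
sorted_toy
```

In the result, the last High row (0.9) sits next to the first Middle row (0.95), so the juncture between the two groups joins similar values instead of jumping from the highest value back to the lowest.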
The order of sorting variables matters
Even with serpentine sorting, you will get a differently sorted list if you first sort on grade-level and then FRL percent compared to first sorting on FRL percent and then grade-level. The reasons are obvious, but it’s important to point out. In general, it’s best to sort first on the variable that is most important to you, as this will be the variable on which your data is most spread out. Then, add the remaining variables in order from most to least important. If you sort on many variables, the last variables you include will have a smaller effect on each row’s position, as much of the data order is already “locked in”.
Making sorts categorical
What if we wanted to ensure variability on both percent FRL and SPG Score in a school? The obvious answer is to just include them both in a sort.
cms_data %>%
select(school_name, grade_level_cat, frl_percent, spg_score) %>%
serpentine(frl_percent, spg_score) %>%
slice(1:10)
#> # A tibble: 10 x 4
#> school_name grade_level_cat frl_percent spg_score
#> <chr> <chr> <dbl> <dbl>
#> 1 PROVIDENCE SPRING ELEMENTARY Primary 0.0422 90
#> 2 ELON PARK ELEMENTARY Primary 0.0624 88
#> 3 POLO RIDGE ELEMENTARY Primary 0.0655 88
#> 4 ELIZABETH LANE ELEM Primary 0.094 83
#> 5 JAY M ROBINSON MIDDLE Middle 0.0994 92
#> 6 ARDREY KELL HIGH High 0.103 92
#> 7 PROVIDENCE HIGH High 0.104 93
#> 8 HAWK RIDGE ELEMENTARY Primary 0.112 86
#> 9 PARK ROAD MONTESSORI Primary 0.112 86
#> 10 SOUTH CHARLOTTE MIDDLE Middle 0.135 86
Notice how the SPG score barely seems sorted at all. That is because the data is first sorted on FRL percent, and because this variable’s value is nearly unique to each school, once the data is sorted on it there is little to no room left to sort on anything else.
When you want to include multiple non-categorical variables in your implicit stratification, you have two options:
- Don’t do it; just pick one non-categorical variable. This approach might be worthwhile if the two variables of interest are closely related. In these cases, adding the second variable doesn’t add much to the stratification. But if the variables are not closely related and it’s not critically important to keep one of the variables as is, your best option is to
- Make the variables categorical. This is the more common approach as it allows you to keep both variables in the stratification process.
In the CMS example, let’s turn percent FRL and percent SOC into categorical variables.
cms_data %<>%
mutate(
frl_cat = case_when(
frl_percent < 0.25 ~ "< 25%",
frl_percent < 0.50 ~ "25-50%",
frl_percent < 0.75 ~ "50-75%",
frl_percent <= 1.00 ~ "> 75%"
),
frl_cat = ordered(frl_cat, levels = c("< 25%", "25-50%", "50-75%", "> 75%")),
soc_cat = case_when(
soc_percent < 0.25 ~ "< 25%",
soc_percent <= 0.50 ~ "25-50%",
soc_percent <= 0.75 ~ "50-75%",
soc_percent <= 1.00 ~ "> 75%"
),
soc_cat = ordered(soc_cat, levels = c("< 25%", "25-50%", "50-75%", "> 75%"))
)
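Note that the two case_when blocks above treat boundary values differently: frl_cat uses < at 0.50 and 0.75 while soc_cat uses <=, so a school at exactly 0.50 lands in "50-75%" for FRL but in "25-50%" for SOC. If you want one consistent rule, base R’s cut() is a possible alternative. The sketch below uses made-up proportions and assumes bins that are closed on the left.

```r
# Made-up proportions, including values that sit exactly on a bin boundary
pct <- c(0.10, 0.25, 0.50, 0.75, 1.00)

pct_cat <- cut(
  pct,
  breaks = c(0, 0.25, 0.50, 0.75, 1),
  labels = c("< 25%", "25-50%", "50-75%", "> 75%"),
  right = FALSE,          # bins are [lower, upper)
  include.lowest = TRUE,  # with right = FALSE, this closes the top bin so 1.00 is kept
  ordered_result = TRUE   # return an ordered factor, like ordered() above
)

data.frame(pct, pct_cat)
```

Whichever rule you choose, applying the same one to every variable makes the category labels easier to interpret.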
We can then try the sort again.
cms_data %>%
select(school_name, grade_level_cat, frl_cat, spg_score) %>%
serpentine(frl_cat, spg_score) %>%
slice(1:10)
#> # A tibble: 10 x 4
#> school_name grade_level_cat frl_cat spg_score
#> <chr> <chr> <ord> <dbl>
#> 1 EASTOVER ELEMENTARY Primary < 25% 66
#> 2 MCALPINE ELEMENTARY Primary < 25% 67
#> 3 BARNETTE ELEMENTARY Primary < 25% 75
#> 4 TRILLIUM SPRINGS MONTESSORI Primary < 25% 76
#> 5 DAVIDSON ELEMENTARY Primary < 25% 76
#> 6 TORRENCE CREEK ELEMENTARY Primary < 25% 77
#> 7 SHARON ELEMENTARY Primary < 25% 79
#> 8 BALLANTYNE ELEMENTARY Primary < 25% 79
#> 9 OLDE PROVIDENCE ELEMENTARY Primary < 25% 80
#> 10 J.V. WASHAM ELEMENTARY Primary < 25% 81
There, things look better. It’s okay to implicitly stratify by a non-categorical variable, but it usually should be your last variable listed, and in most cases you should only have one of these types of variables.
Drawing an implicit sample
With an understanding of how implicitly stratified sampling works, actually drawing the sample is straightforward. Use the sample_implicit function on your data and indicate how many rows you want sampled. You do not need to sort the data ahead of time with this function, as it contains a place to indicate the variables on which you want to implicitly stratify. After running this function, you will get your same data back (sorted appropriately) with a new variable called in_sample that is TRUE if the row was selected and FALSE if not.
For example, let’s sample 20 CMS schools after implicitly stratifying on grade-level, proportion FRL, proportion SOC, and the SPG score.
cms_sample <- sample_implicit(
data = cms_data,
n = 20,
grade_level_cat, frl_cat, soc_cat, spg_score
)
cms_sample %>%
select(school_name, in_sample) %>%
slice(1:10)
#> # A tibble: 10 x 2
#> school_name in_sample
#> <chr> <lgl>
#> 1 WILLIAM AMOS HOUGH HIGH FALSE
#> 2 LEVINE MIDDLE COLLEGE HIGH FALSE
#> 3 PROVIDENCE HIGH TRUE
#> 4 ARDREY KELL HIGH FALSE
#> 5 OLYMPIC HIGH FALSE
#> 6 MALLARD CREEK HIGH FALSE
#> 7 CATO MIDDLE COLLEGE HIGH FALSE
#> 8 SOUTH MECKLENBURG HIGH FALSE
#> 9 BUTLER HIGH FALSE
#> 10 HOPEWELL HIGH FALSE
To get just the 20 sampled schools, you can filter on in_sample:
cms_sample %>%
filter(in_sample)
#> # A tibble: 20 x 10
#> school_name grade_level_cat frl_percent soc_percent spg_score spg_grade
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 PROVIDENCE HIGH High 0.104 0.275 93 A
#> 2 MYERS PARK HIGH High 0.308 0.396 83 B
#> 3 CHARLOTTE TEACHE~ High 0.606 0.819 77 B
#> 4 JAMES MARTIN MID~ Middle 0.987 0.962 37 F
#> 5 NORTHRIDGE MIDDLE Middle 0.996 0.953 58 C
#> 6 FRANCIS BRADLEY ~ Middle 0.352 0.471 73 B
#> 7 BAILEY MIDDLE Middle 0.189 0.248 87 A
#> 8 LAWRENCE ORR ELE~ Primary 0.996 0.980 69 C
#> 9 MERRY OAKS INTER~ Primary 0.973 0.979 62 C
#> 10 HUNTINGTOWNE FAR~ Primary 0.987 0.914 59 C
#> 11 HIDDEN VALLEY EL~ Primary 0.993 0.986 55 C
#> 12 GOVERNORS VILLAG~ Primary 0.996 0.969 52 D
#> 13 OAKDALE ELEMENTA~ Primary 0.995 0.956 46 D
#> 14 ALLENBROOK ELEME~ Primary 0.994 0.964 36 F
#> 15 ELIZABETH TRADIT~ Primary 0.5 0.782 70 B
#> 16 MATTHEWS ELEMENT~ Primary 0.340 0.381 75 B
#> 17 MYERS PARK TRADI~ Primary 0.404 0.637 70 B
#> 18 IRWIN ACADEMIC C~ Primary 0.256 0.763 83 B
#> 19 BAIN ELEMENTARY Primary 0.226 0.299 83 B
#> 20 PROVIDENCE SPRIN~ Primary 0.0422 0.301 90 A
#> # ... with 4 more variables: total_students_2018 <dbl>, frl_cat <ord>,
#> # soc_cat <ord>, in_sample <lgl>
This makes comparing sampled schools to non-sampled schools easy.
cms_sample %>%
group_by(in_sample) %>%
summarize(mean_frl = mean(frl_percent))
#> # A tibble: 2 x 2
#> in_sample mean_frl
#> <lgl> <dbl>
#> 1 FALSE 0.647
#> 2 TRUE 0.612
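The example above samples 20 of 170 schools, so the sampling interval is 8.5 rather than a whole number. This vignette does not describe how sample_implicit handles that case internally. One common approach, sketched below as an assumption rather than as the package’s method, is to draw a fractional random start and round each selection point up to the next row. The helper flag_systematic and the toy data are hypothetical.

```r
# Hypothetical helper: flag n rows of an already-sorted data frame using a
# fractional sampling interval
flag_systematic <- function(df, n) {
  k <- nrow(df) / n                                # interval, may be fractional
  start <- runif(1, min = 0, max = k)              # random start within the first interval
  picks <- ceiling(start + k * (seq_len(n) - 1))   # round each selection point up to a row
  df$in_sample <- seq_len(nrow(df)) %in% picks
  df
}

set.seed(7)

# Made-up, pre-sorted data standing in for a sorted school list
toy <- data.frame(id = 1:170, pct = sort(runif(170)))

flagged <- flag_systematic(toy, n = 20)
table(flagged$in_sample)
```

Because the interval is at least 1, each selection point rounds up to a different row, so exactly n rows are flagged.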
Accounting for size differences
The implicit stratification approach outlined above treats each row of data equally. In many cases that makes sense, for example if each row is a teacher or a classroom. However, in some cases each row of data might represent a substantially different number of subunits, like classrooms or students. This is often the case when sampling schools. Treating each school equally means that students in small schools are more likely to be studied than students in large schools: a small school is just as likely to be chosen as a large one, but once a school is chosen, its students are more likely to have one of their classrooms visited when there are fewer classrooms to choose from.
sample_implicit includes an option to specify a variable in your data representing a measure of size. It will then account for this size and choose schools with a probability proportional to their size.
In the CMS data, we can use the total number of students in the schools as a measure of size and account for this when we draw the sample.
cms_sample_withsize <- sample_implicit(
data = cms_data,
n = 20,
grade_level_cat, frl_cat, soc_cat, spg_score,
size_var = total_students_2018
)
cms_sample_withsize %>%
filter(in_sample)
#> # A tibble: 20 x 10
#> school_name grade_level_cat frl_percent soc_percent spg_score spg_grade
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 LEVINE MIDDLE CO~ High 0.205 0.426 99 A
#> 2 SOUTH MECKLENBUR~ High 0.432 0.615 79 B
#> 3 PERFORMANCE LEAR~ High 0.609 0.742 45 D
#> 4 VANCE HIGH High 0.990 0.966 61 C
#> 5 JAMES MARTIN MID~ Middle 0.987 0.962 37 F
#> 6 MCCLINTOCK MIDDLE Middle 0.997 0.882 55 C
#> 7 ALEXANDER GRAHAM~ Middle 0.309 0.404 68 C
#> 8 SOUTH CHARLOTTE ~ Middle 0.135 0.294 86 A
#> 9 COCHRANE COLLEGI~ Other 0.996 0.978 38 F
#> 10 JOSEPH W GRIER A~ Primary 0.999 0.958 64 C
#> 11 RIVER OAKS ACADE~ Primary 0.995 0.873 59 C
#> 12 J H GUNN ELEMENT~ Primary 0.996 0.895 56 C
#> 13 GOVERNORS VILLAG~ Primary 0.996 0.969 52 D
#> 14 HORNETS NEST ELE~ Primary 0.995 0.970 46 D
#> 15 VAUGHAN ACADEMY ~ Primary 0.606 0.819 59 C
#> 16 LONG CREEK ELEME~ Primary 0.530 0.728 62 C
#> 17 ENDHAVEN ELEMENT~ Primary 0.286 0.503 75 B
#> 18 MALLARD CREEK EL~ Primary 0.479 0.899 59 C
#> 19 ELIZABETH LANE E~ Primary 0.094 0.27 83 B
#> 20 PROVIDENCE SPRIN~ Primary 0.0422 0.301 90 A
#> # ... with 4 more variables: total_students_2018 <dbl>, frl_cat <ord>,
#> # soc_cat <ord>, in_sample <lgl>
Notice how incorporating size increased the number of secondary schools in the sample. Because these schools typically contain more students, including size increased the chance they’d be chosen.
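For intuition about sampling with probability proportional to size, here is a minimal sketch of one standard way to do it systematically: lay the units end to end on a line whose segments are as long as each unit’s size, then step along that line at a fixed interval. This is an illustration on made-up enrollments, not a description of how sample_implicit implements size_var.

```r
set.seed(2019)

# Made-up enrollments for 170 units (NOT the CMS enrollments). In practice the
# rows would be sorted on the stratification variables before this step.
size <- sample(200:1500, 170, replace = TRUE)
n_sample <- 20

cum_size <- cumsum(size)                  # right edge of each unit's segment
interval <- sum(size) / n_sample          # step length along the cumulative size line
start <- runif(1, min = 0, max = interval)
points <- start + interval * (seq_len(n_sample) - 1)

# Unit i owns the segment (cum_size[i - 1], cum_size[i]]; count how many right
# edges fall strictly below each point to find the unit that contains it
picks <- findInterval(points, cum_size, left.open = TRUE) + 1

# Compare the average size of selected units to the average over all units
mean(size[picks])
mean(size)
```

Larger units own longer segments, so a selection point is more likely to land in them. A unit larger than the interval could be hit more than once; the toy sizes here are kept below the interval to avoid that case.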
Other sampling approaches
Implicit stratified sampling is not the only technique. Rather than create functions to account for every sampling situation, tntpmetrics leaves many other approaches to existing R functions and packages, used either on their own or in combination with sample_implicit.
Simple random sampling
If you don’t need or want to select a stratified sample, it’s easiest to use already available functions. For a simple random sample (i.e., the equivalent of throwing each row of data into a hat and selecting n of them at random), I recommend sample_n from dplyr. This is the approach used earlier in the vignette to choose 10 random schools.
set.seed(123)
cms_data %>%
sample_n(10)
#> # A tibble: 10 x 9
#> school_name grade_level_cat frl_percent soc_percent spg_score spg_grade
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 MALLARD CREEK HI~ High 0.455 0.843 75 B
#> 2 PARK ROAD MONTES~ Primary 0.112 0.319 86 A
#> 3 REEDY CREEK ELEM~ Primary 0.721 0.910 63 C
#> 4 ALBEMARLE ROAD M~ Middle 0.997 0.936 50 D
#> 5 HUNTERSVILLE ELE~ Primary 0.305 0.349 83 B
#> 6 CHARLOTTE ENGINE~ Other 0.454 0.784 83 B
#> 7 COCHRANE COLLEGI~ Other 0.996 0.978 38 F
#> 8 PERFORMANCE LEAR~ High 0.609 0.742 45 D
#> 9 NATIONS FORD ELE~ Primary 0.996 0.979 59 C
#> 10 RIVER OAKS ACADE~ Primary 0.995 0.873 59 C
#> # ... with 3 more variables: total_students_2018 <dbl>, frl_cat <ord>,
#> # soc_cat <ord>
Explicitly stratifying
Explicitly stratifying a sample typically means selecting a fixed number of units for explicit groups (or strata). For example, if I wanted to ensure I selected 3 schools of each grade-level type, I’d first split the data by grade-level type, then randomly select schools in each group. Again, it’s easiest to implement this with already available tools: by combining sample_n and group_by from dplyr.
set.seed(123)
cms_data %>%
group_by(grade_level_cat) %>%
sample_n(3)
#> # A tibble: 12 x 9
#> # Groups: grade_level_cat [4]
#> school_name grade_level_cat frl_percent soc_percent spg_score spg_grade
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 NORTH MECKLENBUR~ High 0.536 0.821 61 C
#> 2 MALLARD CREEK HI~ High 0.455 0.843 75 B
#> 3 ROCKY RIVER HIGH High 0.636 0.923 53 D
#> 4 EASTWAY MIDDLE Middle 0.997 0.962 40 D
#> 5 MCCLINTOCK MIDDLE Middle 0.997 0.882 55 C
#> 6 SOUTHWEST MIDDLE~ Middle 0.545 0.770 49 D
#> 7 HAWTHORNE ACADEM~ Other 0.974 0.926 52 C
#> 8 CHARLOTTE ENGINE~ Other 0.454 0.784 83 B
#> 9 COCHRANE COLLEGI~ Other 0.996 0.978 38 F
#> 10 PARK ROAD MONTES~ Primary 0.112 0.319 86 A
#> 11 MOUNTAIN ISLAND ~ Primary 0.584 0.718 59 C
#> 12 NATIONS FORD ELE~ Primary 0.996 0.979 59 C
#> # ... with 3 more variables: total_students_2018 <dbl>, frl_cat <ord>,
#> # soc_cat <ord>
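Both examples above use sample_n, which still works but was superseded by slice_sample in dplyr 1.0.0. As a small sketch, the same two patterns with slice_sample look like this; the built-in mtcars data stands in for cms_data, with cyl playing the role of grade_level_cat.

```r
library(dplyr)

set.seed(123)

# Simple random sample of 10 rows
simple_draw <- mtcars %>%
  slice_sample(n = 10)

# Explicitly stratified sample: 3 rows from each cyl group
stratified_draw <- mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 3) %>%
  ungroup()

count(stratified_draw, cyl)
```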
Explicit stratification guarantees a specific number of schools for each stratum. This differs from implicitly stratifying, where the sample will have units (e.g., schools) from the different strata, but there is no preordained or set number of units selected from each stratum. In fact, if some strata contain few units, there is a chance zero units from those strata will be chosen. Explicitly stratifying the data ensures each stratum is represented. This could be good or bad: if you force a unit to be chosen from each stratum, then the data may no longer be representative if the strata vary wildly in size. One would need to apply weights to the collected data to get back to representative values, which is a process seldom used at TNTP. At the same time, forcing a unit to be chosen from each stratum ensures no group is excluded, even if the data taken together is no longer representative of the overall population. Which approach to use depends on the needs of the project. If you’re looking to get data that is representative of an entire school, district, network, etc., implicitly stratifying the data will likely get you there more easily.
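To illustrate what applying weights means here, the sketch below uses the usual design weight for an explicitly stratified sample: the number of units in the stratum divided by the number sampled from it. All stratum sizes and FRL values are made up for the illustration and are not the CMS figures.

```r
# Made-up explicitly stratified sample: 3 schools per grade-level type
sampled <- data.frame(
  stratum = rep(c("High", "Middle", "Other", "Primary"), each = 3),
  frl = c(0.30, 0.40, 0.50,   # High
          0.50, 0.60, 0.70,   # Middle
          0.90, 0.95, 1.00,   # Other
          0.40, 0.50, 0.60)   # Primary
)

# Hypothetical number of schools in each stratum of the full population
stratum_sizes <- c(High = 30, Middle = 40, Other = 10, Primary = 90)

# Design weight: each sampled school stands in for N_h / n_h schools
n_per_stratum <- table(sampled$stratum)
sampled$weight <- as.numeric(stratum_sizes[sampled$stratum]) /
  as.numeric(n_per_stratum[sampled$stratum])

unweighted_mean <- mean(sampled$frl)
weighted_mean <- weighted.mean(sampled$frl, w = sampled$weight)

c(unweighted = unweighted_mean, weighted = weighted_mean)
```

In this toy example the small "Other" stratum has very high FRL values, so the unweighted mean overstates the population value; the weights shrink that stratum’s influence back to its share of the population.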
Combining explicit and implicit stratification using purrr::map
Using existing functions, we can also combine explicit and implicit stratification. For example, what if we wanted three schools from each grade-level type, but within each grade-level we did not want to pick schools randomly, and instead wanted to implicitly stratify on the proportion of students receiving FRL and the SPG score? We can combine both stratification approaches with the help of the map function from the purrr package.
To implement this, we split the data by grade-level type, then apply the sample_implicit function separately to each grade-level type. We’ll use the map_df function to combine the results back into a single data.frame all in one step.
library(purrr)
cms_data %>%
split(.$grade_level_cat) %>%
map_df(~ sample_implicit(.x, n = 3, frl_cat, spg_score)) %>%
filter(in_sample) %>%
select(school_name, grade_level_cat)
#> # A tibble: 12 x 2
#> school_name grade_level_cat
#> <chr> <chr>
#> 1 PROVIDENCE HIGH High
#> 2 HOPEWELL HIGH High
#> 3 CHARLOTTE TEACHER EARLY COLLEGE- UNCC High
#> 4 SOUTH CHARLOTTE MIDDLE Middle
#> 5 J M ALEXANDER MIDDLE Middle
#> 6 NORTHEAST MIDDLE Middle
#> 7 CHARLOTTE ENGINEERING EARLY COLLEGE-UNCC Other
#> 8 NORTHWEST SCHOOL OF THE ARTS Other
#> 9 COCHRANE COLLEGIATE ACADEMY Other
#> 10 J.V. WASHAM ELEMENTARY Primary
#> 11 REEDY CREEK ELEMENTARY Primary
#> 12 HIGHLAND RENAISSANCE ACADEMY Primary
Instead of selecting the same number of schools for each grade-level, we can vary how many schools are chosen by giving the map function additional information. In this case, we’ll use map2_df so that, for each grade-level, the value of n is the corresponding element in a list we give it. Note that map2_df pairs its two inputs by position, not by name, so the list must be in the same order as the split data (alphabetical by grade-level here). In the example below, we want to select 2 high schools, 3 middle schools, 4 primary schools, and 1 “other” school.
cms_data %>%
split(.$grade_level_cat) %>%
map2_df(
.x = .,
.y = list("High" = 2, "Middle" = 3, "Other" = 1, "Primary" = 4),
.f = ~ sample_implicit(.x, n = .y, frl_cat, spg_score)
) %>%
filter(in_sample) %>%
select(school_name, grade_level_cat)
#> # A tibble: 10 x 2
#> school_name grade_level_cat
#> <chr> <chr>
#> 1 LEVINE MIDDLE COLLEGE HIGH High
#> 2 INDEPENDENCE HIGH High
#> 3 SOUTH CHARLOTTE MIDDLE Middle
#> 4 J M ALEXANDER MIDDLE Middle
#> 5 NORTHEAST MIDDLE Middle
#> 6 NORTHWEST SCHOOL OF THE ARTS Other
#> 7 BALLANTYNE ELEMENTARY Primary
#> 8 PARKSIDE ELEMENTARY Primary
#> 9 ALBEMARLE ROAD ELEMENTARY Primary
#> 10 DAVID COX ROAD ELEMENTARY Primary
The above examples assume a familiarity with purrr, and indeed there are other ways to use sample_implicit with other existing R functions to meet your needs. The point is simply to highlight how you can build upon sample_implicit to accomplish your sampling goals.
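As one example of such an alternative, the split-and-sample pattern can be written with base R’s split, Map, and do.call. The sketch below is self-contained, so it uses a hypothetical stand-in sampler, flag_systematic, and made-up data instead of sample_implicit and cms_data. It looks up each group’s n by name, so the order in which the sample sizes are listed does not matter.

```r
# Hypothetical stand-in for sample_implicit: flags n rows of a sorted data frame
flag_systematic <- function(df, n) {
  k <- nrow(df) / n
  start <- runif(1, min = 0, max = k)
  picks <- ceiling(start + k * (seq_len(n) - 1))
  df$in_sample <- seq_len(nrow(df)) %in% picks
  df
}

set.seed(11)

# Made-up data, not cms_data
toy <- data.frame(
  level = rep(c("High", "Middle", "Other", "Primary"), times = c(20, 30, 10, 60)),
  score = round(runif(120, min = 30, max = 100))
)

# Sample sizes listed in an arbitrary order on purpose
n_by_level <- c(Primary = 4, High = 2, Other = 1, Middle = 3)

groups <- split(toy, toy$level)
sampled <- Map(
  function(df, n) flag_systematic(df[order(df$score), ], n),
  groups,
  n_by_level[names(groups)]  # match sample sizes to groups by name
)
result <- do.call(rbind, sampled)

table(result$level[result$in_sample])
```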