
Post-Stratification Raking: Questions and Answers about Creating Weights

posted Feb 8, 2017, 6:33 PM by Mia Costa   [ updated Oct 21, 2017, 4:03 PM ]
Last week I received some questions about my methods tutorial on creating post-stratification weights. Problems always arise (at least for me) when trying to apply a bit of code to new data, so it's sometimes useful to see what problems you may run into and how to avoid them. Below are the reader's questions about the post-stratification raking process (in blue) and my answers. I hope they help illuminate some of the more mundane parts of preparing data for analysis.

After I run all the code in your first post about using weights, the second one on how to create your own weights, and change some of the variable names to match those in the dataset, everything goes smoothly until the last bit of code, where I get this error message:

Error in na.fail.default(list(female = c(0, 1, 1, 1, 1, 0, 1, 0, 0, 1, :
missing values in object

Under the rake command description, it says the argument for sample.margins should not contain NA values, and the original variable for gender does—how do you deal with this? I think the same problem arises for non-gender variables too when they have NA values. Do you just exclude all observations with NA values?

I haven't run into this problem much personally because most of the data I work with has complete demographic information for respondents (putting demographic questions at the end of the survey and making them mandatory helps). But yes, I would just exclude NAs. You can't weight on a value that is missing, and in your analysis you wouldn't want to mix unweighted observations in with observations that are weighted to population values.
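One way to do the exclusion, sketched with a toy data frame (the variable names `female` and `white` are placeholders borrowed from the earlier tutorial; swap in your own weighting variables): base R's `complete.cases()` keeps only the rows with no missing values on the variables you plan to rake on.

```r
# toy data: two weighting variables, some with NAs
dat <- data.frame(female = c(0, 1, NA, 1, 0),
                  white  = c(1, NA, 0, 1, 1))

weight.vars <- c("female", "white")
dat <- dat[complete.cases(dat[, weight.vars]), ]

nrow(dat)  # rows with an NA in any weighting variable are gone
```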

For my survey, I’m hoping to weight on four different variables:

1) Gender (man/woman)
2) Race/ethnicity (American Indian/Alaska Native, Asian, Black or African American, Hispanic or Latino, Native Hawaiian or Other Pacific Islander, International, Two or more races, White, Other/not specified)
3) Class year (senior, junior, sophomore, freshman, other)
4) Greek affiliation (yes, no)

Given this, how many times do you think R should go through the raking process (in terms of what I put for the "control = list(maxit)" portion of the rake call)? Is it once for each category (and corresponding indicator variable), making it 18 times in my case?

You can probably leave the entire control argument out, actually: you'll get a warning message if the raking doesn't converge, and you can increase the number of iterations from there. The default number of iterations is 10, and this is usually enough. Four is not that many variables to rake on, though your intuition is correct that it depends on the number of levels in each variable. Other things can affect convergence as well, such as having very few observations in some cells.

What do you mean by "increase the number of iterations from there"—is this something the rake function does by itself, or do I have do something additional?

You can start with the default of 10 iterations (that is, exclude control = list(maxit) altogether). If the raking doesn't converge, you'll get a warning message that says something along the lines of "Could not converge," and then you can run the code again with a larger number of iterations until you reach convergence.
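To make that concrete, here is a self-contained sketch with fake data. The 50/50 population split, the `female` variable, and the sample size are all made up for illustration; the only change from a default call is passing `control = list(maxit = 50)` to raise the cap from 10.

```r
library(survey)

set.seed(1)
dat <- data.frame(female = rbinom(200, 1, 0.4))  # fake sample that skews male

svy.dat <- svydesign(ids = ~1, data = dat, weights = NULL)

# assumed population margin: 50% women, 50% men
gender.dist <- data.frame(female = c(0, 1),
                          Freq = nrow(dat) * c(0.5, 0.5))

# raise the iteration cap from the default of 10 to 50
svy.rake <- rake(design = svy.dat,
                 sample.margins = list(~female),
                 population.margins = list(gender.dist),
                 control = list(maxit = 50))

summary(weights(svy.rake))  # weights now sum to the sample size
```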

How can I transfer the weights to the original data frame in order to easily apply them when calculating weighted means for question responses? At this point they’re in a svy.dat object I don’t really know what to do with.

Also, at some point in your example, you could tell that 3 respondents had a weight in excess of 8. How could you tell what weight each respondent got? (These questions are probably related.)

The weights are stored in your survey design object. You can pull them out with the weights() function, put them in a new variable in your data frame, and then tabulate that variable.

For example:

dat$weight <- weights(svy.dat) # make a variable that contains your weights for each respondent
names(dat) # double check to make sure it’s there
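Once the weights are in the data frame, flagging large ones (like the three above 8 mentioned in the earlier post) is a one-liner. A base-R sketch with made-up weight values standing in for the real column:

```r
# toy weights standing in for dat$weight after raking
dat <- data.frame(weight = c(0.8, 1.2, 9.5, 1.0, 8.3, 0.7, 11.1))

summary(dat$weight)    # overall distribution of the weights
sum(dat$weight > 8)    # how many respondents have a weight above 8
which(dat$weight > 8)  # which rows those are
```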

For some of my variables, I’m giving more options (e.g. “other” in addition to “man”/”woman” for gender) than the population values have (e.g. administrative data only has percentage for man/woman). What would you suggest as a way to deal with minor discrepancies like this?

Has anyone else come up with an estimate of how many people identify as "other" in your population of interest? If so, use that. If not, just make an educated guess at what the "true" value of "other gender" is in the population.

That’s a good idea. In past surveys I’ve done, non-binary options have come out to around 1 percent. In this case, I’m guessing it’s reasonable to estimate that other = 1%, men = 49.5%, and women = 49.5%, right?

I'm not sure, because I don't know the gender breakdown for your target population ([your school's] student body, right?). If you were just going to do 50-50 before, this seems reasonable. But if there are more women than men at [your school], for instance, you might want to adjust it based on what you know about the female and male student populations [at your school].
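If you do go with the 49.5/49.5/1 split, the only thing that changes is the margin data frame. A sketch, assuming a three-level `gender` variable (`n` stands in for `nrow(dat)`; the percentages are the guesses from the question, not real administrative data):

```r
n <- 500  # stands in for nrow(dat)

# assumed population split: 49.5% men, 49.5% women, 1% other
gender.dist <- data.frame(gender = c("man", "woman", "other"),
                          Freq = n * c(0.495, 0.495, 0.01))

sum(gender.dist$Freq)  # the margin totals should equal the sample size
```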

I also have an issue if I want to use Greek affiliation as a weight: at my school, freshmen can’t join Greek houses, so they’re excluded from the denominator of affiliation percentages. Is there a way to add weights based on percent Greek for all respondents but freshmen?

You could go through the whole raking process twice. The first time exclude freshmen and weight on Greek affiliation only. Then add freshmen back into the dataset and give them the baseline weight of 1. Then do it all again but start from those weights (using the weights argument in svydesign and the new weight variable you created) instead of starting with an unweighted survey object, such as: 

svy.dat.stage2 <- svydesign(ids = ~1, data = dat, weights = ~weight)
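Putting the two stages together, here is a self-contained sketch with fake data. The 60/40 Greek split, the class distribution, and the variable names are all assumptions for illustration; the point is the shape of the workflow.

```r
library(survey)

set.seed(2)
dat <- data.frame(class = sample(c("freshman", "sophomore", "junior", "senior"),
                                 300, replace = TRUE),
                  greek = rbinom(300, 1, 0.3))

# Stage 1: rake non-freshmen on Greek affiliation only
upper <- dat[dat$class != "freshman", ]
svy.upper <- svydesign(ids = ~1, data = upper, weights = NULL)
greek.dist <- data.frame(greek = c(0, 1),
                         Freq = nrow(upper) * c(0.6, 0.4))  # assumed population values
svy.upper <- rake(svy.upper, sample.margins = list(~greek),
                  population.margins = list(greek.dist))

# Transfer the stage-1 weights; freshmen get a baseline weight of 1
dat$weight <- 1
dat$weight[dat$class != "freshman"] <- weights(svy.upper)

# Stage 2: start from those weights and rake everyone on the remaining variables
svy.dat.stage2 <- svydesign(ids = ~1, data = dat, weights = ~weight)
```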

I think you might have implied this, but is it true that I don’t need to create two different indicator weight variables for something like gender (i.e. where they’re complements to 100 percent)? What about in the case of class year—should I create 5 different indicators (senior, junior, sophomore, freshman, other that all add up to 100 percent) in this case?

In my last post, all of the variables I used take on values of 0 or 1. For example, instead of having one "race" variable, I had a variable called "white", where 1 indicates a white respondent and 0 indicates a respondent of another race, and another variable called "black", and so on. But you don't have to create separate indicator variables for each level of every variable. For example, you could just use one "gender" variable with female/male/other as its levels. In the case of college class, instead of breaking freshman/sophomore/junior into separate variables, you can collapse them into one and specify the population frequencies for each level.

So for example:

class.dist <- data.frame(class = c("freshman", "sophomore", "junior", "senior"),
Freq = nrow(dat) * c(0.25, 0.25, 0.25, 0.25))
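That margin plugs straight into rake(). A self-contained sketch (the sample distribution is fake, with freshmen deliberately overrepresented so the raking has something to correct):

```r
library(survey)

set.seed(3)
dat <- data.frame(class = sample(c("freshman", "sophomore", "junior", "senior"),
                                 200, replace = TRUE,
                                 prob = c(0.4, 0.3, 0.2, 0.1)))

# assumed population margin: equal quarters in each class year
class.dist <- data.frame(class = c("freshman", "sophomore", "junior", "senior"),
                         Freq = nrow(dat) * c(0.25, 0.25, 0.25, 0.25))

svy.dat <- svydesign(ids = ~1, data = dat, weights = NULL)
svy.rake <- rake(svy.dat, sample.margins = list(~class),
                 population.margins = list(class.dist))

svytable(~class, svy.rake)  # each class now carries a quarter of the total weight
```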

Let me know if you would have answered these questions differently or if something needs further clarification!