Creating Post-Stratification Weights in R

posted May 2, 2016, 4:56 PM by Mia Costa   [ updated Oct 21, 2017, 4:03 PM ]
My last post went over how to use a pre-specified variable in a dataset to weight survey data. But what if you want to create the weights yourself? If you know the population values of the demographics that you wish to weight on, you can create survey weights using an approach known as post-stratification raking.

As an example, we will use the 2014 Massachusetts Exit Poll data (http://people.umass.edu/schaffne/mass_exitpoll_2014.dta). The dataset already has a sampling weight included, to adjust for the stratified cluster sample approach taken. However, because of unequal response rates among different demographic groups, we also need to do additional post-stratification weighting. Here is what we know about the demographic composition of the electorate in 2014 according to voting records:

Female: 53%
White: 88%
Black: 4%
Hispanic/Latino: 5%
Age 18-29: 7%
Age 65 or over: 30%

We can use this information to produce post-stratification weights for the survey. First, the easiest thing to do is create indicator variables for each category we will weight on. If you are uninterested in this process, you can skip this next bit of code to get right to the survey weighting. Otherwise, we can recode as follows:

# assumes the exit poll data have already been read into dat
# each indicator starts as NA, then is set to 1 or 0 where the source variable is known
dat$female <- NA
dat$female[dat$gender == "Female"] <- 1
dat$female[dat$gender == "Male"] <- 0
dat$white <- NA
dat$white[dat$race == "White"] <- 1
dat$white[dat$race != "White"] <- 0
dat$black <- NA
dat$black[dat$race == "Black"] <- 1
dat$black[dat$race != "Black"] <- 0
dat$hispanic <- NA
dat$hispanic[dat$race == "Hispanic/Latino"] <- 1
dat$hispanic[dat$race != "Hispanic/Latino"] <- 0
dat$age18_29 <- NA
dat$age18_29[dat$age == "18-29"] <- 1
dat$age18_29[dat$age != "18-29"] <- 0
dat$age65_over <- NA
dat$age65_over[dat$age == "65 or over"] <- 1
dat$age65_over[dat$age != "65 or over"] <- 0
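
Before moving on, it can be worth checking that the recodes did what we expect and seeing how far the unweighted sample is from the population values. Here is a minimal sketch on a made-up data frame (toy is hypothetical and only borrows the column names from the exit poll); with the real data you would run the same checks on dat.

```r
# Hypothetical toy data standing in for the exit poll
toy <- data.frame(
  gender = c("Female", "Male", "Female", "Female", NA),
  race   = c("White", "Black", "White", "Hispanic/Latino", "White"),
  stringsAsFactors = FALSE
)

# Same recoding pattern as above
toy$female <- NA
toy$female[toy$gender == "Female"] <- 1
toy$female[toy$gender == "Male"] <- 0
toy$white <- NA
toy$white[toy$race == "White"] <- 1
toy$white[toy$race != "White"] <- 0

# Cross-tabulate the original variable against the indicator, keeping NAs visible
check <- table(toy$gender, toy$female, useNA = "ifany")
print(check)

# Unweighted sample shares, to compare with the population values
sample.share <- colMeans(toy[, c("female", "white")], na.rm = TRUE)
print(sample.share)
```

Respondents with a missing gender stay NA on the indicator, so they show up in their own row and column of the table rather than being counted as 0.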

Now, before we weight to the population values of these variables, we create a survey design object (like in my previous post) but without specifying any weights. Note that if you want to re-weight starting from some other set of weights (for example, if the sample is already weighted to account for the stratified cluster sample), you can specify weights as we did in the last example, and the new weights will be calculated starting from that weight variable. Here, we leave out the weights argument, which creates an unweighted survey design object. You should get a warning message following the command telling you that you did not supply any weights or probabilities (this is what you want).

library(survey)
svy.dat.unweighted <- svydesign(ids = ~1, data = dat)
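
If you did want to start from an existing weight instead, a sketch might look like the following. The data frame toy and the column name weight are made up for illustration; substitute whatever the sampling weight is called in your data.

```r
library(survey)

# Hypothetical toy data; `weight` is an assumed name for an existing sampling weight
toy <- data.frame(
  female = c(1, 0, 1, 1, 0, 0),
  weight = c(1.5, 0.5, 1.0, 1.0, 0.8, 1.2)
)

# Design that starts from the existing weights instead of equal weights
svy.toy.weighted <- svydesign(ids = ~1, weights = ~weight, data = toy)

# The weights stored in the design are the ones we supplied
base.w <- as.numeric(weights(svy.toy.weighted))
print(base.w)
```

Raking a design like this one adjusts the supplied weights rather than starting from weights of 1.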

Now, we can use the rake command to weight to the population values. To do this, we first set up data frames that specify how often each level occurs in the population: 

# create data frames based on the population values
female.dist <- data.frame(female = c("0", "1"),
                          Freq = nrow(dat) * c(0.47, 0.53))
white.dist <- data.frame(white = c("0", "1"),
                         Freq = nrow(dat) * c(0.12, 0.88))
black.dist <- data.frame(black = c("0", "1"),
                         Freq = nrow(dat) * c(0.96, 0.04))
hispanic.dist <- data.frame(hispanic = c("0", "1"),
                            Freq = nrow(dat) * c(0.95, 0.05))
age18.dist <- data.frame(age18_29 = c("0", "1"),
                         Freq = nrow(dat) * c(0.93, 0.07))
age65.dist <- data.frame(age65_over = c("0", "1"),
                         Freq = nrow(dat) * c(0.70, 0.30))

The first vector in each data frame describes the levels of the associated variable (0 and 1 in our case). The second vector describes the corresponding frequencies. In each data frame, I multiply the known population proportion for each level by the number of rows in the data set we are computing the weights for. This gives absolute frequencies: the number of observations that should fall in each level. For example, in the female distribution data frame, I specify the proportions as .47 and .53 because, on the variable female, we are looking for 47% to be 0 (male) and 53% to be 1 (female). In the next data frame, the proportions are .12 and .88, because we are looking for 12% to take on a 0 for the white variable (indicating non-white) and 88% to take on a value of 1 (indicating white). And so on...
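
As a quick check on this arithmetic, the sketch below uses a hypothetical sample size of 1,000 in place of nrow(dat) and confirms that each set of proportions adds up to the full sample. The list props simply collects the proportions used above; n, props and target.freq are names I made up for the illustration.

```r
n <- 1000  # hypothetical sample size, standing in for nrow(dat)

# Population proportions for levels 0 and 1 of each indicator
props <- list(
  female     = c(0.47, 0.53),
  white      = c(0.12, 0.88),
  black      = c(0.96, 0.04),
  hispanic   = c(0.95, 0.05),
  age18_29   = c(0.93, 0.07),
  age65_over = c(0.70, 0.30)
)

# Absolute target frequencies: one column per variable, one row per level
target.freq <- sapply(props, function(p) n * p)
rownames(target.freq) <- c("0", "1")
print(target.freq)

# Every column should add up to n
print(colSums(target.freq))
```

If one of the columns did not add up to the sample size, the margins would disagree about the total and raking could not match all of them at once.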

Next, we combine the unweighted survey design object, the data frames with the population distributions just created, and the marginal distributions of our sample:

svy.dat <- rake(design = svy.dat.unweighted, 
                sample.margins = list(~female, ~white, ~black, ~hispanic, ~ age18_29, ~age65_over), 
                population.margins = list(female.dist, white.dist, black.dist, hispanic.dist, age18.dist, age65.dist), 
                control = list( maxit = 25))

control = list(maxit = 25) simply sets the maximum number of times that R will go through the raking process. It probably makes sense to set this to at least 10 (the default in rake), but you may want to set it higher when you are weighting on more variables.
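
To give a sense of what each pass of the raking process is doing, here is a bare-bones sketch in base R on a made-up data set with two variables. This is only an illustration of the idea (adjust the weights to match one margin, then the next, and repeat until the weights stop changing); it is not the code that rake runs, and the names toy, targets, w and the stopping tolerance are my own assumptions.

```r
# Made-up sample: 8 respondents, two 0/1 indicators
toy <- data.frame(
  female = c(1, 1, 1, 0, 0, 0, 0, 0),
  white  = c(1, 1, 0, 1, 1, 1, 0, 1)
)

# Population proportions for levels "0" and "1"
targets <- list(
  female = c("0" = 0.47, "1" = 0.53),
  white  = c("0" = 0.12, "1" = 0.88)
)

n <- nrow(toy)
w <- rep(1, n)        # start from equal weights
maxit <- 25           # same role as maxit in the rake call
iterations.used <- 0

for (iter in seq_len(maxit)) {
  w.old <- w
  for (v in names(targets)) {
    level   <- as.character(toy[[v]])
    current <- ave(w, level, FUN = sum)          # current weighted total of each row's level
    wanted  <- unname(targets[[v]][level]) * n   # target total of each row's level
    w <- w * wanted / current                    # scale weights so this margin matches
  }
  iterations.used <- iter
  if (max(abs(w - w.old)) < 1e-8) break          # stop once the weights settle
}

# Weighted shares after raking, to compare with the targets
raked.share <- c(female = sum(w[toy$female == 1]) / sum(w),
                 white  = sum(w[toy$white == 1]) / sum(w))
print(iterations.used)
print(round(raked.share, 4))
```

Matching one margin usually knocks the previous one slightly off, which is why the adjustment has to be repeated and why a cap on the number of passes is needed.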

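Once you execute this command, weights are added to your survey design object. By asking for a summary of the weights, we can see that the mean is 1 (it should always be 1 here, because the population frequencies were scaled to add up to the sample size) and that the weights range from .49 to 12.55.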
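
One way to ask for that summary is summary(weights(...)). As a self-contained sketch, the example below rakes a small made-up data set (toy, with only two of the variables) and then summarizes the weights; with the real data the last step would be summary(weights(svy.dat)).

```r
library(survey)

# Made-up sample: 8 respondents, two 0/1 indicators
toy <- data.frame(
  female = c(1, 1, 1, 0, 0, 0, 0, 0),
  white  = c(1, 1, 0, 1, 1, 1, 0, 1)
)

# Unweighted design (expect the "no weights or probabilities" warning here)
svy.toy.unweighted <- svydesign(ids = ~1, data = toy)

# Population margins scaled to the toy sample size
female.dist <- data.frame(female = c("0", "1"),
                          Freq = nrow(toy) * c(0.47, 0.53))
white.dist <- data.frame(white = c("0", "1"),
                         Freq = nrow(toy) * c(0.12, 0.88))

svy.toy <- rake(design = svy.toy.unweighted,
                sample.margins = list(~female, ~white),
                population.margins = list(female.dist, white.dist),
                control = list(maxit = 25))

# Extract the raked weights and summarize them
w.toy <- as.numeric(weights(svy.toy))
print(summary(w.toy))
```

Because the target frequencies add up to the number of rows, the weights add up to the number of rows too, which is what puts their mean at 1.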


Often, pollsters trim their weights to ensure that no single respondent has too much influence over the point estimates. Only 3 respondents receive a weight in excess of 8, so we might wish to cap their weights at 8. We can do this by setting the lower and upper limits of the weights with trimWeights. The weight that is trimmed off is redistributed among the observations that were not trimmed, so the total weight stays the same. This reallocation can push some weights over the upper limit, so I set strict to TRUE here, which tells R to repeat the trimming until all weights are within the limits:

svy.dat <- trimWeights(svy.dat, lower=.4, upper=8, strict=TRUE)
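
To see what the strict argument changes, here is a small sketch on made-up weights (the data frame toy and its column w0 are hypothetical). One weight is well above 8 and another sits just below it, so redistributing the trimmed amount is enough to push the second one over the limit.

```r
library(survey)

# Made-up starting weights: one far above the cap, one just below it
toy <- data.frame(w0 = c(10, 7.9, 1, 1, 1, 1))
svy.toy <- svydesign(ids = ~1, weights = ~w0, data = toy)

# Trim once, and trim repeatedly until everything is inside the limits
svy.loose  <- trimWeights(svy.toy, lower = .4, upper = 8, strict = FALSE)
svy.strict <- trimWeights(svy.toy, lower = .4, upper = 8, strict = TRUE)

w.loose  <- as.numeric(weights(svy.loose))
w.strict <- as.numeric(weights(svy.strict))

# Largest weight under each setting
print(max(w.loose))
print(max(w.strict))

# Total weight before and after trimming
print(c(original = sum(toy$w0), loose = sum(w.loose), strict = sum(w.strict)))
```

With strict = FALSE the largest weight can end up above the upper limit after the single pass, while strict = TRUE keeps trimming until none do. In both cases the total weight is unchanged.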