
Count Models in R

posted Mar 14, 2016, 10:50 AM by Mia Costa   [ updated Mar 18, 2016, 4:50 PM ]
To demonstrate how to analyze count dependent variables, we first need to create a dependent variable that is a count. Again use the 2008 CCES survey (http://people.umass.edu/schaffne/cces08.dta). The variable we will create is going to be the number of political activities the respondent reported partaking in. We will use the following variables:

cc403      vote
cc415_1  action - attend local political meetings
cc415_2  action - persuade someone to vote
cc415_3  action - put up political sign
cc415_4  action - work for a candidate or campaign
cc415_5  action - comment on political blog
cc415_6  action - donate money to candidate or campaign
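
If the data are not already loaded from a previous chapter, one way to read the Stata file into R is with the foreign package (a minimal sketch; haven::read_dta() would also work, and convert.factors = FALSE keeps the variables numeric, which is what the recodes below assume):

library(foreign)  # read.dta() reads Stata .dta files
dat <- read.dta("http://people.umass.edu/schaffne/cces08.dta", convert.factors = FALSE)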

We will need to recode these variables to binary (0 or 1) before adding them up to create our count:
# cc403 (vote): responses below 5 become 0; response 5 (voted) becomes 1
dat$cc403[dat$cc403 < 5] = 0
dat$cc403[dat$cc403 == 5] = 1

# cc415_1 to cc415_6: recode 2 (did not do the activity) to 0, so 1 means the respondent did it
dat$cc415_1[dat$cc415_1 == 2] = 0
dat$cc415_2[dat$cc415_2 == 2] = 0
dat$cc415_3[dat$cc415_3 == 2] = 0
dat$cc415_4[dat$cc415_4 == 2] = 0
dat$cc415_5[dat$cc415_5 == 2] = 0
dat$cc415_6[dat$cc415_6 == 2] = 0

Now, let’s add these together to create our count of political activities:

dat$activities <- dat$cc403 + dat$cc415_1 + dat$cc415_2 + dat$cc415_3 + dat$cc415_4 + dat$cc415_5 + dat$cc415_6 

We can get a quick sense of how this count variable is distributed by plotting its frequency table as a bar chart:

barplot(table(dat$activities))


Note that the modal number of activities engaged in is 2. Often, a count variable will have a very long tail. In this case, however, respondents could report engaging in at most 7 activities, so we do not have to deal with that issue.
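
Before estimating anything, it can also be useful to compare the mean and variance of the count; a variance well above the mean (overdispersion) is the usual motivation for the negative binomial model we fit below. A quick check (a small sketch):

mean(dat$activities, na.rm = TRUE)
var(dat$activities, na.rm = TRUE)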

Let’s estimate a model predicting how active individuals are in politics. We’ll use the same education variable from the previous chapters. However, let’s recode the party id and ideology variables to take into account whether someone identifies with a party and how strong their ideology is: 

# party id (cc307): keep one party coded 1, recode the other to -1,
# independents (3) to 0, and anything above 3 (other/not sure) to missing
dat$pid <- dat$cc307
dat$pid[dat$pid == 2] <- -1
dat$pid[dat$pid == 3] <- 0
dat$pid[dat$pid > 3] <- NA

# party identifier: 1 if the respondent identifies with either party, 0 if independent
dat$partyidentifier <- dat$pid
dat$partyidentifier[dat$partyidentifier == -1] = 1

# ideology strength: fold the 5-point ideology scale (v243) around its midpoint,
# so 3 = 0, 2 and 4 = 1, 1 and 5 = 2
dat$ideology_strength <- NA
dat$ideology_strength[dat$v243 == 3] = 0
dat$ideology_strength[dat$v243 == 2] = 1
dat$ideology_strength[dat$v243 == 4] = 1
dat$ideology_strength[dat$v243 == 5] = 2
dat$ideology_strength[dat$v243 == 1] = 2
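
Whenever you recode like this, it is worth spot-checking the new variable against the original with a quick cross-tab, for example:

table(dat$v243, dat$ideology_strength, useNA = "ifany")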


Note that this time, I populate the new variable with NA instead of copying the old variable (ideology). If I copied the old variable first, I would be writing over newly recoded values as I ran the lines above (for example, once 2 is recoded to 1, the line that recodes 1 to 2 would also change those new 1s). Starting from NA is always a good idea when you are recoding a lot of numbers into other numbers. Now, let’s estimate a Negative Binomial regression model using the MASS package (the same package we used for the ordinal logit):

library(MASS)
modelNB <- glm.nb(activities ~ education + partyidentifier + ideology_strength,
   data=dat)
summary(modelNB)


Ok, first things first. Note the results at the bottom of the output. R reports the dispersion parameter theta (θ), which is the inverse of the dispersion parameter (alpha, or α) reported by other software packages like Stata. Alpha is normally the statistic you’d want to include in a table, so to get alpha, take the inverse of theta: 1/28186 = 0.0000354786.
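
You can also pull theta directly from the fitted model object and invert it, rather than typing the number by hand (a small sketch; glm.nb() stores the estimate in the theta component of the model object):

alpha <- 1 / modelNB$theta  # convert R's theta to the alpha reported by Stata
alpha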

Now to interpreting the results. The coefficients are on the log scale, so you cannot interpret them on their own as changes in the count. However, you can rely on one of two methods. The first is the “incidence rate ratio” (IRR), which works much like an odds ratio. 

est <- cbind(coef(modelNB), confint(modelNB))  # coefficients and their 95% confidence intervals
exp(est)  # exponentiate to get incidence rate ratios


You can interpret these like odds ratios: each one is the multiplicative change in the expected count. For each additional level of education, the expected number of activities a respondent engaged in increases by a factor of 1.09, or by 9%, holding the other variables constant.
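
Equivalently, you can convert a rate ratio into a percent change directly from the model (a quick sketch, assuming the coefficient is labeled education in the output):

(exp(coef(modelNB)["education"]) - 1) * 100  # percent change in expected activities per level of education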

To get predicted counts of political activity for each level of education, follow the same steps as in previous chapters: use the predict function on a new data frame with all of the model variables set to the appropriate levels. The last line below stores the prediction output so that we have standard errors for the confidence intervals in our plot.

# new data frame: vary education from 1 to 6, hold the other predictors at their means
fdat <- with(dat, data.frame(education = 1:6,
                             partyidentifier = mean(partyidentifier, na.rm = TRUE),
                             ideology_strength = mean(ideology_strength, na.rm = TRUE)))
pred.act <- predict(modelNB, newdata = fdat, type = "response")
p.edu <- cbind(fdat, predict(modelNB, newdata = fdat, type = "response", se.fit = TRUE))

Now, let’s plot the relationship between education and political activity. This time I’ll use ggplot’s many plotting options to alter the aesthetics:

library(ggplot2)  # load ggplot2 if it is not already loaded

p1 <- ggplot(fdat, aes(x = education, y = pred.act))

p1 + geom_point(size = 3, colour = "darkgreen") +  # change color and size of points
  geom_line(colour = "darkgreen") +  # change color of line
  theme_bw() +  # change theme
  scale_x_continuous(breaks = 1:6) +  # label every level of education on the x axis
  expand_limits(y = 1) +  # start the y axis at 1
  geom_errorbar(aes(ymax = p.edu$fit + 1.96 * p.edu$se.fit,
                    ymin = p.edu$fit - 1.96 * p.edu$se.fit),
                width = 0.07, colour = "darkgreen") +  # add 95% confidence interval bars
  ylab("Predicted number of political activities")


This post is CHAPTER 4 of an R packet I am creating for a Survey Methods graduate seminar taught by Brian Schaffner at the University of Massachusetts Amherst. The seminar is usually only taught in Stata, so I translated all the exercises, assignments, and examples used in the course for R users. Other chapters include: Logit/Probit Models, Ordinal Logit/Probit Models, Multinomial Logit Models, Using Weights, Creating Post-Stratification Weights, Item Scaling, Matching/Balancing.

