R Tutorials‎ > ‎

Logit/Probit Models in R

posted Feb 28, 2016, 9:21 AM by Mia Costa   [ updated Mar 14, 2016, 10:56 AM ]
For this example, we'll use the 2008 Cooperative Congressional Election Study. You can download it here from Brian Schaffner's website (http://people.umass.edu/schaffne/cces08.dta). Download the file, read it into R using the foreign package, and recode some of the variables so that they will be appropriately coded: 

setwd("Your Working Directory Here") library(foreign) dat <- read.dta("cces08.dta", convert.factors = F) # convert.factors brings in
the variable values instead of the labels (1 instead of support) 
dat$stemcell <- dat$cc316c 
dat$stemcell[dat$stemcell=="2"] = 0 
dat$stemcell[dat$stemcell=="3"] = NA 
dat$pid <- dat$cc307 dat$pid[dat$pid=="2"] <- -1 
dat$pid[dat$pid=="3"] <- 0 dat$pid[dat$pid >"3"] <- NA
dat$ideology <- dat$v243 dat$ideology[dat$ideology=="6"] <- NA
dat$education <- dat$v213

The glm function in R fits generalized linear models. Specify the model how you normally would using the lm function. Then specify the the type of model to be fit with the family argument. For logistic regression, we’ll use the “binomial” family. See ?family for other types. 

model <- glm(stemcell ~ pid + ideology + education, data=dat, family=binomial()) 

It is important to keep in mind that the coefficients are not easily interpreted in the way that OLS coefficients are. To get odds ratios, exponentiate the model coefficients and confidence intervals: 

exp(cbind(coef(model), confint(model))) 

You can interpret the odds ratios a bit more easily. For example, you could interpret the odds ratio on education to mean that for each additional unit increase on the scale education, an individual’s odds of supporting stem cell research are 1.16 times greater. Or, perhaps slightly more intuitively, for each additional education category, an individual’s odds of supporting stem cell research increase by 16%. You can ignore the intercept for odds ratios.

Now, let’s return to interpreting those coefficients. While odds ratios are fairly easy to calculate, they probably aren’t the best way to talk about the effects of the coefficients. More ideal and easy to understand are predicted probabilities. You can generate and plot these using the predict command.

First, predict will return a list of predicted probabilities under different conditions that you can specify. For example, if you wanted to know the probability that an individual who is “very liberal” on the ideology measure and average on the other covariates would support funding for stem cell research you would first create a new dataframe with those new conditions, and then create a new stemcell variable with the predicted probability of support: 

# create new data frame with ideology at 1 and other variables at means newdata <- with(dat, data.frame(ideology = 1, stemcell = mean(stemcell, na.rm=TRUE), pid=mean(pid, na.rm=TRUE), education=mean(education)))

# get predicted probability of stemcell 
newdata$stemcellP <- predict(model, newdata = newdata, type = "response")


Note that there is a .98 probability that this individual would support funding for stem cell research.

Now what if you want to plot the predicted probabilities across all values of ideology? To do this, recreate the data frame you made above but with ideology at every value between 1 and 5 in 1 point increments. In this new dataset (which I’m calling “fdat”), all the other model variables will be held at their means. # recreate dataframe with ideology at all levels and other variables at means 

fdat <- with(dat, data.frame(ideology = rep(c(1:5), 1), pid=mean(pid, na.rm=TRUE), education=mean(education)))

# get predicted probabilities fdat$stemcellP <- predict(model, newdata = fdat, type = "response")

Note that I stored the predicted probabilities into a new variable (stemcellP) within the dataframe. Now to make a simple plot:

p1 <- ggplot(fdat, aes(x = as.numeric(ideology), y = stemcellP), group=fdat)
p1 + geom_point() + geom_line() + xlab("Ideology") + ylab("Prob. of supporting funding for stem cell research") + ggtitle("Effect of Ideology on Stem Cell Support")

This post is CHAPTER 1 of an R packet I am creating for a Survey Methods graduate seminar taught by Brian Schaffner at the University of Massachusetts Amherst. The seminar is usually only taught in Stata, so I translated all the exercises, assignments, and examples used in the course for R users. Other chapters include: Ordinal Logit/Probit Models, Multinomial Logit Models, Count Models, Using Weights, Creating Post-Stratification Weights, Item Scaling, Matching/Balancing.