Open data: the General Social Survey (40 years of explorable data about American society)

Since 1972, the General Social Survey has asked thousands of Americans 90 minutes of questions about religion, culture, beliefs, sex, politics, family, and a lot more, every year or two. The resulting dataset has been cited in more than 14,000 academic papers, books, and dissertations, and it’s routinely cited in the news.

We’ve loaded it into Statwing, our easy-to-use data analysis tool, so anyone can explore it ridiculously easily.

Try it out.


You should really just try it out, but if you’re still here, you can see some examples of GSS analyses in Statwing about marijuana legalization, political party identification, and attitudes about education funding.

Or, here’s another example analysis: Are Americans today better educated than Americans 40 years ago?

To answer that in Statwing you’d click “Year”, “Highest grade completed”, then Statwing’s “Relate” function. Then filter the data down to only 1972 and 2012, and ages 30 to 40, and get these results:

[Chart: Time vs. Education]

This image is static, but in Statwing you can hover to investigate, as per the blue box shown above.

You’d also get basic statistical output in the form of this sentence:

On average, 30-40 year-olds now have 1.5 more years of education than 30-40 year-olds in 1972.

And if you want to dive deeper, you can also see the advanced statistical output:

[Screenshot: advanced statistical output]

If you’d prefer to play with the dataset in other software, here it is for download.
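
If you do take the data elsewhere, the education comparison above is easy to reproduce. Here is a minimal Python sketch of the same filter-then-compare idea. It is not how Statwing computes its output, the column names "year", "age", and "educ" are assumptions rather than the names in the download, and the rows are toy values, not GSS data.

```python
from statistics import mean

# Toy rows for illustration only; these are NOT real GSS responses.
# The keys "year", "age", and "educ" are hypothetical column names.
rows = [
    {"year": 1972, "age": 34, "educ": 11},
    {"year": 1972, "age": 38, "educ": 12},
    {"year": 1972, "age": 52, "educ": 8},   # dropped by the age filter
    {"year": 2012, "age": 31, "educ": 14},
    {"year": 2012, "age": 40, "educ": 13},
    {"year": 1990, "age": 35, "educ": 16},  # dropped by the year filter
]


def mean_education(rows, year, min_age=30, max_age=40):
    """Mean years of education for one survey year and an inclusive age range."""
    values = [
        r["educ"]
        for r in rows
        if r["year"] == year and min_age <= r["age"] <= max_age
    ]
    return mean(values)


def education_gap(rows, early_year=1972, late_year=2012):
    """Difference in mean education between the later and the earlier year."""
    return mean_education(rows, late_year) - mean_education(rows, early_year)


print(f"Difference in mean years of education: {education_gap(rows):.1f}")
```

With the real download you would load the rows from the file and swap in its actual column names; a plain difference in means like this gives you the headline number, not the advanced statistical output.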

To understand the dataset quickly, you can scroll through this workspace in Statwing, where we’ve summarized all 406 variables using Statwing’s “Describe” function (don’t use this link for your own analysis; your work won’t save).


Methodological notes:

  • Short version: the GSS is very sound, and there’s no reason to be concerned about the robustness or validity of its results. Long version:
  • Many of the questions in the GSS were asked for only a few years. We’ve only selected variables that were asked in 2012 (though for most of those questions the data goes back quite a way). We also left out a very small number of 2012 variables that seemed to require an undue amount of work to organize. The full list of variables available is in the GSS Codebook (36 MB), and you can bulk download data here.
  • The dataset was actually annual for a long time, then became biennial in 1994. To give equal spacing to charts over time, we’ve included only every other year going back all the way to 1972; the only exception is that there was no GSS in 1992, so we included 1993’s data in its place. Using every other year probably sacrificed a very small amount of statistical power, but all findings would be essentially the same.
  • For most of the GSS, race was assessed by the interviewer as “Black”, “White”, or “Other”. Beginning in 2002, the GSS started using the Census version of the question, which allows respondents to select multiple races from a much wider list, and considers “Hispanic” to be an ethnicity, not a race (so Hispanics also choose “White”, “Black”, etc.). The GSS imputed the value for the original version of the question, so that bucket goes back to 1972. The Census version of the question is a bit unwieldy, so for practical purposes we condensed it into one variable, where anyone with multiple ethnicities is labelled “(Mixed)”, and “Hispanic” is considered to be a race. We feel a bit uncomfortable with taking that leap, so we apologize if that seems inappropriate to anyone.
  • The variable “Age” has an “89 or older” bucket; we condensed all of those answers into 89 so that the data could be analyzed as numbers. Similarly, we condensed the “8 or more” bucket for number of children down into 8.
  • Income-related variables are imputed by the GSS, because they were previously asked in bands like “$50k to $60k” and because of adjustments for inflation. That’s why the distributions of those variables will look a little odd. All income numbers are in 2014 dollars.
  • The survey randomly selects households, not individuals, so individuals who live in households with many people are very slightly underrepresented. This doesn’t have much practical impact.
  • The dataset slightly overrepresented women, so we filtered out a randomly selected set of cases so that every year’s Male:Female ratio was near 0.97:1, as per the American Community Survey. The 1982 survey oversampled black respondents, so we also randomly eliminated some of those cases.
  • We occasionally recoded variables into new, condensed versions. For example, the GSS response options for the liberal to conservative spectrum were “Extremely liberal”, “Liberal”, “Slightly liberal”, “Moderate”, “Slightly conservative”, and so on. We condensed them because this often makes charts and findings clearer and easier to explain.
  • In sum, you can’t say with 100% confidence “X% of Americans believe Y”, but you can get very close, and the trends and relationships are valid even if the absolute numbers are slightly different from a perfect poll. Of course, no poll is 100% perfect anyway.
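
If you’re working from the bulk download and want to approximate a few of the steps above, here’s a rough Python sketch of three of them: collapsing an open-ended top bucket into a number, picking every other survey year with 1993 standing in for 1992, and randomly dropping cases to reach a target Male:Female ratio. This is not our actual processing code. The function names, the "sex" key, and the "Male"/"Female" values are hypothetical, and the downsampling is meant to be run on one survey year’s rows at a time.

```python
import random


def to_number(raw, top_label, cap):
    """Turn an answer into a number, collapsing the open-ended top bucket.

    For example, to_number("89 or older", "89 or older", 89) gives 89.
    """
    return cap if raw == top_label else int(raw)


def survey_years(first=1972, last=2012, missing=(1992,)):
    """Every other year, using the following year where no survey was run."""
    return [y + 1 if y in missing else y for y in range(first, last + 1, 2)]


def downsample_to_ratio(rows, target_ratio=0.97, seed=0):
    """Randomly drop cases from one group until men:women is near target_ratio.

    Assumes each row is a dict with a hypothetical "sex" key holding
    "Male" or "Female". Apply it separately to each survey year.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    men = [r for r in rows if r["sex"] == "Male"]
    women = [r for r in rows if r["sex"] == "Female"]
    if not men or not women:
        return list(rows)
    if len(men) / len(women) < target_ratio:
        # Too many women: keep only as many as the target ratio allows.
        keep = min(len(women), round(len(men) / target_ratio))
        women = rng.sample(women, keep)
    else:
        # Too many men: same idea in the other direction.
        keep = min(len(men), round(len(women) * target_ratio))
        men = rng.sample(men, keep)
    return men + women


years = survey_years()
print(years)
```

Randomly dropping cases is the approach described above; weighting respondents would be a different technique and isn’t what this sketch does.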
