The Easiest Data Analysis Mistake to Make

Consider 4 datasets, each with an x and a y variable:

Dataset   Mean of x   Variance of x   Mean of y   Variance of y   Correlation (Pearson’s r)
1         9.00        11.00           7.50        4.12            0.816
2         9.00        11.00           7.50        4.12            0.816
3         9.00        11.00           7.50        4.12            0.816
4         9.00        11.00           7.50        4.12            0.816
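
If you want to check the summary statistics above yourself, here is a minimal sketch using only Python's standard library. The data values are the standard published Anscombe's Quartet values, not taken from this post's download link; the names `QUARTET`, `pearson_r`, and `summarize` are illustrative choices. Variance here is the sample variance (dividing by n - 1), which is an assumption about how the table was computed.

```python
from math import sqrt
from statistics import mean, variance

# Standard published values for Anscombe's Quartet.
# Datasets 1-3 share the same x values; dataset 4 has its own.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson's correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(xs, ys):
    """Return the five summary statistics from the table (sample variance)."""
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "r": pearson_r(xs, ys),
    }


for name, (xs, ys) in QUARTET.items():
    s = summarize(xs, ys)
    print(f"{name}: mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
          f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} r={s['r']:.3f}")
```

The printed values should agree with the table to within rounding in the last digit. The four rows are not exactly identical, they only look that way at this precision.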

Can you say with confidence that these datasets are essentially the same? Do you feel like you have a good sense of them?

You shouldn’t. Here are those same 4 datasets (known as Anscombe’s Quartet), presented visually:

[Figures: Anscombe's Quartet, datasets 1 through 4]

These are four very different datasets with nearly identical summary statistics. If you made the common mistake of analyzing without visualizing, you’d badly misunderstand them.
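
To see the difference without any plotting library, here is a rough sketch that draws a crude text scatterplot of each dataset. In practice you would reach for a real plotting tool; the function name `ascii_scatter`, the grid size, and the axis ranges are illustrative assumptions, and the data values are the standard published Anscombe's Quartet values rather than this post's download.

```python
# Standard published values for Anscombe's Quartet.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ascii_scatter(xs, ys, width=40, height=12, x_range=(0, 20), y_range=(0, 14)):
    """Return a crude text scatterplot; each '*' marks at least one point."""
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each coordinate onto the character grid.
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((y - y_lo) / (y_hi - y_lo) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


for name, (xs, ys) in QUARTET.items():
    print(f"Dataset {name}")
    print(ascii_scatter(xs, ys))
    print()
```

Even at this resolution the shapes differ visibly. For example, dataset 4 collapses into a single vertical column of points plus one outlier far to the right, which none of the summary statistics reveal.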

 

Here’s that data for download and here’s the data in Statwing, our easy-to-use data analysis tool. 


Plug: Statwing automatically visualizes your data any time you run an analysis, so you can’t make errors by skipping the visualization step. And if you have an app that produces or collects data for your users, you can use our API to build an Export to Statwing link so your users can get more value out of their data. 

 
