The Easiest Data Analysis Mistake to Make

Consider 4 datasets, each with an x and a y variable:

Dataset   Mean of x   Variance of x   Mean of y   Variance of y   Correlation (Pearson’s r)
1         9.00        11.00           7.50        4.12            0.816
2         9.00        11.00           7.50        4.12            0.816
3         9.00        11.00           7.50        4.12            0.816
4         9.00        11.00           7.50        4.12            0.816

Can you say with confidence that these datasets are essentially the same? Do you feel like you have a good sense of them?

You shouldn’t. Here are those same 4 datasets (known as Anscombe’s Quartet) presented visually:

[Figure: Anscombe's Quartet, four scatter plots, one for each of datasets 1 to 4]

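These are four very different datasets with nearly identical summary statistics. Dataset 1 is a noisy linear trend, dataset 2 is a smooth curve, dataset 3 is a tight line with one outlier, and dataset 4 has the same x value for every point but one. If you made the very common mistake of analyzing without visualizing, you’d dramatically misunderstand them.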
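
If you want to check this yourself, here is a minimal Python sketch. The data values are not given in this post; they are the ones published in Anscombe's 1973 paper, and the names QUARTET, pearson_r, summarize, SUMMARY and plot_quartet are made up for this illustration. The statistics part uses only the standard library, while the optional plot_quartet function assumes matplotlib is installed.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's Quartet, as published by Anscombe (1973).
# Datasets 1 to 3 share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(ssx * ssy)


def summarize(xs, ys):
    """Return the summary statistics from the table (sample variances)."""
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "r": pearson_r(xs, ys),
    }


SUMMARY = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}


def plot_quartet():
    """Scatter-plot all four datasets on shared axes (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
    for ax, (name, (xs, ys)) in zip(axes.flat, QUARTET.items()):
        ax.scatter(xs, ys)
        ax.set_title(f"Dataset {name}")
    plt.show()


if __name__ == "__main__":
    for name, s in SUMMARY.items():
        print(
            f"Dataset {name}: mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
            f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} r={s['r']:.3f}"
        )
```

The printed numbers should land very close to the table above, though the last digit may differ slightly depending on rounding. Calling plot_quartet() afterwards is what reveals how different the four datasets really are.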


