Precautions in simple linear regression
Simple linear regression calls for a few precautions, for example when interpreting the regression equation, reading the plot of the regression line, or making inferences about the correlation between X and Y.
Anscombe’s Quartet: Always start by plotting
Looking at the regression model alone is not enough, because the same regression line can describe completely different relationships. Anscombe’s Quartet illustrates this with four data sets that have nearly identical regression lines and summary statistics (means, variances, correlation) but look completely different when plotted.
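As a minimal sketch of that point, the snippet below fits an ordinary least-squares line to each of the four data sets. The numbers are Anscombe's published 1973 values as they are commonly reproduced, so it is worth checking them against a trusted copy. The helper name `fit_line` is an assumption made for this illustration.

```python
# Anscombe's Quartet: four small data sets whose least-squares lines
# are almost the same, even though their scatter plots are not.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


fits = {name: fit_line(x, y) for name, (x, y) in quartet.items()}
for name, (slope, intercept) in fits.items():
    print(f"Data set {name}: y = {intercept:.2f} + {slope:.2f} x")
```

All four fits come out at roughly an intercept of 3 and a slope of 0.5, which is why the fitted equation alone cannot tell the data sets apart. Only a scatter plot of each set reveals the curve, the outlier and the single high-leverage point.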
It can be tempting to extend a regression line beyond the observed X values. Take this example:
The line seems to be a very good fit, so it is tempting to continue the model beyond the observed X values. Doing so is called extrapolation and should be avoided, because we do not know how the relationship behaves outside the observed range. Take the plot above: the rest of the data follows a completely different pattern:
As illustrated, we were shown only part of the full picture. Had we extrapolated from the part we were given, our predictions would have been completely wrong.
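The danger of extrapolating can be sketched with a small invented example. The curve `true_y` below is a hypothetical "full" relationship chosen only for illustration, not the data behind the plots in this post. The line is fitted to the part of the curve where X runs from 0 to 5 and is then asked to predict far outside that range.

```python
def true_y(x):
    # Hypothetical full relationship, invented for this sketch:
    # it rises, peaks at x = 10 and then falls again.
    return 100 - (x - 10) ** 2


# We only get to observe the left-hand part of the curve.
observed_x = [0, 1, 2, 3, 4, 5]
observed_y = [true_y(x) for x in observed_x]


def fit_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(observed_x, observed_y)


def predict(x):
    """Prediction from the line fitted on the observed range only."""
    return intercept + slope * x


# Compare the line with the curve inside and outside the observed range.
for x in (3, 5, 10, 20):
    print(f"x = {x}: line predicts {predict(x):.1f}, curve gives {true_y(x)}")
```

Inside the observed range the line stays within a few units of the curve. Outside it the two drift apart quickly, and the gap keeps growing the further the line is extended.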
Correlation does not imply causation
This is a made-up example of the relationship between households with at least two cars (X) and life expectancy for women (Y):
It seems logical that we cannot increase a population's life expectancy by shipping loads of cars to people's homes. There is an underlying effect: at the time the data was collected, the number of cars per household was correlated with wealth, and wealth was the real driver. Life expectancy was higher in wealthier countries.
Establishing causation requires well-designed experiments. Correlation alone does not imply causation.
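A small simulation can make the underlying effect concrete. Everything below is simulated under stated assumptions: `wealth`, `cars` and `life` are invented, standardized variables, and wealth is built in as the only driver of the other two. The sketch compares the plain correlation between cars and life expectancy with the partial correlation that holds wealth fixed.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible
n = 2000

# Simulated, standardized values (not real data): wealth drives both
# car ownership and life expectancy. Cars have no direct effect on life.
wealth = [random.gauss(0, 1) for _ in range(n)]
cars = [w + random.gauss(0, 0.5) for w in wealth]
life = [w + random.gauss(0, 0.5) for w in wealth]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r_cars_life = pearson(cars, life)
r_cars_wealth = pearson(cars, wealth)
r_life_wealth = pearson(life, wealth)

# Partial correlation of cars and life expectancy, controlling for wealth.
partial = (r_cars_life - r_cars_wealth * r_life_wealth) / math.sqrt(
    (1 - r_cars_wealth ** 2) * (1 - r_life_wealth ** 2)
)

print(f"correlation(cars, life)          = {r_cars_life:.2f}")
print(f"partial correlation given wealth = {partial:.2f}")
```

In this setup the plain correlation is strong, while the partial correlation is close to zero once wealth is taken into account. With real data the underlying variable is often unknown or unmeasured, which is why a well-designed experiment is needed before claiming a causal effect.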
Precautions in simple linear regression – summarizing
Precautions in simple linear regression include:
- plotting the data, including all observations (X)
- not extrapolating
- keeping underlying effects in mind and remembering that correlation does not imply causation