+34 616 71 29 85 carsten@dataz4s.com

Residual plots

Residual plots can reveal violations of the model conditions that are hard to see from the regression line alone. At a glance, a residual plot gives an overall picture of the errors in the model and thus of whether the conditions for inference seem to be met. In essence, residual plots graph the conditions listed in the LINER model.

 

Key points on residual plots

  • Residual plots can reveal, at a glance, what is not obvious from looking at the regression line
  • Residual plots give a visual overview of the errors in a regression model
  • Residual plots make it possible to assess visually whether the conditions for inference seem to be met
  • Residual plots graph the conditions described in the LINER model

 

 

Reflection on the residuals

As described in Regression line, the true regression model can be written as:

Y = β0 + β1X + ε

Here Y depends linearly on X through β0 + β1X, and epsilon (ε) is the random error component: the observed datapoints vary in Y around the regression line, and this variation is random.

The estimate of this line can be denoted:

Ŷ = b0 + b1X

No epsilon (ε) appears in the estimated regression model, because the estimated line only gives the predicted values Ŷ. An estimate is not the “exact thing”: the observed datapoints deviate from it, and these deviations are the errors of the estimated line. So, the estimated regression line is composed of the estimated points, or, we can say that the estimated points are the line.

 

The errors in the estimated line, the residuals, are the vertical distances from each observed datapoint to the regression line, as I describe in Squared errors of line. They can be denoted: ei = Yi − Ŷi.

 

[Figure: Errors of line]
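As a minimal sketch of the residual definition ei = Yi − Ŷi, the following Python snippet fits a least-squares line and computes the residuals. The data are made up for illustration, and fit_line and residuals are hypothetical helper names, not functions from any library.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(xs, ys):
    """Residuals e_i = y_i - y_hat_i for the fitted least-squares line."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Made-up illustrative data (not from this article)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

e = residuals(xs, ys)
print([round(r, 3) for r in e])
```

Plotting these residuals against xs (or against the fitted values) gives the residual plot discussed in the rest of this article.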

 

  

Residual plots scenarios

In a residual plot the errors are displayed around their mean of 0, since the residuals of a least-squares line sum to zero: ∑ei = 0. The following examples display two scenarios: 1) inference seems possible and 2) inference does not seem possible.
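Why the residuals sum to zero can be sketched in a few lines. This assumes that the line is fitted by least squares and includes an intercept; the symbols b_0 and b_1 for the estimated intercept and slope are notation chosen here for illustration.

```latex
\begin{aligned}
\mathrm{SSE}(b_0, b_1) &= \sum_{i=1}^{n} \left(Y_i - b_0 - b_1 X_i\right)^2 \\
\frac{\partial\,\mathrm{SSE}}{\partial b_0} &= -2 \sum_{i=1}^{n} \left(Y_i - b_0 - b_1 X_i\right) = 0 \\
\Rightarrow \quad \sum_{i=1}^{n} e_i &= \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right) = 0
\end{aligned}
```

Setting the derivative with respect to the intercept to zero is what forces the residuals to balance out around 0.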

Residual plots => inference possible

Residual plot 1: The variability is more or less equal for each observed X value, and there is linearity and thus no curvature. There is no indication of non-normality. So, this example indicates that inference can be valid:

[Figure: Residual plot 1]

 

Residual plot 2: This residual plot also seems ‘fairly reasonable’, as the variability in the Y values seems to be approximately equal for all observed X values:

 

 

[Figure: Residual plot 2]

 

Residual plots => inference not possible

 

Residual plot 3: This residual plot leads to the conclusion that inference is not possible, because the variability in the Y values differs across the observed X values. The greater the X, the greater the variability in Y:

[Figure: Residual plot 3]
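A rough numerical counterpart to this visual check is to compare the spread of the residuals for small and large X values. The sketch below is only a heuristic under stated assumptions: the residuals are made up to fan out, and spread_by_half is a hypothetical helper, not a formal test for unequal variance.

```python
from statistics import pstdev


def spread_by_half(xs, res):
    """Return the residual standard deviation for the lower and upper half of X."""
    pairs = sorted(zip(xs, res))  # order the residuals by their X value
    mid = len(pairs) // 2
    low = [r for _, r in pairs[:mid]]
    high = [r for _, r in pairs[mid:]]
    return pstdev(low), pstdev(high)


# Made-up residuals that fan out as X grows (not data from this article)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
res = [0.1, -0.1, 0.3, -0.3, 0.8, -0.9, 1.5, -1.4]

low_sd, high_sd = spread_by_half(xs, res)
print(f"spread for small X: {low_sd:.2f}, spread for large X: {high_sd:.2f}")
```

A much larger spread in one half than in the other points in the same direction as the fan shape in the plot.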

 

 

Residual plot 4: This residual plot shows a clear curvature, which indicates that the relationship is not linear:

 

[Figure: Residual plot 4, curvature]
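Curvature can also be sketched numerically. Least-squares residuals are always uncorrelated with X itself, but with curved data they correlate with the squared distance of X from its mean. The example below uses made-up, deliberately curved data; fit_line and pearson are hypothetical helpers written for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((p - a_bar) * (q - b_bar) for p, q in zip(a, b))
    var_a = sum((p - a_bar) ** 2 for p in a)
    var_b = sum((q - b_bar) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


# Made-up, deliberately curved data: y = x^2
xs = list(range(1, 11))
ys = [x * x for x in xs]

b0, b1 = fit_line(xs, ys)
res = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

x_bar = sum(xs) / len(xs)
corr_with_x = pearson(res, xs)
corr_with_x_squared = pearson(res, [(x - x_bar) ** 2 for x in xs])
print(f"correlation with X: {corr_with_x:.3f}")
print(f"correlation with (X - mean)^2: {corr_with_x_squared:.3f}")
```

A correlation near zero with X combined with a strong correlation with the squared term is the numerical trace of the U-shape seen in the plot.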

 

 

 

 

Residual plot 5: This residual plot shows a pattern of non-linearity and non-normality:

 

[Figure: Residual plot 5, non-normality]

 

 

Example: A model that complies with the conditions

Here is an example of a scatter plot with its estimated regression line that looks ‘ok’ and seems to allow for statistical inference:

 

[Figure: Residual plots example, scatter plot with regression line]

 

The residual plot shows no marked systematic change in variability and no curvature:

[Figure: Glove size residual plot]

 

A quantile-quantile plot can be used to check for normality. Normality also seems to hold here, as the datapoints show a “reasonable” fit to the line:

[Figure: Quantile-quantile plot]
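The points of a normal quantile-quantile plot can be computed with the Python standard library. This is a sketch under assumptions: the residuals are made up, qq_points is a hypothetical helper, and the plotting positions (i − 0.5)/n are one common convention among several.

```python
from statistics import NormalDist


def qq_points(res):
    """Pair each sorted residual with the matching standard normal quantile."""
    n = len(res)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    ordered = sorted(res)
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Made-up residuals (not data from this article)
res = [-1.2, 0.4, -0.3, 0.9, 0.1, -0.6, 0.7]

points = qq_points(res)
for theoretical, observed in points:
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the residuals are roughly normal, plotting the observed values against the theoretical quantiles gives points that lie close to a straight line.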

 

So, for this model, we would accept the conditions and proceed with the statistical analysis for inference.

 

 

Example: “Revealed” by the residual plot

This example returns a regression line with a strong fit that also looks ok at first, but where the residual plot reveals a different picture:

About the dataset

The following example is based on the dataset of Janka hardness vs. density for 36 Australian trees. I was inspired by the JBstatistics video Checking Assumptions with Residual Plots and obtained the dataset via the PASWR2 package for R statistical programming.

For these trees, density is relatively easy to measure, whereas hardness is difficult to measure.

 

Regression line and r²

If we assess that the conditions for inference are met, we can predict the hardness by simply measuring the density. We can predict the values that are difficult to measure by observing the values that are easy to measure.

Based on the 36 datapoints, the regression model gives a very strong fit with a coefficient of determination (r²) of some 95%:

 

[Figure: Australian trees regression line]

 

 

Looking at the regression line, it seems reasonable to think that this model complies with the conditions for statistical inference. However, the residual plot reveals a different picture:

[Figure: Australian trees residual plot]

 

 

The residual plot shows both that the variability differs across the X values and that there is curvature:

[Figure: Australian trees checking residual plot]
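That a high r² does not by itself confirm the conditions can be reproduced with made-up numbers. The sketch below does not use the tree dataset: it fits a straight line to perfectly curved data (y = x²) and still gets a high r². The helper r_squared is a hypothetical name introduced for this illustration.

```python
def r_squared(xs, ys):
    """Coefficient of determination, 1 - SSE/SST, for a least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    sst = sum((y - y_bar) ** 2 for y in ys)                      # total variation
    return 1 - sse / sst


# Made-up, deliberately curved data: y = x^2 for x = 1..10
xs = list(range(1, 11))
ys = [x * x for x in xs]

r2 = r_squared(xs, ys)
print(f"r squared of the straight-line fit: {r2:.3f}")
```

The straight line explains most of the variation in these numbers, yet the relationship is not linear at all, which is why the residual plot is worth checking in addition to r².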

 

And if we take another look at the regression line, we can perhaps discern these patterns:

[Figure: Australian trees hardness vs. density regression]

 

So, what we might not initially perceive from the scatter plot with the regression line was revealed by the residual plot.

In cases like this, where the model is not appropriate, we can improve the model by adding an x² term that could help fit a curve through the datapoints, or by transforming the data in other ways to achieve linearity.
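Adding an x² term can be sketched as follows. This is one possible way to do it, not necessarily the approach used for the tree data: it solves the least-squares normal equations for y = b0 + b1·x + b2·x² in plain Python. The data are made up, and solve and fit_quadratic are hypothetical helpers.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_quadratic(xs, ys):
    """Least-squares estimates (b0, b1, b2) for y = b0 + b1*x + b2*x^2."""
    cols = [[1.0] * len(xs), [float(x) for x in xs], [float(x * x) for x in xs]]
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * y for p, y in zip(ci, ys)) for ci in cols]
    return solve(xtx, xty)


# Made-up curved data generated from y = 2 + 0.5x + 3x^2
xs = list(range(1, 11))
ys = [2 + 0.5 * x + 3 * x * x for x in xs]

b0, b1, b2 = fit_quadratic(xs, ys)
res = [y - (b0 + b1 * x + b2 * x * x) for x, y in zip(xs, ys)]
print(f"b0={b0:.3f}, b1={b1:.3f}, b2={b2:.3f}")
print(f"largest absolute residual: {max(abs(r) for r in res):.2e}")
```

With the x² term included, the residuals of this made-up example no longer show any curvature. With real data the residual plot of the extended model should be checked again in the same way.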

 

Residual plots in Excel

By ticking the options under Residuals in Data >> Data Analysis >> Regression, Excel produces the residual plots:

[Figure: Residual plots in Excel]

 

 

Learning resources

I find these learning resources helpful for learning about residual plots:

Carsten Grube

Freelance Data Analyst
