Influential points in simple linear regression are points that, when removed from the calculation, cause a ‘great’ change in the regression line. The term ‘influential points’ is typically applied when assessing outliers. Influential points typically have high leverage (extreme in X) and/or a high residual (extreme in Y).
About this chapter
This chapter is basically a walk-through of Jeremy Balka’s video ‘Leverage and influential points in simple linear regression’, where the concepts of influential points and leverage are explained in theory and by example. The graphs below are hand-drawn for illustrative purposes only.
Outlier with little influence
I find that influential points and leverage in SLR are best explained by visualizing them in plots:
The residual is the vertical distance (in Y) between a point and the regression line. If a point lies ‘far’ above or below the line compared to most points, it has a high residual.
Outlier 1 has a large residual because it lies at a ‘greater’ vertical distance from the regression line than the other points. As the graph illustrates, the outlier has some influence on the regression line.
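As a minimal sketch of the residual idea, the following Python snippet fits a least-squares line with NumPy and computes each point's residual. The data values are hypothetical and were chosen only so that the last point sits far above the line at an unremarkable X value, much like Outlier 1.

```python
import numpy as np

# Hypothetical data: five points close to a line plus one point far above it.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 3.0])
y = np.array([2.1, 3.9, 6.0, 8.1, 9.9, 12.0])

# Least-squares fit of y = intercept + slope * x.
slope, intercept = np.polyfit(x, y, deg=1)

# Residual = observed Y minus the Y predicted by the line.
residuals = y - (intercept + slope * x)

# Index of the point with the largest absolute residual.
largest = int(np.argmax(np.abs(residuals)))
print(f"largest residual at index {largest}: {residuals[largest]:.2f}")
```

The last point has X equal to the mean of X, so it is extreme in Y only.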
Outlier with more influence
Outlier 2 has greater influence on the regression line:
Outlier 2 is extreme in X, which is expressed as high leverage. At the same time, it has a high residual (extreme in Y). This combination of high leverage and high residual makes Outlier 2 highly influential. Outlier 3, by contrast, has high leverage but a low residual, so it has little influence.
Degree of influence: How influential a data point is depends on how extreme it is in both X and Y. In an exaggerated interpretation, it could be explained like this:
- Extreme X, Extreme Y => High influence
- Extreme X, not extreme Y => Low influence
- Not extreme X, Extreme Y => Some influence
- Not extreme X, Not extreme Y => No influence
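The four cases above can be tried out numerically. The sketch below adds one extra point to a small, hypothetical data set and measures how far the fitted line moves over the range of the original data. The data, the `fitted_line` helper and the "largest vertical move" measure are assumptions made for illustration, and ‘extreme Y’ is read as ‘far from the line that the other points follow’.

```python
import numpy as np

# Hypothetical base data lying roughly on the line y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.0, 10.0])


def fitted_line(xs, ys, grid):
    """Fit a least-squares line to (xs, ys) and evaluate it on grid."""
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return intercept + slope * grid


# One extra point (x, y) per case.
cases = {
    "extreme X, extreme Y": (15.0, 10.0),
    "extreme X, not extreme Y": (15.0, 30.0),
    "not extreme X, extreme Y": (3.0, 16.0),
    "not extreme X, not extreme Y": (3.0, 6.0),
}

base = fitted_line(x, y, x)
shift = {}
for label, (px, py) in cases.items():
    with_point = fitted_line(np.append(x, px), np.append(y, py), x)
    # Largest vertical move of the line over the range of the base data.
    shift[label] = float(np.max(np.abs(with_point - base)))

for label, value in shift.items():
    print(f"{label:30s} line moves by up to {value:.2f}")
```

With these particular numbers the ordering follows the list: the point that is extreme in both X and Y moves the line the most, and the point that is extreme in neither barely moves it.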
These two graphs make it possible to visualize the 3 outliers together:
The following graph helps to visualize the degrees of influence:
How to handle influential points?
For data points that are highly influential, we should first make sure that the observation was recorded correctly. If it turns out to be correct, we should remove it from the data, refit the line, and see just how much it influences the regression line. Then ask the question: does it make us change the conclusion?
We should not let a single data point, or even a few data points, change our statistical conclusion. If it does change the conclusion, we should try to get more observations near that X value.
If we cannot get more observations at or close to the given X value, and the point does change our statistical conclusion, we can consider producing two reports: one with the influential point and one without.
In the report without it, the influential point is excluded from the analysis but is still reported. Excluding the influential point from the analysis can carry a high risk of mistakes.
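The "fit with and without the point" check can be sketched in a few lines of Python. The data set, the `suspect` index and the `summarize` helper are hypothetical and serve only to show the side-by-side comparison; which summary statistics matter for the conclusion depends on the analysis at hand.

```python
import numpy as np

# Hypothetical data: five points with no clear trend plus one point far out in X.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([5.1, 4.8, 5.2, 4.9, 5.0, 15.0])
suspect = 5  # index of the point under scrutiny


def summarize(xs, ys):
    """Return slope, intercept and correlation for a simple linear fit."""
    slope, intercept = np.polyfit(xs, ys, deg=1)
    r = np.corrcoef(xs, ys)[0, 1]
    return {"slope": float(slope), "intercept": float(intercept), "r": float(r)}


# Report 1: all observations. Report 2: the suspect point left out.
with_point = summarize(x, y)
without_point = summarize(np.delete(x, suspect), np.delete(y, suspect))

print("with point:   ", with_point)
print("without point:", without_point)
```

In this made-up example the single point turns an essentially flat relationship into a strong positive one, which is exactly the situation where both fits should be reported.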
Highly influential point with low residuals
Sometimes one or more influential points have such a strong influence on the regression line that they pull the line toward themselves:
This graph shows that a highly influential data point can have a low residual, here lying close to the mean of Y (ȳ), so we cannot conclude that all influential points lie far from the line with large residuals.
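A small numerical sketch of this effect, using hypothetical data rather than the points in the graph: the point far out in X has a small residual against the line fitted to all points, because it has pulled that line toward itself, but a very large residual against the line fitted to the remaining points.

```python
import numpy as np

# Hypothetical data: the last point is far out in X.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([5.1, 4.8, 5.2, 4.9, 5.0, 15.0])

# Fit using all points, including the one far out in X.
slope_all, intercept_all = np.polyfit(x, y, deg=1)
residuals_all = y - (intercept_all + slope_all * x)

# Fit using only the first five points, then measure how far the
# last point lies from that line.
slope_rest, intercept_rest = np.polyfit(x[:-1], y[:-1], deg=1)
residual_vs_rest = y[-1] - (intercept_rest + slope_rest * x[-1])

print("residual of last point, all-points fit:", round(float(residuals_all[-1]), 2))
print("residual of last point, fit without it:", round(float(residual_vs_rest), 2))
```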
Influential points and leverage via math
The leverage can be measured with this formula:
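For reference, the standard definition of leverage in simple linear regression (assumed here to be the one intended) is:

```latex
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}
```

Here n is the number of observations, x_i is the X value of observation i, and x̄ is the mean of X.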
The formula shows that the closer the Xi value falls to the mean of X, the lower the leverage. Different indicators can be used to measure influence, such as Cook’s distance (Wikipedia page for: Cook’s distance).
Cook’s distance measures how much the regression line changes when a point is removed from the calculation. More precisely, it summarizes how much all the fitted values change when that point is left out.
These measures help us decide which data points deserve “some extra” attention. Cook’s distance is usually provided in statistical software.
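As a rough sketch of what such software computes, the snippet below calculates leverage and Cook’s distance for a simple linear regression with plain NumPy. The data are hypothetical, and the closed-form expression used for Cook’s distance is the usual textbook one. The helper `cooks_by_deletion` is added only to check that the shortcut agrees with actually refitting the line without each point.

```python
import numpy as np

# Hypothetical data: the last point is far out in X.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([5.1, 4.8, 5.2, 4.9, 5.0, 15.0])
n, p = len(x), 2  # p = number of fitted coefficients (intercept and slope)

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# Leverage: grows with the squared distance of x_i from the mean of X.
leverage = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# Mean squared error of the fit, with n - p degrees of freedom.
mse = np.sum(residuals ** 2) / (n - p)

# Cook's distance from residuals and leverage (no refitting needed).
cooks_d = residuals ** 2 / (p * mse) * leverage / (1.0 - leverage) ** 2


def cooks_by_deletion(i):
    """Cook's distance for point i, computed by refitting without it."""
    s, b = np.polyfit(np.delete(x, i), np.delete(y, i), deg=1)
    fitted_without = b + s * x  # predictions for all n points
    return np.sum((fitted - fitted_without) ** 2) / (p * mse)


for i in range(n):
    print(f"point {i}: leverage={leverage[i]:.3f}  Cook's D={cooks_d[i]:.3f}")
```

The point far out in X has by far the highest leverage and the largest Cook’s distance in this example, even though its residual is small.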
The JBstatistics video is, as mentioned, Jeremy Balka’s video which has been highly influential for this chapter – thanks Jeremy!
- JBstatistics (video 7:13): Leverage and influential points in simple linear regression
- Onlinestatbook (text page): Influential observations
- Stattrek (text page with embedded video): Influential points