+34 616 71 29 85 carsten@dataz4s.com

Influential points

Influential points in simple linear regression are points that, when removed from the calculation, cause a ‘great’ change in the regression line. The term ‘influential points’ is typically applied when assessing outliers. Influential points typically have high leverage (extreme in X) and/or a high residual (extreme in Y).

 

About this chapter

This chapter is basically a walk-through of Jeremy Balka’s video ‘Leverage and influential points in simple linear regression’, where the concepts of influential points and leverage are explained in theory and by examples. The graphs below are hand drawn for illustrative purposes only.

 

Outlier with little influence

I find that influential points and leverage in SLR are best explained by visualizing them in plots:

Influential points graph

 

Residual is another term. It refers to the vertical distance (in Y) between a point and the regression line. If a point lies ‘far’ above or below the line, compared to most other points, it has a high residual.

Outlier 1 has a large residual, as it lies at a ‘great’ vertical distance from the regression line. As the graph illustrates, the outlier has some influence on the regression line.
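To make the idea of a residual concrete, here is a minimal sketch in Python with NumPy. The data are made up for illustration (they are not taken from the graphs): five points close to a straight line, plus one point that sits far above the line at a central X value.

```python
import numpy as np

# Hypothetical data: five points close to y = 2x, plus one outlier in Y at x = 3.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 3.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9, 12.0])

# Least-squares line; np.polyfit returns [slope, intercept] for degree 1.
slope, intercept = np.polyfit(x, y, 1)

# Residual = observed Y minus the Y predicted by the line.
residuals = y - (intercept + slope * x)

# Index of the point with the largest vertical distance from the line.
largest = int(np.argmax(np.abs(residuals)))

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print("residuals:", np.round(residuals, 2))
print("largest residual at index", largest)
```

The last point (index 5) gets by far the largest residual, while the residuals of the other points are small and of similar size.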

 

Outlier with more influence

Outlier 2 has greater influence on the regression line:

High leverage_high residual_high influence

 

Outlier 2 is extreme in X, which is expressed as high leverage. At the same time, it has a high residual (extreme in Y). The combination of high leverage and high residual makes it highly influential. Outlier 3, by contrast, has high leverage but a low residual, so it has little influence.

Degree of influence: How influential a data point is depends on the combination of how extreme it is in X and in Y. In a deliberately simplified interpretation, it could be summarized like this:

  • Extreme X, Extreme Y => High influence
  • Extreme X, not extreme Y => Low influence
  • Not extreme X, Extreme Y => Some influence
  • Not extreme X, Not extreme Y => No influence
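The four combinations above can be tried out with a small experiment. This is only a sketch under my own assumptions: the base data are ten made-up points lying exactly on a line, ‘extreme Y’ is read as ‘far from the line that the other points follow’, and influence is measured simply as the change in slope when one extra point is added. The helper names `slope_change` and `on_line` are invented for this example.

```python
import numpy as np

# Hypothetical base data: ten points lying exactly on y = 1 + 2x.
x_base = np.arange(1.0, 11.0)
y_base = 1.0 + 2.0 * x_base
base_slope = np.polyfit(x_base, y_base, 1)[0]

def on_line(x):
    """Y value on the line that the base points follow."""
    return 1.0 + 2.0 * x

def slope_change(x_new, y_new):
    """Absolute change in the fitted slope when one extra point is added."""
    x = np.append(x_base, x_new)
    y = np.append(y_base, y_new)
    return abs(np.polyfit(x, y, 1)[0] - base_slope)

x_near, x_far = 7.0, 30.0  # inside the X range vs. far outside it
offset = 20.0              # vertical distance used for the "extreme Y" cases

cases = {
    "extreme X, extreme Y": (x_far, on_line(x_far) - offset),
    "extreme X, not extreme Y": (x_far, on_line(x_far)),
    "not extreme X, extreme Y": (x_near, on_line(x_near) + offset),
    "not extreme X, not extreme Y": (x_near, on_line(x_near)),
}

changes = {name: slope_change(*point) for name, point in cases.items()}
for name, change in changes.items():
    print(f"{name:30s} slope change = {change:.3f}")
```

With these numbers, the point that is extreme in both X and Y moves the slope the most, the point that is only extreme in Y moves it less, and the two points that lie on the line leave the slope practically unchanged.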

 

Graph examples

These two graphs make it possible to visualize the 3 outliers together:

Influential points_Comparing outliers

 

The following graph helps to visualize the degrees of influence:

Influential points visually

 

How to handle influential points?

For data points that are highly influential, we should first make sure that the observation was recorded correctly. If it turns out to be correctly reported, we should try removing it from the data to see just how much it influences the regression line, and ask the question: Does it make us change the conclusion?

We should not let a single data point, or even a few data points, determine our statistical conclusion. If removing the point does change the conclusion, we should try to get more observations near that X value.

If we cannot get more observations at or close to the given X value, and we can see that the point does change our statistical conclusion, we can consider presenting two reports: one with the influential point and one without.

In the second report, we don’t include the influential point in the analysis, but we do report it. This procedure, of not including the influential point in the analysis, can carry a high risk of mistakes.
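The ‘remove it and compare’ step can be sketched as follows. The data and the flagged index `suspect` are hypothetical, and looking at the sign of the slope is just one simple way of asking whether the conclusion changes; a real analysis would normally also look at tests or confidence intervals.

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Hypothetical data: a weak negative trend, plus one point far out in X and Y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0])
y = np.array([5.2, 4.9, 5.1, 4.8, 5.0, 4.7, 12.0])
suspect = 6  # index of the point flagged as possibly influential

# Boolean mask that keeps every point except the flagged one.
keep = np.arange(len(x)) != suspect

with_point = fit_line(x, y)
without_point = fit_line(x[keep], y[keep])

print(f"with point:    slope={with_point[0]:.3f}, intercept={with_point[1]:.3f}")
print(f"without point: slope={without_point[0]:.3f}, intercept={without_point[1]:.3f}")

# One simple check of whether the conclusion changes: does the slope change sign?
sign_flips = bool(np.sign(with_point[0]) != np.sign(without_point[0]))
print("slope changes sign:", sign_flips)
```

In this made-up example the slope is positive with the point and slightly negative without it, which is exactly the kind of situation where both fits should be reported.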

 

Highly influential point with low residuals

Sometimes one or more influential points can have such a high influence on the regression line that they bring the line closer to them:

Degrees of influence, visually

 

This graph shows that highly influential data points can have a low residual while being placed close to the mean of Y (ȳ), so we cannot conclude that all influential points lie far from the line with large residuals.
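A small numerical sketch of a point that pulls the line toward itself: the data below are hypothetical (five points with almost no trend and one point far out in X) and are not read off the graph. The sketch fits the line with all points, prints the residuals, and then refits without the far point.

```python
import numpy as np

# Hypothetical data: five points with almost no trend, plus one point far out in X.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([3.0, 2.0, 4.0, 3.0, 3.0, 10.0])
far = 5  # index of the point that is extreme in X

# Fit with all points and look at the residuals.
slope_full, intercept_full = np.polyfit(x, y, 1)
residuals = y - (intercept_full + slope_full * x)

# Refit without the far point to see how much it moved the line.
keep = np.arange(len(x)) != far
slope_without, intercept_without = np.polyfit(x[keep], y[keep], 1)

print("residuals with all points:", np.round(residuals, 2))
print(f"slope with the point:    {slope_full:.3f}")
print(f"slope without the point: {slope_without:.3f}")
```

Here the far point ends up with the smallest residual of all six points, even though removing it changes the slope a lot. A small residual alone therefore does not rule out high influence.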

 

Influential points and leverage via math

The leverage can be measured with this formula:

Leverage formula_influential points
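The formula image itself is not reproduced in this text. Assuming it shows the standard leverage of observation i in simple linear regression, with n observations and x̄ the mean of X, it can be written as:

```latex
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}
```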

The formula shows that the closer the Xi value falls to the mean of X, the lower the leverage. Different indicators can be used to measure the influence, such as Cook’s distance (see the Wikipedia page for Cook’s distance).

Cook’s distance measures how much the regression line changes when a point is removed from the calculation. In other words, it summarizes how far the fitted values move when the line is refitted without that point.

These measures help us decide which data points deserve “some extra” attention. Cook’s distance is usually provided in statistical software.
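For readers who want to see the numbers without statistical software, here is a minimal NumPy sketch. It assumes the standard textbook formulas for simple linear regression: leverage h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², and Cook’s distance D_i = e_i² / (p · MSE) · h_i / (1 − h_i)², with p = 2 coefficients (slope and intercept). The function names and the data are made up for this example.

```python
import numpy as np

def leverage(x):
    """Leverage h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2."""
    dev = x - x.mean()
    return 1.0 / len(x) + dev**2 / np.sum(dev**2)

def cooks_distance(x, y):
    """Cook's distance for each point of a simple linear regression."""
    n, p = len(x), 2                    # p = number of coefficients (slope, intercept)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    mse = np.sum(resid**2) / (n - p)    # estimate of the residual variance
    h = leverage(x)
    return resid**2 / (p * mse) * h / (1.0 - h) ** 2

# Hypothetical data: five points with almost no trend, plus one point far out in X.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([3.0, 2.0, 4.0, 3.0, 3.0, 10.0])

h = leverage(x)
d = cooks_distance(x, y)
for i in range(len(x)):
    print(f"point {i}: leverage={h[i]:.3f}, Cook's D={d[i]:.3f}")
```

In this example the last point has both the highest leverage and by far the largest Cook’s distance, so it is the one that should get the extra attention.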

 

 

Learning statistics

As mentioned, this chapter is based on Jeremy Balka’s JBstatistics video, which has been highly influential for it – thanks Jeremy!

Carsten Grube


Freelance Data Analyst
