# Influential points

Influential points in simple linear regression are points that, when removed from the calculation, cause a ‘great’ change in the regression line. The term ‘influential points’ is typically applied when assessing **outliers**. Influential points typically have high **leverage** (extreme in X) and/or a high **residual** (extreme in Y).

**About this chapter**

This chapter is basically a walk-through of **Jeremy Balka’s** video *Leverage and influential points in simple linear regression*, where the concepts of influential points and leverage are explained in theory and through examples. The graphs below are hand-drawn for illustrative purposes only.

**Outlier with little influence**

I find that influential points and leverage in SLR are best explained by visualizing them in plots:

**Residual** refers to the vertical distance (in Y) between a point and the regression line. If a point lies ‘far’ above or below the line, compared with most other points, it has a high residual.

**Outlier 1** has a large **residual** because it lies at a ‘greater’ **vertical distance** from the regression line. As the graph illustrates, the outlier has **some influence** on the regression line.

**Outlier with more influence**

Outlier 2 has greater influence on the regression line:

Outlier 2 is **extreme in X**, which is expressed as **high leverage (X)**. At the same time, it has a **high residual (Y)**. The combination of high leverage and high residual makes it **highly influential**. Outlier 3 has **high leverage** but a **low residual**, so it has **little influence**.

**Degree of influence:** How influential a data point is depends on the combination of how extreme it is in X and in Y. In an exaggerated interpretation, it could be explained like this:

- Extreme X, extreme Y => High influence
- Extreme X, not extreme Y => Low influence
- Not extreme X, extreme Y => Some influence
- Not extreme X, not extreme Y => No influence
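The four cases above can be tried out numerically. The following is a minimal sketch in plain Python and is not taken from the video: the data, the helper `fit_line`, and the choice to use the change in slope as a rough stand-in for influence are all assumptions made for illustration. ‘Extreme Y’ is read here as lying far from the trend of the other points.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up base data lying exactly on the line y = 2 + x
base_x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
base_y = [2 + x for x in base_x]

# One extra point per case (hypothetical values).
# "Extreme Y" = 10 units above the trend, "not extreme Y" = 1 unit above it.
cases = {
    "extreme X, extreme Y": (20, 32),
    "extreme X, not extreme Y": (20, 23),
    "not extreme X, extreme Y": (3, 15),
    "not extreme X, not extreme Y": (3, 6),
}

_, base_slope = fit_line(base_x, base_y)

slope_change = {}
for label, (x_new, y_new) in cases.items():
    # Refit with the extra point added and compare with the base slope
    _, slope = fit_line(base_x + [x_new], base_y + [y_new])
    slope_change[label] = slope - base_slope
    print(f"{label}: slope changes by {slope_change[label]:+.3f}")
```

With these particular numbers, the size of the slope change follows the same ranking as the list. Other data can rank the two middle cases differently, which is why the list is only an exaggerated interpretation.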

**Graph examples**

These two graphs make it possible to visualize the 3 outliers together:

The following graph helps to visualize the **degrees of influence**:

**How to handle influential points?**

For data points that are highly influential, we should **first make sure that the observation is recorded correctly**. If it turns out to be correctly reported, we should try to **remove it** from the data, see just how much it influences the regression line, and ask the question: *Does it make us change the conclusion?*

We should not let a single datapoint, or even a few datapoints, decide our statistical conclusion. If removing the point does change the statistical conclusion, we should try to get more observations near that X value.

If we cannot get more observations at or close to the given X value, and the point does change our statistical conclusion, we can consider producing two reports: one with the influential point and one without.

In the second report, we do not include the influential point in the analysis, but we still report it. Leaving the influential point out of the analysis can bring a high risk of mistakes.

**Highly influential point with low residuals**

Sometimes one or more influential points can have such a high influence on the regression line that they bring the line closer to them:

This graph shows that **highly influential data points can have a low residual** while lying close to the mean of Y (ȳ), so we cannot conclude that all influential points lie far from the line with large residuals.
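A small numeric sketch of this effect, using made-up data (not the data behind the graph) and a hypothetical helper `fit_line`: a cluster with almost no trend plus one point far out in X. The far point pulls the line toward itself, so its own residual ends up small even though removing it changes the slope a lot.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: five clustered points and one point far out in X (the last)
xs = [1, 2, 3, 4, 5, 30]
ys = [5, 3, 6, 4, 5, 20]

intercept_all, slope_all = fit_line(xs, ys)
_, slope_without = fit_line(xs[:-1], ys[:-1])

# Residuals measured against the line fitted to ALL points
residuals = [y - (intercept_all + slope_all * x) for x, y in zip(xs, ys)]

print(f"slope with the far point:    {slope_all:.3f}")
print(f"slope without the far point: {slope_without:.3f}")
print(f"residual of the far point:   {residuals[-1]:.3f}")
print("all residuals:", [round(r, 3) for r in residuals])
```

In this example the far point has the smallest absolute residual of all six points, so a residual plot alone would not flag it.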

**Influential points and leverage via math**

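The leverage can be measured with this formula, where n is the number of observations and X̄ is the mean of X:

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}$$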

The formula shows that the closer the Xi value falls to the mean of X, the lower the leverage. Different indicators can be used to measure influence, such as Cook’s distance (see the Wikipedia page for Cook’s distance).
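As a minimal sketch, the standard leverage formula for simple linear regression, 1/n plus the squared distance of Xi from the mean of X divided by the sum of all such squared distances, can be computed in plain Python. The function name `leverages` and the X values are assumptions for illustration.

```python
def leverages(xs):
    """Leverage h_i of each observation in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Hypothetical X values: five clustered points and one far out in X
xs = [1, 2, 3, 4, 5, 30]
h = leverages(xs)

for x, h_i in zip(xs, h):
    print(f"X = {x:>2}: leverage = {h_i:.3f}")

# In simple linear regression the leverages add up to 2
# (one for the intercept, one for the slope).
print(f"sum of leverages = {sum(h):.3f}")
```

Note that leverage depends only on the X values. The Y values play no role, which is why leverage alone cannot tell us whether a point is influential.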

**Cook’s distance** **measures how much the regression line changes when a point is removed from the calculation**. The larger the distance, the more influential the point.

These measures allow us to consider which datapoints should have “some extra” attention. Cook’s distance is usually provided by statistical software.
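To make the ‘remove a point and see how much the line changes’ idea concrete, here is a minimal leave-one-out sketch of Cook’s distance in plain Python. It assumes the usual definition: the sum of squared changes in the fitted values when observation i is dropped, divided by p times the residual variance, with p = 2 coefficients in simple linear regression. The helper names and the data are hypothetical, and in practice statistical software would be used instead.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def cooks_distances(xs, ys):
    """Cook's distance for each observation, computed by refitting without it."""
    n, p = len(xs), 2  # p = number of coefficients (intercept and slope)
    b0, b1 = fit_line(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    # Residual variance estimate from the full fit
    s2 = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit the line without observation i
        a0, a1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Squared shift of the fitted values at all n X values
        shift = sum((f - (a0 + a1 * x)) ** 2 for x, f in zip(xs, fitted))
        distances.append(shift / (p * s2))
    return distances


# Hypothetical data: roughly y = 2 + x, plus one point far out in X
# that lies well below that trend (the last one)
xs = [1, 2, 3, 4, 5, 6, 7, 8, 20]
ys = [3.1, 3.9, 5.2, 5.8, 7.1, 8.0, 8.9, 10.2, 12.0]

d = cooks_distances(xs, ys)
for x, y, d_i in zip(xs, ys, d):
    print(f"X = {x:>2}, Y = {y:>4}: Cook's distance = {d_i:.3f}")
```

In this made-up data set the last point gets by far the largest Cook’s distance, even though it does not have the largest residual in the full fit.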

## Learning statistics

The JBstatistics video listed below is, as mentioned, Jeremy Balka’s video, which has been highly influential for this chapter. Thanks, Jeremy!

- JBstatistics (video 7:13): Leverage and influential points in simple linear regression
- Onlinestatbook (text page): Influential observations
- Stattrek (text page with embedded video): Influential points

#### Carsten Grube

Freelance Data Analyst
