+34 616 71 29 85 carsten@dataz4s.com

Regression line

The regression line is also known as the trendline and, in statistics, as the line of best fit. It is the straight line that best fits the relationship between X and Y. Regression analysis explores that relationship, and the regression line is the model used to forecast Y values for given X values.

 

Regression line example (weak fit)

In real-world cases we will typically work with larger datasets. I will use a mini dataset of 4 datapoints to run some step-by-step examples below:

Let’s plot the four (X, Y) pairs in a scatter plot:

From scatter plot to regression line

 

The regression line for these 4 datapoints would be the red line:

 

Regression line

 

  

But is this regression line a good/strong fit to the datapoints? This can be answered by the correlation coefficient (r):

 

Correlation coefficient r and regression line

The calculated correlation coefficient (r) for our line is 0.63, which gives an r² of 0.40. We can therefore say that approximately 40% of the variation in Y can be explained by the variation in X.
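As a minimal sketch of how r and r² could be computed by hand, the snippet below uses a hypothetical four-point dataset. The values are an assumption on my part (the data table is not reproduced in this text); they were chosen because they give the same r of about 0.63 and r² of 0.40 quoted above.

```python
import math

# Hypothetical four-point dataset (an assumption, not necessarily the article's data).
xs = [1, 2, 3, 4]
ys = [1, 2, 3, 2]


def pearson_r(xs, ys):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(xs, ys)
r_squared = r ** 2  # share of the variation in Y explained by X
print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```

In practice, statistical software reports both values directly; the point here is only to show where the 40% comes from.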

That is a relatively weak fit, so our model does not describe the data well. We could already see this from the graph, because the datapoints fall relatively far from the line.

In some cases, this can be dealt with by transforming data and achieving linearity, see: Transforming data to achieve linearity.

In a real-world case, we would first of all consider getting more observations than just these four, but for this exercise example, I’ll leave it as is.

 

Regression line example (strong fit)

In the case of a strong fit, we can use the line to predict Y values for given X values that are not included in our observations. Let’s take this mini dataset as an example:

Regression line

With an r² = 0.974, we can say that approximately 97.4% of the variation in Y can be explained by the variation in X. That means that we can now predict a Y value for a given X value that has not been observed.

For example, there are cases when Y is difficult to measure and X easy to measure as in the case of the dataset of Australian timber. In this case the density of the wood (X) is easy to measure and the hardness/durability (Y) is difficult to measure.

With a strong fit we can immediately predict the hardness (Y) for given densities (X), which, without the regression model, would have been very difficult and costly to obtain.

In our mini data example, there is no observation for X=2.5, but as our line is a very good fit to the datapoints, we can simply read from the line at X=2.5:

Reading the regression line

We find a Y around 2.3 at X=2.5. In a real-world situation, our statistical software will calculate this Y-value for us and report the associated uncertainty as well.

 

No extrapolation

We cannot extrapolate. That means that we can never follow the line outside the range of observed X-values. We don’t know these areas, so we cannot just follow the regression line below the lowest or above the highest observed X-value:

Extrapolation and the regression line

 

The reason that we cannot extrapolate is that we cannot estimate for an interval of X values that we have not observed. Outside the observed X values, the relationship could be completely different, as in this example, where we only have observations from X≈90 upwards. Below X≈90, we don’t know what happens, and it could be a completely different situation:

Extrapolation and the regression line
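One way to enforce the no-extrapolation rule in code is to refuse any prediction outside the observed X range. The sketch below is only an illustration: the function name `predict_within_range`, the observed X values and the coefficients are all hypothetical placeholders.

```python
def predict_within_range(x_new, observed_xs, intercept, slope):
    """Return intercept + slope * x_new, but only inside the observed X range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if not lowest <= x_new <= highest:
        raise ValueError(
            f"X={x_new} is outside the observed range [{lowest}, {highest}]; "
            "refusing to extrapolate"
        )
    return intercept + slope * x_new


observed_xs = [1, 2, 3, 4]   # hypothetical observed X values
intercept, slope = 1.0, 0.4  # hypothetical coefficients, for illustration only

# Interpolation: X=2.5 lies between the lowest and highest observed X.
inside = predict_within_range(2.5, observed_xs, intercept, slope)

# Extrapolation: X=10 lies beyond the highest observed X, so an error is raised.
try:
    predict_within_range(10, observed_xs, intercept, slope)
    extrapolation_allowed = True
except ValueError:
    extrapolation_allowed = False
```

Raising an error rather than returning a number makes the limitation hard to overlook when the model is reused later.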

More on this in Caution in simple linear regression.

 

 

The estimated and the true line

When referring to the regression line, we often refer to the estimated regression line. With the estimated regression line, we intend to estimate the true line. In the following, we will see the different notations for these two lines.

There are several different formulas and ways to calculate the different regression estimates like slope, intercept and others that I will get to in the chapters further ahead. So, you will find different formulas for calculating slope and intercept than the ones I’m using below.

 

The estimated regression line

The equation of the regression line is typically expressed in one of these ways:

 

Regression line formula. Different notations

 

However, we are estimating and, as we saw in the scatter plots above, the regression line does not pass precisely through all datapoints. There is some variability, or error, about the line. This is expressed in the equation for the true line:

 

The true regression line

The true line includes an epsilon (ε) which expresses that the line includes some variability in the Y values about the line:

True regression line formula. Different notations

The epsilon (ε) is a random error component. It expresses that the Y values will not fall exactly on the line. In other words, we can say that it accounts for the variability of the Y values about the line. Or (very) loosely speaking, the equation for the true line says: “This is the line, but just keep in mind that it will have some errors.”
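The formula figures are not reproduced in this text, so as a hedged illustration, here are the two lines in common textbook notation. The symbols b₀, b₁, β₀ and β₁ are my assumption and may differ from the notations shown in the original figures; b and m are the ones used in the rest of this article.

```latex
% Estimated line, computed from the sample; the hat marks a predicted value
\hat{y} = b_0 + b_1 x \qquad \text{or} \qquad \hat{y} = b + m x

% True line, with a random error term for each observation
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
```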

Therefore, the epsilon (ε) is not needed in the equation for the estimated line, because, as it is an estimation, it includes this “not-being-the-exact-sure-thing” by definition. An estimation includes errors.

In the following I will work through the calculation of the estimated line using the notation: y=b+mx.

 

Calculating the estimated regression line

Let’s use the first displayed mini dataset from above to run through the following example calculating the estimated line:

 

The case

Say we are to analyze the relationship between the amount produced (in weight) and the production costs for a dogfood producer. It would be logical to think that the more dogfood we produce, the higher the total costs of production. But there could be synergies and bulk savings, so the relation could be non-linear.

We would like answers to these questions:

  • Would it be a linear regression?
  • Would there be a linear relation between the production in weight and the production costs?

We let the production in weight be X and the total production costs be Y. We saw earlier that the correlation coefficient (r) is weak for this dataset, but let’s use the example for the sake of the exercise:

 

Step 1: Calculate the slope (m)

The following formula can be used for the calculation of the slope (m):

Regression line slope formula
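The formula figure is not reproduced in this text. One form that is consistent with the four means used in this step (the means of X, Y, XY and X²) would be the following, where a bar denotes a mean; take the exact layout as an assumption:

```latex
m = \frac{\bar{x}\,\bar{y} - \overline{xy}}{(\bar{x})^{2} - \overline{x^{2}}}
```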

Each of the four mean values could be calculated inside the fraction itself, but I prefer to calculate each mean value separately. Here are the by-hand calculations of the means of X, Y, XY and X²:

Calculations for regression line slope

 

 

Here, the same calculations visualized in a spreadsheet:

Mean values for calculating regression line slope. Spreadsheet

 

 

Plugging these values into the slope-formula, we get a slope (m) = 0.40:

Formula and calculation regression line slope
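The same slope calculation can be sketched in a few lines of Python. The dataset below is hypothetical (the original table is not reproduced here); I chose it because it reproduces the slope of 0.40 stated above. The formula used is the four-means form, slope = (mean(X)·mean(Y) − mean(XY)) / (mean(X)² − mean(X²)).

```python
# Hypothetical four-point dataset (an assumption, not necessarily the article's data).
xs = [1, 2, 3, 4]
ys = [1, 2, 3, 2]
n = len(xs)

# The four means used by the slope formula.
mean_x = sum(xs) / n                               # 2.5
mean_y = sum(ys) / n                               # 2.0
mean_xy = sum(x * y for x, y in zip(xs, ys)) / n   # 5.5
mean_x2 = sum(x * x for x in xs) / n               # 7.5

# Slope from the four means.
m = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x2)
print(f"slope m = {m:.2f}")
```

Keeping the four means as separate variables mirrors the by-hand and spreadsheet steps, which makes each intermediate value easy to check.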

 

Step 2: Calculate the intercept (b)

The intercept (b) is the value of Y where X=0, which is where the regression line intercepts (crosses) the vertical Y axis. The formula and the calculation return an intercept (b) of 1.0:

Formula and calculation regression line intercept
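The intercept figure is not reproduced in this text. A standard way to obtain the intercept, which I assume is the one shown, uses the fact that the least-squares line passes through the point of means:

```latex
% The fitted line passes through (\bar{x}, \bar{y}), so the intercept follows from the slope
b = \bar{y} - m\,\bar{x}
```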

 

Step 3: Express line & equation

We have now calculated the slope (0.40) and the y-intercept (1.0) and we then get the regression line equation:

y = 1 + 0.4X

The slope (m) = 0.40 means that when X increases by 1, Y increases by 0.4, so we can then draw the line:

Regression line graph with illustrated intercept
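Putting the steps together, here is a minimal sketch that fits the line and evaluates it at each X. The function name `fit_line` and the dataset are hypothetical; the data were chosen because they reproduce the equation y = 1 + 0.4X from the text.

```python
def fit_line(xs, ys):
    """Return (intercept b, slope m) of the least-squares line y = b + m*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    m = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x2)
    b = mean_y - m * mean_x  # the line passes through (mean_x, mean_y)
    return b, m


# Hypothetical four-point dataset (an assumption, not necessarily the article's data).
xs = [1, 2, 3, 4]
ys = [1, 2, 3, 2]

b, m = fit_line(xs, ys)
fitted = [b + m * x for x in xs]                  # points on the line
residuals = [y - f for y, f in zip(ys, fitted)]   # vertical distances to the line

print(f"y = {b:.1f} + {m:.1f}X")
for x, f in zip(xs, fitted):
    print(f"X={x}: fitted Y={f:.1f}")
```

Each step of 1 in X raises the fitted Y by the slope, and the residuals of a least-squares line sum to zero, which is a quick sanity check on the calculation.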

 

 

The regression line in Excel

In Excel you can use Data >> Data Analysis >> Regression, checking the box:

Regression line in Excel

Another option is to use Insert >> Scatter plot and, in Chart Layout, choose the Quick Layout that includes the trendline and the equation.

 

Learning statistics

I find these tutorials on the regression line very helpful:

Carsten Grube

Freelance Data Analyst
