The regression line is also known as the trendline and, in statistics, as the line of best fit. It is the straight line that best fits the relationship between X and Y. Regression analysis explores that relationship, and the regression line is the model used to forecast Y values for given X values.
Regression line example (weak fit)
In real-world cases we will typically work with larger datasets. I will use a mini dataset of 4 datapoints to run some step-by-step examples below:
Let’s plot the four X and Y pairs in a scatter plot:
The regression line for these 4 datapoints would be the red line:
But is this regression line a good/strong fit to the datapoints? This can be answered by the correlation coefficient (r):
The calculated correlation coefficient (r) for our line is 0.63, which gives an r² of 0.40. Therefore, approximately 40% of the variation in Y can be explained by the variation in X.
That is a relatively low value, so our model is not a good fit. We could already see this from the graph, because the datapoints fall relatively far from the line.
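To make the r and r² figures concrete, here is a minimal Python sketch of the Pearson correlation coefficient. The dataset table is not reproduced in this text, so the four (X, Y) pairs below are an assumption: illustrative values chosen because they reproduce the figures quoted above (r ≈ 0.63, r² = 0.40). The function name `pearson_r` is hypothetical.

```python
from math import sqrt

# ASSUMPTION: illustrative data, not necessarily the article's original table.
xs = [1, 2, 3, 4]
ys = [2, 1, 2, 3]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(xs, ys)
r_squared = r ** 2
print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")  # r = 0.63, r^2 = 0.40
```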
In some cases, this can be addressed by transforming the data to achieve linearity; see: Transforming data to achieve linearity.
In a real-world case, we would first of all consider getting more observations than just these four, but for this exercise example, I’ll leave it as is.
Regression line example (strong fit)
In the case of a strong fit, we can use the line to predict Y values for given X values that are not included in our observations. Let’s take this mini dataset as an example:
With an r² = 0.974, we can say that approximately 97.4% of the variation in Y can be explained by the variation in X. That means that we can now predict a Y value for a given X value that has not been observed.
For example, there are cases when Y is difficult to measure and X easy to measure as in the case of the dataset of Australian timber. In this case the density of the wood (X) is easy to measure and the hardness/durability (Y) is difficult to measure.
With a strong fit, we can immediately predict the hardness (Y) for given densities (X), which would have been very difficult and costly without the regression model.
In our mini data example, there is no observation for X=2.5, but as our line is a very good fit to the datapoints, we can simply read from the line at X=2.5:
We find a Y around 2.3 at X=2.5. In a real-world situation, our statistical software will calculate this Y value for us and report the associated uncertainty as well.
We cannot extrapolate. That means that we can never follow the line outside the range of the observed X values. We have no observations in these areas, so we cannot simply extend the regression line below the lowest or above the highest observed X value:
The reason that we cannot extrapolate is that we cannot estimate for an interval of X values that we have not observed. Outside the observed X values, the relationship could be completely different, as in this example, where we only have observations from X≈90 upward. Below X≈90, we don’t know what happens, and it could be a completely different situation:
More on this in Caution in simple linear regression.
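The no-extrapolation rule can also be enforced in code. The sketch below is an illustration under assumptions: the function name `predict_within_range`, the placeholder coefficients and the observed X values are all hypothetical, and they are not fitted from the datasets shown in this article.

```python
def predict_within_range(x, slope, intercept, observed_xs):
    """Predict Y on the fitted line, but only inside the observed X range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if not lowest <= x <= highest:
        raise ValueError(
            f"X={x} is outside the observed range [{lowest}, {highest}]; "
            "refusing to extrapolate."
        )
    return intercept + slope * x

# ASSUMPTION: placeholder values for illustration only.
observed_xs = [1, 2, 3, 4]
slope, intercept = 0.9, 0.05

# Interpolation between observed X values is allowed.
inside = predict_within_range(2.5, slope, intercept, observed_xs)

# Extrapolation beyond the highest observed X value is rejected.
try:
    predict_within_range(6, slope, intercept, observed_xs)
    extrapolated = True
except ValueError:
    extrapolated = False
```

Raising an error rather than returning a number is a design choice: it makes an accidental extrapolation visible instead of silently producing a value that the data cannot support.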
The estimated and the true line
When referring to the regression line, we often refer to the estimated regression line. With the estimated regression line, we intend to estimate the true line. In the following, we will see the different notations for these two lines.
There are several different formulas and ways to calculate the regression estimates, such as the slope, the intercept and others that I will get to in later chapters. So you may find formulas for calculating the slope and intercept that differ from the ones I use below.
The estimated regression line
The equation of the regression line is typically expressed in one of these ways:
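The notation varies between textbooks and software. As a sketch (the forms originally shown are not reproduced in this text, so the first one is an assumed common convention), the estimated line is often written in one of these equivalent ways, the second being the notation used in the calculation further below:

```latex
\hat{y} = b_0 + b_1 x
\qquad\text{or}\qquad
y = b + mx
```

In the first form, the intercept is b0 and the slope is b1; in the second, the intercept is b and the slope is m.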
However, we are estimating, and as we saw in the scatter plots above, the regression line does not pass exactly through all datapoints. There is some variability, or error, around the line. This is expressed in the equation for the true line:
The true regression line
The true line includes an epsilon (ε) which expresses that the line includes some variability in the Y values about the line:
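In the common textbook notation (assumed here, since the original equation is not reproduced in this text), Greek letters denote the unknown true intercept and slope, and ε is added as the error term:

```latex
y = \beta_0 + \beta_1 x + \varepsilon
```

Here β0 and β1 play the roles that the intercept and the slope play in the estimated line.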
The epsilon (ε) is a random error component. It expresses that the Y values will not fall exactly on the line. In other words, we can say that it accounts for the variability of the Y values about the line. Or, (very) loosely speaking, the equation for the true line says: “This is the line, but just keep in mind that it will have some errors.”
Therefore, the epsilon (ε) is not needed in the equation for the estimated line: because it is an estimate, it includes this “not-being-the-exact-sure-thing” by definition. An estimate includes errors.
In the following I will work through the calculation of the estimated line using the notation: y=b+mx.
Calculating the estimated regression line
Let’s use the first displayed mini dataset from above to run through the following example calculating the estimated line:
Say we want to analyze the relationship between production volume (in weight) and production costs for a dog food producer. It seems logical that the more dog food we produce, the higher the total production costs. But there could be synergies and bulk savings, so the relationship could be non-linear.
We would like answers to these questions:
- Would it be a linear regression?
- Would there be a linear relationship between the production in weight and the production costs?
We let the production in weight be X and the total production costs be Y. Earlier, we saw that the correlation coefficient (r) is weak for this dataset, but let’s use the example for the sake of the exercise:
Step 1: Calculate the slope (m)
The following formula can be used for the calculation of the slope (m):
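A common way to write the slope in terms of four means (of X, Y, XY and X²) is sketched below. It is a reconstruction under that assumption, with an overline denoting a mean; other, algebraically equivalent forms exist.

```latex
m = \frac{\bar{x}\,\bar{y} - \overline{xy}}{(\bar{x})^{2} - \overline{x^{2}}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,\quad
\overline{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i,\quad
\overline{x^{2}} = \frac{1}{n}\sum_{i=1}^{n} x_i^{2}
```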
The formula uses four mean values, which could all be calculated directly within the fraction. I prefer to calculate each mean value separately. Here are the by-hand calculations of the means of X, Y, XY and X²:
Here are the same calculations in a spreadsheet:
Plugging these values into the slope formula, we get a slope (m) = 0.40:
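The same calculation can be sketched in Python. The dataset table is not reproduced in this text, so the four (X, Y) pairs below are an assumption: illustrative values chosen because they yield the slope of 0.40 stated above. The mean-based slope formula used here is one of several equivalent ways to compute the slope, and the variable names are hypothetical.

```python
# ASSUMPTION: illustrative data, not necessarily the article's original table.
xs = [1, 2, 3, 4]
ys = [2, 1, 2, 3]
n = len(xs)

# The four means used by the slope formula.
mean_x = sum(xs) / n                               # mean of X
mean_y = sum(ys) / n                               # mean of Y
mean_xy = sum(x * y for x, y in zip(xs, ys)) / n   # mean of X*Y
mean_x2 = sum(x * x for x in xs) / n               # mean of X squared

# Mean-based slope formula.
m = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x2)

print(mean_x, mean_y, mean_xy, mean_x2)  # 2.5 2.0 5.5 7.5
print(round(m, 2))                       # 0.4
```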
Step 2: Calculate the intercept (b)
The intercept (b) is the Y value where X=0, that is, the point where the regression line crosses the vertical Y axis. The formula and the calculation return an intercept (b) of 1.0:
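A standard formula for the intercept in the y = b + mx notation follows from the fact that the least-squares line always passes through the point of means. This is assumed to be the form behind the calculation above, which is not reproduced in this text:

```latex
b = \bar{y} - m\,\bar{x}
```

With m = 0.40 and b = 1.0, the means of the dataset therefore satisfy ȳ = 1.0 + 0.40·x̄.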
Step 3: Express line & equation
We have now calculated the slope (0.40) and the y-intercept (1.0) and we then get the regression line equation:
y = 1 + 0.4X
The slope (m) = 0.40 means that when X increases by 1, Y increases by 0.4, so we can then draw the line:
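As a small sketch, the fitted equation can be turned into a Python function. The coefficients are the ones calculated above; the function name `predict` and the sample X values are hypothetical, chosen only to show the slope interpretation and to generate points for drawing the line.

```python
INTERCEPT = 1.0  # b, from Step 2
SLOPE = 0.4      # m, from Step 1

def predict(x):
    """Return the Y value on the estimated line y = b + m*x."""
    return INTERCEPT + SLOPE * x

# ASSUMPTION: sample X values for illustration only.
sample_xs = [1, 2, 3, 4]
line_points = [(x, predict(x)) for x in sample_xs]

# Each one-unit step in X raises the predicted Y by the slope.
steps = [predict(x + 1) - predict(x) for x in sample_xs[:-1]]

for x, y in line_points:
    print(f"X={x}: Y={y:.1f}")
```

Connecting any two of the printed points with a straight line reproduces the regression line.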
The regression line in Excel
In Excel, you can use Data >> Data Analysis >> Regression, checking the box:
Another option is to use Insert >> Scatter plot and, in Chart Layout, choose the Quick Layout that includes the trendline and the equation.
I find these tutorials on the regression line very helpful:
- Khan Academy (video 7:47): Fitting a line to data
- Jbstatistics (video 8:08): Introduction to simple linear regression
- Statistics How To (text page including embedded videos): Linear Regression: Simple Steps, Video. Find Equation, Coefficient, Slope