+34 616 71 29 85 carsten@dataz4s.com

Coefficient of determination, r²

The coefficient of determination, r², expresses how much of the total variation in Y is described by the variation in X. Thus, it expresses how well the estimated regression line fits the observed data.

 

Key points about the coefficient of determination, r²

  •  r² expresses how well the estimated regression line fits the observed datapoints
  • A high r² (e.g. 0.9) means that it is a good fit and a low r² (e.g. 0.2) that it is a poor fit
  •  r² reflects the scatter around the regression line: the closer the points are to the line, the higher the coefficient of determination, r²
  •  r² is calculated by subtracting the error proportion from one. The total variation in Y counts as one, so removing the share that the line does not explain leaves the share that it does explain, which is the fit.

 

Is the regression line a good fit?

Let’s take our mini dataset of 4 points as an example, showing the squared errors of the line:

[Figure: Coefficient of determination, r², dataset example]

The regression line does not go through any of the observed datapoints, and some of the points are fairly far from the line, for example at X=2. As described in Regression line, this model has an r² of only 0.40, which is low, so we might not trust it for forecasting.

So, the coefficient of determination, denoted by r², tells us how good a fit the line is. An r² of 0.85 says that 85% of the variation in Y is described by the variation in X. An r² of 0.20 would be too low to call the line a good fit.

 

 r² = 1 – errors

Because r² is the proportion of the variation that the line gets right, it helps to look at the proportion that it gets wrong, which is the error. The total variation consists of these two proportions, so one minus the error (1-error) is the proportion that the line explains, which is r².

Therefore, the formula for the coefficient of determination, r², is one minus the error, where the error is SELine divided by SEȳ.
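The formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names fit_line and r_squared are made up for the example, the line is fitted by ordinary least squares, and the five datapoints are hypothetical, not the article's 4-point dataset.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys):
    """r squared = 1 - SELine / SEy-bar."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Squared errors between the observed points and the regression line
    se_line = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Squared errors between the observed points and the mean of Y
    se_mean = sum((y - mean_y) ** 2 for y in ys)
    return 1 - se_line / se_mean


# Hypothetical data, chosen only to illustrate the calculation
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(f"r squared: {r_squared(xs, ys):.2f}")  # prints "r squared: 0.60"
```

For these points SELine is 2.4 and SEȳ is 6.0, so the error share is 0.4 and the line explains the remaining 0.6 of the variation in Y.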

 

Calculating coefficient of determination, r²

In Squared error of line, we calculate the two values that make up our formula for r²: the sum of the squared errors of the line (SELine) and the sum of the squared errors around the mean of y (SEȳ). Our SELine is 1.2 and our SEȳ is 2.0, so we are now ready to calculate r²:

r² = 1 - SELine / SEȳ = 1 - 1.2 / 2.0 = 1 - 0.6 = 0.4

This means that only 40% of the variation in Y can be explained by the variation in X, and the line is therefore not a good fit. In other words, our regression model is not reliable for predictive analysis.

 

Coefficient of determination (r²) vs correlation coefficient (r)

 r² is, as the name says, r squared, so the two measures are closely related. r² expresses the proportion of the variation in Y that is explained by (not necessarily caused by) the variation in X, and it lies between 0 and 1. r, on the other hand, expresses the strength and direction of the linear relation between X and Y, and it lies between -1 and 1.
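A small Python sketch can make the difference concrete. It assumes the usual Pearson formula for r; the function name pearson_r and both datasets are hypothetical and were picked so that one set of Y values is the other in reverse order.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient r between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: ys_down is ys_up in reverse order
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 5, 4, 5]
ys_down = [5, 4, 5, 4, 2]

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)

print(f"rising:  r = {r_up:.2f}, r squared = {r_up ** 2:.2f}")
print(f"falling: r = {r_down:.2f}, r squared = {r_down ** 2:.2f}")
# prints:
# rising:  r = 0.77, r squared = 0.60
# falling: r = -0.77, r squared = 0.60
```

Both datasets give the same r², because the line fits them equally well, but only r shows whether Y rises or falls with X.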

 

Low r² does not invalidate the model

Our example showed a poor fit, with a coefficient of determination, r², of only 0.4. But the dataset also has only 4 datapoints. A different example, closer to real-life situations and with more datapoints, might look like this:

[Figure: Coefficient of determination, r², good fit with high r² example]

 

 

 

The dots are fairly close to the line, which would return a fairly high r². A low coefficient of determination, r², does not necessarily invalidate the model, however. As described in Squared errors of line, SELine and SEȳ, which make up the error term, are aggregate values over all datapoints.

So, even though r² is low, the model can still give us valuable information and predictions, because the slope of the regression line still represents the mean change in Y for one unit change in X. r² reflects the scatter around the regression line: the closer the points are to the line, the higher the coefficient of determination, r²:

 

[Figure: Coefficient of determination, r², poor fit with low r² example]
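The idea that a widely scattered dataset can still carry a usable trend can be sketched in Python. Everything here is an assumption for illustration: the function name fit_line is made up, and the data are hypothetical values built as Y = 2X plus offsets of +6 or -6, arranged so that they widen the scatter without moving the fitted line.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: Y = 2X with offsets +6, -6, -6, +6, +6, -6, -6, +6
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [8, -2, 0, 14, 16, 6, 8, 22]

slope, intercept = fit_line(xs, ys)
mean_y = sum(ys) / len(ys)

# Squared errors around the line and around the mean of Y
se_line = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
se_mean = sum((y - mean_y) ** 2 for y in ys)
r2 = 1 - se_line / se_mean

print(f"slope: {slope:.2f}")    # prints "slope: 2.00"
print(f"r squared: {r2:.2f}")   # prints "r squared: 0.37"
```

Here r² is low because every point sits 6 units from the line, yet the fitted slope still recovers the average change of 2 in Y per unit of X. Individual predictions from such a model would be imprecise, while the estimated trend remains informative.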

 

 

Coefficient of determination, r², in Excel

The coefficient of determination can be calculated with the RSQ function, which takes the Y values first and the X values second, =RSQ(known_y's, known_x's):

[Figure: Coefficient of determination, r², in Excel with the RSQ function]

 

Another way is to run the regression analysis, whose output also includes r² (shown as R Square): Data >> Data Analysis >> Regression:

[Figure: Coefficient of determination, r², in Excel with Data Analysis Regression]

 

 

Learning statistics

 

Carsten Grube

Freelance Data Analyst
