Coefficient of determination, r²
The coefficient of determination, r², expresses how much of the total variation in Y is described by the variation in X. Thus, it expresses how well the estimated regression line fits the observed data.
Key points about the coefficient of determination, r²
- r² expresses how well the estimated regression line fits the observed datapoints
- A high r² (e.g. 0.9) means that it is a good fit and a low r² (e.g. 0.2) that it is a poor fit
- r² reflects the scatter around the regression line: the closer the datapoints lie to the line, the higher the coefficient of determination, r²
- r² is calculated by subtracting the error proportion (SELine divided by SEȳ) from one. One represents the total variation in Y, so what remains after removing the error is the fit.
Is the regression line a good fit?
Let’s take our mini dataset of 4 points as an example, shown with the squared errors of the line:
The regression line does not go through any of the observed datapoints, and some of the points are fairly far from the line. At X=2, for example, the gap between the point and the line is clearly visible. As described in Regression line, this model has an r² of only 0.40, which is low, and we might not trust it for forecasting.
So, the coefficient of determination, denoted by r², tells us how good a fit the line is. An r² of 0.85 says that 85% of the variation in Y is described by the variation in X. An r² of 0.20 would be too low to call it a fit.
r² = 1 – errors
Because r² is the proportion of the variation in Y that the line explains, it helps to look first at the proportion the line does not explain, which is the error. The total variation is made up of the explained proportion and the unexplained proportion, so one minus the error (1 – error) is the explained proportion, which is r².
Therefore, the formula for the coefficient of determination, r², is one minus the error, where the error is SELine divided by SEȳ.
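Written as an equation, the formula described above could look like the following. This is a sketch of the notation, not taken from the original figures: the symbols ŷᵢ (the value the line predicts for the i-th datapoint) and n (the number of datapoints) are assumptions introduced here for illustration.

```latex
r^2 = 1 - \frac{SE_{\text{Line}}}{SE_{\bar{y}}},
\qquad
SE_{\text{Line}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
SE_{\bar{y}} = \sum_{i=1}^{n} (y_i - \bar{y})^2
```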
Calculating coefficient of determination, r²
In Squared error of line, we calculate the two values that compose our formula for r². These values are the sum of the squared error of the line (SELine) and the sum of the squared error of mean y (SEȳ). Our SELine is 1.2 and our SEȳ is 2.0, so we are now ready to calculate the r²:
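With the two values stated above, the arithmetic can be written out step by step (a reconstruction from the definitions in this section):

```latex
r^2 = 1 - \frac{SE_{\text{Line}}}{SE_{\bar{y}}}
    = 1 - \frac{1.2}{2.0}
    = 1 - 0.6
    = 0.40
```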
This means that only 40% of the variation in Y can be explained by the variation in X, and the line is therefore not a good fit. In other words, our regression model is not reliable for predictive analysis.
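The same calculation can be sketched in Python, starting from raw datapoints. The four points below are hypothetical: they were chosen for illustration because they give an SELine of 1.2 and an SEȳ of 2.0, and they are not necessarily the dataset used in the figures. All variable names are likewise illustrative.

```python
# Hypothetical 4-point dataset, chosen so that SELine = 1.2 and SEy-bar = 2.0.
xs = [1, 2, 3, 4]
ys = [2, 1, 2, 3]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept of the regression line.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Sum of squared errors around the line, and around the mean of y.
se_line = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
se_mean_y = sum((y - mean_y) ** 2 for y in ys)

# r squared is one minus the error proportion.
r_squared = 1 - se_line / se_mean_y

print(round(se_line, 2), round(se_mean_y, 2), round(r_squared, 2))
# prints: 1.2 2.0 0.4
```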
Coefficient of determination (r²) vs correlation coefficient (r)
In simple linear regression, r² is, as the name says, r squared, so the two measures are closely related. r² expresses the proportion of the variation in Y that is explained by the variation in X; it does not show that X causes Y. r, on the other hand, expresses the strength and direction of the linear relation between X and Y.
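A minimal sketch of this relation, assuming simple linear regression with one X variable. The helper name pearson_r and both datasets are hypothetical and only serve as illustration. The second dataset slopes downwards, which flips the sign of r but leaves r² unchanged.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: one upward and one downward relation.
xs = [1, 2, 3, 4]
ys_up = [2, 1, 2, 3]
ys_down = [3, 2, 1, 2]

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)

print(round(r_up, 3), round(r_up ** 2, 2))      # prints: 0.632 0.4
print(round(r_down, 3), round(r_down ** 2, 2))  # prints: -0.632 0.4
```

r carries the direction of the relation in its sign, while r² only says how much of the variation is explained.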
Low r² does not invalidate the model
Our example showed a poor fit, with a coefficient of determination, r², of only 0.4. However, the dataset has only 4 datapoints. A different example, closer to real-life situations with more datapoints, might look like this:
The dots are fairly close to the line, which would give a fairly high r². A low coefficient of determination, r², does not necessarily invalidate the model. As described in Squared errors of line, SELine and SEȳ, which compose the error of the line, are sums over all datapoints.
So, even though r² is low, the model can still give us valuable information and predictions, because the slope of the regression line still represents the mean change in Y for a one-unit change in X. r² itself represents the scatter around the regression line. The closer the datapoints are to the line, the higher the coefficient of determination, r²:
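To illustrate the point, here is a small sketch with two constructed datasets that share the same underlying line (Y = 1 + 2X) but differ in scatter. Everything in it is hypothetical: the function name fit_and_r_squared, the helper make_ys, and the residual pattern, which was picked so that it sums to zero and is uncorrelated with X, so the fitted slope stays the same.

```python
def fit_and_r_squared(xs, ys):
    """Return the least-squares slope and r squared for one X variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    se_line = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se_mean_y = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1 - se_line / se_mean_y

xs = [1, 2, 3, 4, 5, 6]
# Constructed residual pattern: sums to zero and is uncorrelated with xs.
pattern = [1, -1, 0, 0, -1, 1]

def make_ys(scale):
    """Points on the line Y = 1 + 2X, moved off the line by scale * pattern."""
    return [1 + 2 * x + scale * e for x, e in zip(xs, pattern)]

slope_tight, r2_tight = fit_and_r_squared(xs, make_ys(1))  # little scatter
slope_loose, r2_loose = fit_and_r_squared(xs, make_ys(6))  # much scatter

print(round(slope_tight, 2), round(r2_tight, 2))  # prints: 2.0 0.95
print(round(slope_loose, 2), round(r2_loose, 2))  # prints: 2.0 0.33
```

In both cases the slope is 2, so the estimated mean change in Y per unit of X is the same. Only the scatter around the line, and therefore r², differs.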
Coefficient of determination, r², in Excel
The coefficient of determination can be calculated with the RSQ function:
Another way is to run the regression analysis where r² also is included: Data >> Data Analysis >> Regression:
- Khan Academy (video 12:41): R-squared or coefficient of determination
- The Minitab Blog (text): Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?
- MIT OpenCourseWare (video 8:46, about r2 after 5:43): The statistical sommelier: An introduction to linear regression