+34 616 71 29 85 carsten@dataz4s.com

# Correlation coefficient

The correlation coefficient, denoted by r, describes how well the regression line fits the given datapoints of X and Y. The closer r is to 1 or to -1, the better the fit of the line; the closer r is to 0, the weaker the fit.

## r expresses the strength of the regression line

The regression line is the best possible fit to the datapoints. But that doesn’t mean that it is a good fit. It’s only the “best possible”. As described in Scatter plots, the fit can be weak or strong, or anywhere in between. The correlation coefficient (r) describes this degree of strength in the line.

Here are some eyeballed examples explaining the correlation coefficient (r):

## Calculating the correlation coefficient

The formula for calculating the correlation coefficient:

Subtracting the mean from each datapoint and dividing by the standard deviation gives us the Z-score. So, what this formula says is: multiply the Z-score for X by the Z-score for Y for each datapoint, add up these products, and divide the sum by the degrees of freedom (df = n − 1), which ties the result to the sample size.
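The formula itself is not reproduced in this text. A standard way to write the sample correlation coefficient that matches the description is sketched below. The notation is an assumption: $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ the sample standard deviations, and $n$ the number of datapoints.

$$
r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right),
\qquad
s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}},
\quad
s_y = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}}
$$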

This can also be written as the sum of Z-score(x) times Z-score(y), divided by df: r = Σ(z_x · z_y) / df.

For the sake of learning by doing, I will take a four-datapoint example:

Visualizing the datapoints and the regression line in a scatterplot:

Now, let’s plug our values into the formula:

Which can also be written:

In a spreadsheet it could be set up as follows:
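The same column-by-column calculation can be sketched in Python. The four datapoints below are hypothetical, since the article's own values are not reproduced in the text; they are picked only so that the result lands near the 0.94 discussed here. The function names are illustrative as well.

```python
import statistics

# Hypothetical four-point dataset, chosen only for illustration.
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]


def z_scores(values):
    """Return (value - mean) / sample standard deviation for each value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, i.e. divides by n - 1
    return [(v - mean) / sd for v in values]


def correlation(xs, ys):
    """Sum of the paired Z-score products divided by the degrees of freedom."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


r = correlation(xs, ys)
print(round(r, 2))
```

Each list comprehension plays the role of one spreadsheet column: the Z-scores for X, the Z-scores for Y, and their products, which are then summed and divided by n − 1.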

### Concluding on the correlation coefficient, r

We get a correlation coefficient of 0.94, which is very close to +1, so we conclude that the line has a very strong fit. As we recall from Scatter plots, the association can also be described as positive and linear.

## Correlation coefficient (r) vs coefficient of determination (r²)

The correlation coefficient (r) and the coefficient of determination (r²) are closely related: as the notation indicates, r² is simply r squared. Whereas r expresses the strength and direction of the linear association between X and Y, r² expresses the percentage, or proportion, of the variation in Y that can be explained by the variation in X.

In our 4 datapoint mini example above we had the following results for r and r²:

• r = 0.94
• r² = 0.89 (squaring the unrounded r; the rounded 0.94² alone gives 0.88)

As mentioned in the conclusion, r = 0.94 expresses that we have a line with a very strong fit, and r² = 0.89 means that 89% of the variation in Y can be explained by the variation in X.
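To make the "explained variation" reading concrete, here is a minimal Python sketch that fits the least-squares line and compares r² with the share of the variation in Y that the line accounts for. The datapoints are hypothetical (not taken from the article), and all variable names are illustrative.

```python
import statistics

# Hypothetical four-point dataset (illustrative, not taken from the article).
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]

x_mean = statistics.mean(xs)
y_mean = statistics.mean(ys)

# Sums of squared deviations and of cross-products.
sxx = sum((x - x_mean) ** 2 for x in xs)
syy = sum((y - y_mean) ** 2 for y in ys)  # total variation in Y
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

# Least-squares regression line: y = intercept + slope * x.
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Variation in Y that the line leaves unexplained (sum of squared residuals).
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

r = sxy / (sxx * syy) ** 0.5
explained_share = 1 - ss_residual / syy

print(round(r ** 2, 4), round(explained_share, 4))
```

For a simple regression with one X variable, the two printed numbers agree: squaring r gives the same value as one minus the unexplained share of the variation in Y.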

More on r² in Coefficient of determination, r².

## Correlation coefficient in MS Excel

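To calculate the correlation coefficient in Excel, the most direct route is the =CORREL function, which returns r including its sign. You can also take the square root (=SQRT) of the value calculated with the formula =RSQ, but a square root is never negative, so the sign of r must then be taken from the slope of the regression line. Another option is to run the regression analysis via Data >> Data Analysis >> Regression.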
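One caveat with the square-root route is easiest to see on data that slopes downward. The sketch below is plain Python, not Excel; the datapoints are hypothetical and the variable names are illustrative. It computes r² first, takes the square root, and then restores the sign from the slope.

```python
import math

# Hypothetical data with a downward trend (illustrative only).
xs = [1, 2, 3, 4]
ys = [8, 6, 5, 1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

sxx = sum((x - x_mean) ** 2 for x in xs)
syy = sum((y - y_mean) ** 2 for y in ys)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

r_squared = sxy ** 2 / (sxx * syy)  # the quantity an r-squared function reports
unsigned_r = math.sqrt(r_squared)   # never negative, so the direction is lost

slope = sxy / sxx                   # negative for a downward-sloping line
r = math.copysign(unsigned_r, slope)

print(round(unsigned_r, 2), round(r, 2))
```

The square root alone reports a positive value even though Y falls as X rises; copying the sign of the slope gives the correlation coefficient with the correct direction.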


#### Carsten Grube

Freelance Data Analyst

