# Correlation coefficient

The correlation coefficient describes **how well the regression line fits** the given datapoints between X and Y. The correlation coefficient is denoted by *r*. The closer *r* is to 1 or to -1, the better the fit of the line.

*r* expresses the strength of the regression line

The regression line is the best possible fit to the datapoints. But that doesn’t mean that it is a good fit. It’s only the “best possible”. As described in Scatter plots, the fit can be weak or strong, or anywhere in between. The correlation coefficient (r) describes this degree of strength in the line.

Here are some eyeballed **examples explaining the correlation coefficient (r)**:


## Calculating the correlation coefficient

The formula for calculating the correlation coefficient:
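In symbols, one standard way to write the sample correlation coefficient is the following. The notation is assumed for this sketch: $n$ datapoints, sample means $\bar{x}$ and $\bar{y}$, and sample standard deviations $s_x$ and $s_y$.

$$
r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)
$$

Each bracket is the Z-score of one datapoint, and $n - 1$ is the degrees of freedom.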

Subtracting the mean from each datapoint and dividing by the standard deviation gives us the Z-score. So, what this formula says is: multiply the Z-score for X by the Z-score for Y for each datapoint, add up the products, and divide by the degrees of freedom (df = n - 1), which ties the result to the sample size.

This can also be written as the sum of Z-score_{x} times Z-score_{y}, divided by df.

For the sake of learning by doing, I will use an example with 4 datapoints:

Visualizing the datapoints and the regression line in a scatterplot:

Now, let’s plug our values into the formula:
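The same calculation can be sketched in a few lines of Python. The four datapoints below are hypothetical values chosen for illustration (they are an assumption, not necessarily the table used in this article), and `correlation` is a helper name made up for this sketch.

```python
from statistics import mean, stdev

# Hypothetical datapoints for illustration; not necessarily the article's example.
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]


def correlation(xs, ys):
    """Sample correlation r: sum of Z-score products divided by df = n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    s_x, s_y = stdev(xs), stdev(ys)
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)


r = correlation(xs, ys)
print(round(r, 2))  # 0.94 for these hypothetical points
```

Working through the Z-scores one list at a time mirrors the manual steps: standardize X, standardize Y, multiply pairwise, sum, and divide by the degrees of freedom.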

In a **spreadsheet** it could be set up as follows:

Concluding on the correlation coefficient, r

So, we get a correlation coefficient of 0.94, which is “very” close to +1, so we would conclude that the line has a “very strong” fit. As we recall from Scatter plots, the line can also be described as *positive* and *linear*.

## Correlation coefficient *(r)* vs coefficient of determination *(r²)*

The correlation coefficient (r) and the coefficient of determination (r²) are closely related: as the notation itself states, r² is simply r squared. Whereas r expresses the degree of strength in the linear association between X and Y, **r² expresses the percentage, or proportion, of the variation in Y that can be explained by the variation in X**.

In our 4 datapoint mini example above we had the following results for r and r²:

- r = 0.94
- r² = 0.89 (squaring the rounded value gives 0.94² ≈ 0.88; 0.89 is consistent with squaring r before rounding)

As mentioned in the conclusion, r = 0.94 expresses that we have a line with a very strong fit, and r² = 0.89 means that **89% of the variation in Y can be explained by the variation in X**.

In other words, the regression line in our example has a very strong fit in which 89% of the variation in Y can be explained by the variation in X.
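To see why r² can be read as a share of variation, it may help to compute it two ways: as r squared, and as 1 minus the unexplained (residual) variation divided by the total variation in Y. The sketch below does both for a least-squares line. The four datapoints are hypothetical illustration values (an assumption, not necessarily this article's table).

```python
from statistics import mean

# Hypothetical datapoints for illustration; not necessarily the article's example.
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation in Y

# Least-squares regression line: y_hat = intercept + slope * x
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Variation in Y that the line does NOT explain (sum of squared residuals).
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

explained_share = 1 - ss_res / s_yy
r = s_xy / (s_xx * s_yy) ** 0.5

print(round(explained_share, 2), round(r ** 2, 2))  # 0.89 0.89
```

For simple linear regression with one X variable the two numbers agree, which is what justifies the "proportion of variation explained" reading of r².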

More on r² in Coefficient of determination, r²

## Correlation coefficient in MS Excel

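To calculate the correlation coefficient in Excel you can use the formula **=CORREL** (or **=PEARSON**), which returns r directly. Alternatively, you can take the square root (=SQRT) of the value calculated with the formula **=RSQ**; this gives only the size of r, so for a downward-sloping line you have to add the minus sign yourself. Another option is to run the regression analysis via **Data >> Data Analysis >> Regression**.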
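One thing to watch with the square-root route: r² is never negative, so its square root always comes out positive, even when the line slopes downward. The Python sketch below illustrates this with hypothetical, downward-trending datapoints; `pearson_r` is a helper name made up for this example.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Correlation coefficient from sums of squared and cross deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical datapoints with a downward trend (illustration only).
xs = [1, 2, 3, 4]
ys = [8, 6, 5, 1]

r = pearson_r(xs, ys)
r_from_rsq = sqrt(r ** 2)  # what taking the square root of r squared yields

print(r < 0, r_from_rsq > 0)  # True True: the sign of r is lost
```

So when r is recovered from r², the sign has to be taken from the slope of the regression line.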


## Learning resources on correlation coefficient

I have found these resources helpful for learning about the correlation coefficient:

- Khan Academy (video 7:20): Correlation coefficient intuition
- Khan Academy (video 12:21): Calculating correlation coefficient
- ThoughtCo (text page): Calculating the Correlation Coefficient

#### Carsten Grube

Freelance Data Analyst
