# Confidence intervals

*“We can be 95% confident that the true population parameter lies within this interval.”* This could be a conclusion drawn from a confidence interval. Confidence intervals give an idea of the true value of a population parameter based on our sample data. A confidence interval is a plausible range of values for a population parameter and expresses how confident we can be that the parameter is included in the interval.

## The basic idea of confidence intervals

A confidence interval is a range of values within which we are confident, to a certain degree, that the population parameter is expected to fall, based on our sample results.

Say we construct a 95% confidence interval for a sample mean (x̄) of 120 and calculate an interval of 115 to 125. This means that **we can be 95% confident that the population mean (µ) is somewhere between 115 and 125**.

We can calculate confidence intervals for population means and for population proportions. For population means we usually apply t-statistics, as we calculate with the sample standard deviation (s). For population proportions we can calculate the standard deviation (σ) from the proportion itself, and therefore we apply z-statistics.

The more “secure”, or confident, we wish to be, the wider the interval:

Let’s take the example from above with a 95% confidence interval of 115 to 125. If we wished to be 99% confident, the interval could widen to, for example, 113 to 127.
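The widening can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the 115 to 125 interval is treated as a z-based interval, so that its standard error can be backed out from the half-width of 5. The function name `interval_for_level` is hypothetical, and the critical values come from the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

SAMPLE_MEAN = 120.0
HALF_WIDTH_95 = 5.0  # half-width of the 95% interval 115 to 125

# Assumption: the 95% interval is z-based, so SE = half-width / z(0.975).
STANDARD_ERROR = HALF_WIDTH_95 / NormalDist().inv_cdf(0.975)


def interval_for_level(level):
    """Return (lower, upper) for a confidence level such as 0.99."""
    alpha = 1 - level
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * STANDARD_ERROR
    return SAMPLE_MEAN - margin, SAMPLE_MEAN + margin


for level in (0.90, 0.95, 0.99):
    lower, upper = interval_for_level(level)
    print(f"{level:.0%}: {lower:.1f} to {upper:.1f}")
```

Under this assumption the 90% interval is narrower than 115 to 125, and the 99% interval comes out at roughly 113.4 to 126.6.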

## What confidence intervals express and what they don’t express

How confidence intervals are interpreted and communicated can be a little subtle. Let’s look at the difference between correct and incorrect interpretations of a 95% confidence interval:

### Correct interpretations

- We are 95% confident that the true parameter lies within this confidence interval.
- We are 95% confident that the confidence interval contains the parameter.
- We are 95% confident that the mean is X with a margin of error of Y.
- With repeated sampling, the calculated intervals capture the population mean in about 95% of the samples.
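The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: the population mean of 120 and standard deviation of 10 are hypothetical values, the function name `coverage` is made up for this example, and the t-score of 2.086 is the t-table value for a sample of 21 (df = 20) at a 95% confidence level.

```python
import math
import random
import statistics

TRUE_MEAN = 120.0   # hypothetical population mean
TRUE_SD = 10.0      # hypothetical population standard deviation
SAMPLE_SIZE = 21
T_CRIT = 2.086      # t-table value for df = 20 at a 95% confidence level


def coverage(trials, seed=1):
    """Fraction of simulated 95% intervals that contain TRUE_MEAN."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
        mean = statistics.mean(sample)
        margin = T_CRIT * statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
        if mean - margin <= TRUE_MEAN <= mean + margin:
            hits += 1
    return hits / trials


print(f"Share of intervals capturing the mean: {coverage(2000):.3f}")
```

Each simulated sample produces a different interval, and the printed share should land close to 0.95. It is the procedure that succeeds about 95% of the time, not any single interval.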

### Incorrect interpretations

- This confidence interval will occur in 95% of the trials.
- There is a 95% probability that µ is contained in our interval.
- Repeated sampling will result in a sample mean that lies within our confidence interval about 95% of the time.
- Applying the Empirical Rule, a 95% confidence interval can be calculated and visualized in the normal density curve as the values within +/- 2 standard deviations of the mean.
- Confidence intervals apply to the studied population.

Regarding the last incorrect interpretation, **confidence intervals applying to the studied population**: this is not true, as we study a sample and not a whole population. We study all data from the sample, but not all data from the population. **So, confidence intervals do not apply to the studied population, but to the sample.**

## Point estimate not useful in itself

“I have an average of 25 minutes to work every morning”, or “Rafael Nadal wins 87% of all his ATP tournament matches”, or “the average time for running a certain procedure in our production line is 40 minutes”. We often hear these kinds of expressions in everyday life. **In statistics these values are referred to as point estimates.**

Referring to the example above, say we run a certain procedure daily and wish to estimate its average duration. We know that the point estimate for the mean time is **40 minutes**. But this point estimate does not explain the spread in the data. **Is it 10 minutes one day and 70 minutes another day?** The point estimate does not tell us.

Say we get a 95% confidence interval from 38 to 42. That means that **we can be 95% confident that our true mean lies in the interval of 38-42 minutes**.

## Confidence ≠ Sureness

Confidence intervals are used to estimate a parameter of a larger population that we cannot measure directly. We use sample statistics to estimate it, and the essence of confidence relates to the fact that we are only estimating, not measuring, the true population value.

In other words, we cannot be completely sure that we really do “catch” the true mean in our interval. Therefore, we use the term *confidence*, as **we can only have some degree of confidence, not of sureness.**

## Conditions for valid confidence intervals

The following three conditions apply to both confidence intervals for means and confidence intervals for proportions, and are detailed in each of those chapters. Here are the overall conditions:

- Simple random sample
- Normally distributed
- Independent sample

## Calculation of confidence interval

A confidence interval places an interval around the mean. The interval extends equally far on both sides of the mean, giving a lower and an upper limit. For example, if our mean is 10 and our margin of error is 2, we get an upper limit of 12 and a lower limit of 8.

The value that is added to and subtracted from the mean is the **margin of error (ME)**, so, in short, the confidence interval is:

$$\bar{x} \pm ME$$

And the formula explained:

$$\bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$

So, alpha, t-score, standard error and margin of error are the bricks in the construction of the confidence interval formula:

### Alpha (α)

Statistics often works with confidence levels of 90%, 95% and 99%. For example, the pharma industry typically works with a 99% level. Alpha (α) is the area outside of the confidence interval. **If the confidence level is 95%, alpha is 5%.** Alpha is usually denoted as a proportion, 0.05, so alpha = 1 − confidence level.

**The α/2 expresses** that the critical value (the z- or t-score) is looked up for the alpha value divided by two, as alpha is split between an upper and a lower tail. Thus, an alpha level of 0.05 becomes 0.025 on each side.
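The relationship between confidence level, alpha and α/2 can be sketched as follows. This is a minimal illustration using z critical values from the standard library's `statistics.NormalDist`; the function name `two_sided_z` is hypothetical.

```python
from statistics import NormalDist


def two_sided_z(confidence_level):
    """Return (alpha, alpha / 2, z critical value) for a two-sided interval."""
    alpha = 1 - confidence_level
    tail = alpha / 2  # area in each of the two tails
    z_crit = NormalDist().inv_cdf(1 - tail)
    return alpha, tail, z_crit


for level in (0.90, 0.95, 0.99):
    alpha, tail, z_crit = two_sided_z(level)
    print(f"level={level:.2f}  alpha={alpha:.2f}  alpha/2={tail:.3f}  z={z_crit:.3f}")
```

For the 95% level this gives alpha = 0.05, 0.025 in each tail, and the familiar z-score of about 1.96.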

### t-score table

**σ is almost always unknown in practice**, and we therefore mostly apply the t-table when calculating intervals for population means.

For the sake of the exercise, I will look up the value in the **t-score table**, although statistical software packages have the probability tables embedded.

Alpha (α) is divided by 2 because we have two critical values: the upper and the lower. Say we want to find the t-score for a sample size of 21 at a 95% confidence level:

In the *df* (degrees of freedom) column we look at 20 (= n − 1). We are constructing a 95% confidence interval at α = 0.05, so we follow the 0.025 column and find our t-score value of 2.086. The corresponding value in the z-table is 1.96, showing that the t-score widens the interval compared to the z-score.
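The table lookup can also be reproduced numerically. The sketch below avoids third-party packages by integrating the t density with Simpson's rule and finding the critical value by bisection. It is an illustration rather than a production routine (in practice a statistics library such as SciPy, via `scipy.stats.t.ppf`, would normally be used), and the function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on the interval [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(confidence_level, n):
    """Two-sided critical value t(alpha/2) for a sample of size n."""
    df = n - 1
    target = 1 - (1 - confidence_level) / 2  # e.g. 0.975 for a 95% level
    low, high = 0.0, 100.0
    for _ in range(60):  # bisection on the CDF
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


print(round(t_critical(0.95, 21), 3))  # 2.086, as in the t-table
```

The bisection bracket of 0 to 100 is an assumption that covers common confidence levels and sample sizes; very small samples at very high confidence levels would need a wider bracket.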

### Standard error

**The standard error of the mean** is the sample standard deviation seen in relation to the sample size. It is calculated as the sample standard deviation divided by the square root of the sample size:

$$SE = \frac{s}{\sqrt{n}}$$
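As a small sketch, the standard error of the mean is the sample standard deviation divided by the square root of the sample size. The `durations` list below is made-up data for illustration only, and `standard_error` is a hypothetical helper name.

```python
import math
import statistics


def standard_error(sample):
    """Sample standard deviation divided by the square root of n."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical procedure durations in minutes, for illustration only.
durations = [38, 41, 45, 36, 40, 43, 39, 42, 37, 39]
print(f"s = {statistics.stdev(durations):.2f}, SE = {standard_error(durations):.2f}")
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.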

## Visualizing confidence intervals in bell curve

As mentioned above (under **‘Incorrect interpretations’**), it might be tempting to say that the confidence interval can be visualized directly from the normal bell curve based on the Empirical Rule, for example that a 95% confidence interval covers the values within +/- 2 standard deviations of the mean.

**This is incorrect**.

If the confidence interval were to be visualized on a normal bell curve, it would look something like this:

[Figure: confidence interval visualized on a normal bell curve]

## Formula for confidence interval

Now that we have our margin of error, we can complete the formula for the confidence interval:

$$\bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$

Say we are going to construct a **95% confidence interval** for the production procedure example mentioned above. We have collected a sample of **30 observations** during the past 30 production days. Sample mean **(x̄) = 40**. The standard deviation **(*s*) = 7**. Assuming that the observations are normally distributed, and using the t-score of 2.045 for df = 29, we can now calculate our 95% confidence interval:

$$40 \pm 2.045 \cdot \frac{7}{\sqrt{30}} \approx 40 \pm 2.61$$

Our margin of error is **2.61**, which is subtracted from our mean to find the lower limit and added to the mean to find the upper limit. We then get a confidence interval of **37.4 to 42.6**. The result can be expressed as: *“We can feel 95% confident that the true population mean lies within the interval of 37.4 to 42.6.”*
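The production example can be recomputed step by step. This is a sketch with the t-score hard-coded from the t-table (2.045 for df = 29 at a 95% confidence level); the variable names are chosen for illustration.

```python
import math

n = 30            # observations
sample_mean = 40  # minutes
s = 7             # sample standard deviation, minutes
t_crit = 2.045    # t-table value for df = n - 1 = 29 at a 95% confidence level

standard_error = s / math.sqrt(n)
margin_of_error = t_crit * standard_error  # about 2.61
lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

print(f"SE = {standard_error:.3f}")
print(f"ME = {margin_of_error:.2f}")
print(f"95% CI = {lower:.2f} to {upper:.2f}")
```

Keeping the standard error and the margin of error as separate steps mirrors the building bricks of the formula, so each intermediate value can be checked against a hand calculation.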

## Confidence intervals in MS Excel

The Excel function **=CONFIDENCE.T** can be used to calculate the margin of error of a confidence interval for a mean following a t distribution with unknown σ. The **=CONFIDENCE.NORM** function can be used for z-intervals, which are highly unusual because **σ is almost always unknown**.

The following screenshot **also shows the difference between using the z-table and the t-table**, as the t-statistic returns a greater ME, and thereby a wider interval, than the z-statistic:

[Screenshot: CONFIDENCE.T and CONFIDENCE.NORM compared in Excel]
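The same comparison can be sketched outside Excel. The numbers below reuse the production example (s = 7, n = 30, α = 0.05) and correspond to `=CONFIDENCE.T(0.05, 7, 30)` and `=CONFIDENCE.NORM(0.05, 7, 30)`. Treating 7 as a known σ in the z-version is an assumption made purely for the comparison, and the t-score is taken from the t-table, so the result may differ from Excel in the third decimal.

```python
import math
from statistics import NormalDist

alpha = 0.05
s = 7           # standard deviation (treated as sigma in the z-version)
n = 30
t_crit = 2.045  # t-table value for df = 29, alpha/2 = 0.025

standard_error = s / math.sqrt(n)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

me_t = t_crit * standard_error  # counterpart of CONFIDENCE.T
me_z = z_crit * standard_error  # counterpart of CONFIDENCE.NORM

print(f"ME with t: {me_t:.2f}")
print(f"ME with z: {me_z:.2f}")
```

The t-based margin is the larger of the two, which is why the t-interval is wider than the z-interval for the same data.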

## Learning statistics

My favorite resources for learning about confidence intervals:

- Jbstatistics video: Introduction to confidence intervals
- Khan Academy video: Confidence intervals and margin of error
- Simulator by Charlotte Allen on Khan Academy: https://www.khanacademy.org/computer-programming/confidence-intervals-about-a-proportion/6167177771548672
- ModernDive by Chester Ismay and Albert Y. Kim (Bookdown format): Statistical Inference via Data Science
- Wolfram Mathematica: Short demo of their confidence interval simulator with link to the actual simulator: Confidence Intervals: Confidence Level, Sample Size, and Margin of Error

#### Carsten Grube

Freelance Data Analyst
