+34 616 71 29 85 carsten@dataz4s.com

The normal distribution

The normal distribution is widely used in statistics and probability theory because so many real-life datasets are approximately normally distributed. It describes continuous random variables that can be measured, like weight, length, mass and others that can have an unlimited number of decimals.

 

Even discrete variables become normally distributed

Continuous variables are in this way the opposite of discrete variables, which have a fixed set of outcomes like “Yes” and “No”, heads or tails, or a number of people.

Even for data that follow a discrete distribution, sums and means of samples become approximately normally distributed when the sample sizes are sufficiently large. This is explained by the Central Limit Theorem. For example, a binomial distribution is closely approximated by a normal distribution when the number of trials is increased to, say, 10,000.
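The binomial example can be sketched in R. This is a minimal illustration, not a proof: n = 10,000 comes from the text, while the success probability p = 0.3 and the cut-off 2950 are values assumed here purely for illustration.

```r
# n is taken from the text; p and the cut-off are illustrative assumptions
n <- 10000
p <- 0.3
cutoff <- 2950

mu    <- n * p                   # mean of the binomial distribution
sigma <- sqrt(n * p * (1 - p))   # standard deviation of the binomial distribution

# Exact P(X <= cutoff) from the binomial distribution
exact <- pbinom(cutoff, size = n, prob = p)

# Normal approximation with a continuity correction of 0.5
approx_p <- pnorm(cutoff + 0.5, mean = mu, sd = sigma)

print(c(exact = exact, normal_approx = approx_p))
```

The two probabilities should agree to several decimals, which is what the Central Limit Theorem leads us to expect for such a large number of trials.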

 

 

5 characteristics

The normal distribution bell curve is symmetric around the mean, the median and the mode, all three of which are located at the top point of the curve.

[Figure: Normal distribution bell curve]

 

As the curve is symmetric, the center of the curve splits the data into two equal areas. The total area under the normal curve = 1.0, or 100%.

In a perfect normal distribution mean = median = mode. More on mean, median and mode.

The normal distribution is also called the Gaussian, the Gauss or the Laplace-Gauss distribution, and is even referred to as the bell curve because the perfect normal distribution can be visualized by the symmetrical bell curve. Other distributions, e.g. Student’s t-distribution, also have bell curves.

In most real-life datasets, the bell curve will not be 100% symmetric, as this would take an infinitely large sample size, but the idealized bell curve can help explain, calculate and visualize data.

 

Summarizing characteristics:

  1. Symmetric around its mean
  2. Mean = median = mode
  3. The area under the curve = 1.0.
  4. It has a high peak and light tails compared to the t-distribution
  5. It is defined by the mean (μ) and the variance (σ²).
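A few of these characteristics can be checked numerically in R. The sketch below uses the standard normal distribution (mean 0, standard deviation 1) and an arbitrary distance of 1.3 from the mean; both are choices made here for illustration.

```r
# Characteristic 3: the total area under the standard normal curve
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Characteristic 1: the density is the same at equal distances on each side of the mean
d <- 1.3                      # arbitrary distance from the mean
left_density  <- dnorm(-d)
right_density <- dnorm(d)

# Characteristic 2: half of the area lies below the mean, so the mean is also the median
area_below_mean <- pnorm(0)

print(c(total_area = total_area, area_below_mean = area_below_mean))
print(c(left = left_density, right = right_density))
```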

 

As explained by the Empirical Rule, approximately 68% of the area of the normal distribution is within one standard deviation of the mean, approximately 95% is within two standard deviations, and approximately 99.7% is within three:

 

[Figure: The empirical rule]
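The percentages of the Empirical Rule can be reproduced with pnorm. A short sketch for the standard normal distribution (the same areas hold for any mean and standard deviation):

```r
# Area within k standard deviations of the mean, for k = 1, 2, 3
k <- 1:3
area <- pnorm(k) - pnorm(-k)

round(area, 4)
## [1] 0.6827 0.9545 0.9973
```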

 

Parameters and estimators

The population parameters and the sample estimators in the normal distribution are:

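  • Mean: μ (population parameter), x̄ (sample estimator)
  • Variance: σ² (population parameter), s² (sample estimator)
  • Standard deviation: σ (population parameter), s (sample estimator)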

 

 

 

The higher the n, the “peakier”

The variance, and thus the standard deviation, defines the spread, or dispersion, of the data. For the distribution of the sample mean, the greater the n, the “peakier” the curve becomes and the lighter the tails, as a larger proportion of the sample means will be centered around the population mean.

 

 

[Figure: The normal distribution]

 

 

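This can also be deduced from the formula for the variance of the sample mean:

Var(x̄) = σ² / n

The higher the sample size (n), the greater the denominator and thus the smaller the variance of the sample mean. Note that the sample variance itself, s² = Σ(xᵢ − x̄)² / (n − 1), does not shrink as n grows: it is an estimate of σ², and its numerator grows along with its denominator.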
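How the spread of the sample mean shrinks as n grows can be sketched by simulation. The population values (mean 65, standard deviation 4) are borrowed from the R examples on this page; the seed, the sample sizes and the 2,000 repetitions are arbitrary choices for illustration.

```r
set.seed(1)                  # arbitrary seed, only for reproducibility
sigma <- 4
sizes <- c(10, 100, 1000)    # illustrative sample sizes

# Theoretical standard error of the mean: sigma / sqrt(n)
theoretical_se <- sigma / sqrt(sizes)

# Empirical counterpart: standard deviation of 2,000 simulated sample means
empirical_se <- sapply(sizes, function(n) {
  sample_means <- replicate(2000, mean(rnorm(n, mean = 65, sd = sigma)))
  sd(sample_means)
})

print(data.frame(n = sizes, theoretical_se, empirical_se))
```

Both columns should fall by roughly a factor of sqrt(10) each time n is multiplied by 10.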

 

Problems that follow the normal distribution

The normal distribution is a continuous probability distribution where X is a random variable that is measured rather than counted.

Examples of measurement data that can follow a normal distribution:

  • height
  • weight
  • length
  • mass
  • time
  • amounts

As described in continuous vs discrete data, continuous variables are the ones that can theoretically take infinitely many values, with an unlimited number of decimals.

Questions that can be answered with the help of the normal distribution:

  • What is the mean height of Danish 2-year-old toddlers?
  • What’s the probability of finding a Danish 2-year-old toddler taller than x?
  • At what time do Oracle sales staff arrive at their workplace?
  • What’s the average price of diesel today in Malaga?
  • What is the value of Google stock prices over the last 5 years?

 

Calculating probabilities in the normal distribution

Say that we want to calculate a probability for a variable X that follows a normal distribution with known µ and σ. To do so, we can standardize a value x of X with the z-score formula:

z = (x − µ) / σ

 

Usually, the population standard deviation (σ) is unknown. In these cases, the difference between the sample mean and the population mean is divided by the standard error (SE), which is s/√n.

Example: Say we have a normally distributed population with a mean of 50 and a known standard deviation of 8, and we wish to know the probability of randomly selecting a value less than 40.

We would calculate the z-score: (40 − 50)/8 = −1.25. The corresponding cumulative probability for z = −1.25 is 0.1056, which means that there is a 10.56% chance that we will randomly pick a value of 40 or less.
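The same example can be checked in R. A minimal sketch, with variable names chosen here for illustration:

```r
x     <- 40   # value of interest
mu    <- 50   # mean
sigma <- 8    # known standard deviation

# Standardize and look up the lower-tail probability of the standard normal
z <- (x - mu) / sigma
p_below <- pnorm(z)

z
## [1] -1.25
p_below
## [1] 0.1056498

# Equivalent call without standardizing by hand
pnorm(x, mean = mu, sd = sigma)
## [1] 0.1056498
```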

 

Normal distribution vs t-distribution

For sample distributions with unknown σ, the statistician might apply Student’s t-distribution for n < 30. For n > 30 with unknown σ, it is commonly taught that the normal distribution can be applied as an approximation of the t-distribution.

The reason, again, is that the greater the n, the more closely the t-distribution approximates the normal distribution. However, statisticians commonly recommend that Student’s t-distribution be applied whenever σ is unknown.

Let’s see an example of the difference between using the t-distribution and the normal distribution. As illustrated below, a 95% confidence interval for a population mean with known σ uses a z-value of 1.96 from the normal distribution, whereas the t-table shows that the corresponding value in the t-distribution, used when σ is unknown, is greater than 1.96:

 

[Figure: The normal distribution vs the t-distribution]
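The comparison of critical values can be sketched in R with qnorm and qt. The degrees of freedom below are illustrative choices, not values taken from the figure.

```r
# Two-sided 95% confidence level: 2.5% in each tail
z_crit <- qnorm(0.975)

# Corresponding t critical values for a few illustrative degrees of freedom
df <- c(5, 10, 29, 100)
t_crit <- qt(0.975, df = df)

round(z_crit, 2)
## [1] 1.96

print(data.frame(df = df, t_crit = round(t_crit, 3)))
```

Every t-value is greater than 1.96, and the gap narrows as the degrees of freedom increase.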

 

 

2 common Excel functions

The following Excel formulas can be used to calculate critical values and p-values in the normal distribution:

 

Finding critical Z-values with =NORM.S.INV (when sigma known):

Lower left-tailed tests: =NORM.S.INV(α)

Upper right-tailed tests: =NORM.S.INV(1-α)

Two-tailed tests: =NORM.S.INV(α/2) for the lower critical value and =NORM.S.INV(1-α/2) for the upper critical value

 

Finding p-values with =NORM.S.DIST (when sigma known):

Lower left-tailed tests: =NORM.S.DIST(z,TRUE)

Upper right-tailed tests: =1-NORM.S.DIST(z,TRUE)

Two-tailed tests: =2*(1-NORM.S.DIST(ABS(z),TRUE))
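As a cross-check, the same critical values and p-values can be computed in R with qnorm and pnorm. In this sketch, α = 0.05 and z = 1.5 are illustrative values, and the variable names are chosen here for readability.

```r
alpha <- 0.05   # illustrative significance level
z     <- 1.5    # illustrative test statistic

# Critical z-values
crit_left  <- qnorm(alpha)           # left-tailed test
crit_right <- qnorm(1 - alpha)       # right-tailed test
crit_two   <- qnorm(1 - alpha / 2)   # two-tailed test: use +/- this value

# p-values
p_left  <- pnorm(z)                                # left-tailed test
p_right <- pnorm(z, lower.tail = FALSE)            # right-tailed test
p_two   <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-tailed test

print(round(c(crit_left = crit_left, crit_right = crit_right, crit_two = crit_two), 4))
print(round(c(p_left = p_left, p_right = p_right, p_two = p_two), 4))
```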

 

 

4 normal distribution functions in R

Below, you will see a few simple examples of calculating probabilities, percentiles and taking random samples from a normally distributed variable with the R functions:

  • pnorm gives the distribution function (calculates probabilities)
  • qnorm gives the quantile function (calculates quantiles or percentiles)
  • dnorm gives the density (find and/or plot the probability density)
  • rnorm generates random deviates (can generate random samples)

On the page The normal distribution in R, you will find a more detailed description and worked examples.

Let’s calculate a few examples based on X following a normal distribution with a mean of 65 and a standard deviation of 4:

pnorm

The pnorm command can be used to calculate probabilities for a normal random variable:

# P(X <= 60), written out with named arguments

pnorm(q = 60, mean = 65, sd = 4, lower.tail = TRUE)

## [1] 0.1056498

# The same calculation in shorter form

pnorm(60, 65, 4)

## [1] 0.1056498

# P(X >= 75), written out with named arguments

pnorm(q = 75, mean = 65, sd = 4, lower.tail = FALSE)

## [1] 0.006209665

# The same calculation in shorter form

pnorm(75, 65, 4, lower.tail = FALSE)

## [1] 0.006209665

pnorm can also be used to calculate probabilities for Z, the standard normal variable:

# P(Z >= 1.5)

pnorm(q = 1.5, mean = 0, sd = 1, lower.tail = FALSE)

## [1] 0.0668072

# The same calculation using the defaults mean = 0 and sd = 1

pnorm(1.5, lower.tail = FALSE)

## [1] 0.0668072

 

qnorm

The qnorm function can be used to calculate quantiles or percentiles for a normal random variable:

# Find the first quartile (Q1); note that this example uses mean = 75 and sd = 5

qnorm(p = 0.25, mean = 75, sd = 5, lower.tail = TRUE)

## [1] 71.62755
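Since qnorm is the inverse of pnorm, the result can be verified by feeding the quartile back into pnorm. A short sketch using the same mean and standard deviation as the qnorm example:

```r
# First quartile of a normal distribution with mean 75 and sd 5
q1 <- qnorm(p = 0.25, mean = 75, sd = 5)

# Feeding Q1 back into pnorm recovers the probability 0.25
p_check <- pnorm(q1, mean = 75, sd = 5)

q1
## [1] 71.62755
p_check
## [1] 0.25
```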

 

dnorm

The dnorm function can be used to find and/or plot the probability density function:

# First, we create a sequence and assign this to x

x <- seq(from = 50, to = 80, by = 0.25)

# Find the value of the probability density function for each of these x-values

dens <- dnorm(x, mean = 65, sd = 4)

plot(x, dens)

# Plot with a line, title and labels, turning the Y-axis values horizontal, and then add a vertical line at mu = 65:

plot(x, dens, type = "l", main = "Normal dist for X: Mean=65, s=4", xlab = "x", ylab = "Probability density", las = 1)

abline(v = 65)

 

[Figure: Bell curve in R]

 

rnorm

The rnorm function can be used to draw a random sample from a normally distributed population (note that this example uses sd = 5):

rand <- rnorm(n=40, mean=65, sd=5)

rand

##  [1] 56.08986 60.49351 67.56576 60.25123 64.75519 66.06198 65.65582

##  [8] 64.85890 78.45019 67.03261 70.61278 74.86143 68.81081 57.53910

## [15] 65.41073 64.67119 64.99964 71.95428 67.04310 63.48618 61.87884

## [22] 63.30553 64.21462 67.63769 62.24213 64.91980 68.55189 70.08164

## [29] 68.54631 56.43885 63.14660 65.24606 65.72192 65.55157 61.47583

## [36] 61.12970 68.43353 63.41661 67.44550 55.34121

Recall that even though the sample is drawn from a normally distributed population, the sample observations might not look normally distributed. This can be visualized in a histogram:

hist(rand)

 

[Figure: Histogram in R]

As mentioned, more details can be found on the page The normal distribution in R.
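To compare a sample with the population it was drawn from, the population density can be drawn on top of the histogram. A minimal sketch: the seed is an arbitrary choice made here so that the sample is reproducible, and it will therefore not reproduce the numbers printed above.

```r
set.seed(42)   # arbitrary seed, only for reproducibility

# Draw a sample of 40 from a normal population with mean 65 and sd 5
rand <- rnorm(n = 40, mean = 65, sd = 5)

# Histogram on the density scale, with the population density curve on top
hist(rand, freq = FALSE, main = "Sample of 40 vs. population density", xlab = "x")
curve(dnorm(x, mean = 65, sd = 5), add = TRUE)

# Sample estimates to compare with the population values 65 and 5
sample_stats <- c(mean = mean(rand), sd = sd(rand))
print(sample_stats)
```

With only 40 observations, the histogram and the sample estimates will typically deviate somewhat from the smooth population curve.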

 

Learning resources:

Khan Academy pages and videos:

Jeremy Balka’s (jbstatistics’) video:  An introduction to the Normal Distribution

 

Carsten Grube


Freelance Data Analyst
