+34 616 71 29 85 carsten@dataz4s.com

# The normal distribution

The normal distribution is widely used in statistics and probability theory because many real-life datasets are approximately normally distributed. It describes continuous random variables that are measured, such as weight, length, mass and others that can have an infinite number of decimals.

## Even discrete variables become normally distributed

Continuous variables contrast with discrete variables, which take a countable set of outcomes, such as “Yes” or “No”, heads or tails, or a number of people.

Even data that follow a discrete distribution can be approximated by the normal distribution when the sample size is sufficiently large. This is explained by the Central Limit Theorem: sums and means of many independent observations are approximately normally distributed. For example, the number of successes in a binomial experiment is approximately normal when the number of trials is increased to, say, 10,000.
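As a rough illustration of the binomial case, the sketch below compares an exact binomial probability with its normal approximation. The success probability of 0.5 and the cut-off of 5,050 successes are assumed values chosen only for this example.

```r
# Assumed illustration values: 10,000 trials with success probability 0.5
n <- 10000
p <- 0.5

# Mean and standard deviation of the binomial count
mu    <- n * p
sigma <- sqrt(n * p * (1 - p))

# Exact binomial probability P(X <= 5050)
exact <- pbinom(5050, size = n, prob = p)

# Normal approximation, using a continuity correction of 0.5
normal_approx <- pnorm(5050.5, mean = mu, sd = sigma)

# The two probabilities should be very close to each other
c(exact = exact, normal_approx = normal_approx)
```

With a small number of trials, or a success probability close to 0 or 1, the two values would be expected to differ more.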

## 5 characteristics

The bell curve of the normal distribution is symmetric around its center, where the mean, the median and the mode coincide at the highest point of the curve. Because the curve is symmetric, the center splits the data into two equal areas. The total area under the normal curve is 1.0, or 100%.

In a perfect normal distribution mean = median = mode. More on mean, median and mode.

The normal distribution is also called the Gaussian, the Gauss or the Laplace-Gauss distribution. It is also referred to as the bell curve, because the perfect normal distribution can be visualized as a symmetrical bell-shaped curve. Other distributions, such as Student’s t-distribution, also have bell-shaped curves.

In most real-life datasets, the bell curve will not be 100% symmetric, as this would require an infinitely large sample size, but the idealized bell curve can help explain, calculate and visualize data.

### Summarizing characteristics:

1. Symmetric around its mean
2. Mean = median = mode
3. The area under the curve = 1.0
4. It has a high peak and light tails compared to the t-distribution
5. It is defined by the mean (μ) and the variance (σ²)
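The first three characteristics can be checked numerically. The following is a minimal sketch, assuming a mean of 65 and a standard deviation of 4 purely for illustration; the area is integrated over ±10 standard deviations, which captures essentially all of it.

```r
# Assumed illustration values
mu    <- 65
sigma <- 4

# Area under the density between mu - 10*sigma and mu + 10*sigma
area <- integrate(dnorm, lower = mu - 10 * sigma, upper = mu + 10 * sigma,
                  mean = mu, sd = sigma)$value

# Symmetry: the density is the same at equal distances on both sides of the mean
left  <- dnorm(mu - 3, mean = mu, sd = sigma)
right <- dnorm(mu + 3, mean = mu, sd = sigma)

# Median: the 50th percentile
med <- qnorm(0.5, mean = mu, sd = sigma)

# Mode: the x-value where the density is highest (found numerically)
mode_x <- optimize(dnorm, interval = c(mu - 20, mu + 20),
                   mean = mu, sd = sigma, maximum = TRUE)$maximum

c(area = area, left = left, right = right, median = med, mode = mode_x)
```

The median equals the mean, and the numerically located mode should agree with both up to the tolerance of the optimizer.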

As explained by the Empirical Rule, approximately 68% of the area of the normal distribution is within one standard deviation of the mean, approximately 95% is within two standard deviations of the mean and approximately 99.7% is within three standard deviations of the mean.

## Parameters and estimators

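The population parameters and the sample estimators in the normal distribution are:

• the population mean (μ), estimated by the sample mean (x̄)
• the population variance (σ²), estimated by the sample variance (s²)
• the population standard deviation (σ), estimated by the sample standard deviation (s)

## The higher the n, the “peakier”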

The variance, and with it the standard deviation, defines the spread, or dispersion, of the data. The smaller the variance, the “peakier” the curve becomes and the lighter the tails, as a larger proportion of the data is centered around the mean. The sample size matters for the distribution of the sample mean: its standard deviation, the standard error, is σ/√n, so the higher the n, the smaller the spread of the sample mean and the peakier its curve. Note that the sample variance itself, s² = Σ(xᵢ − x̄)²/(n − 1), does not shrink as n grows, because the sum in the numerator grows along with the denominator; s² instead settles toward σ².
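As a minimal sketch of how the spread shrinks with n, the snippet below computes the standard error σ/√n of the sample mean for a few sample sizes, together with the height of the corresponding normal curve at its center. The value σ = 4 and the sample sizes are assumptions made for illustration only.

```r
# Assumed population standard deviation and sample sizes
sigma <- 4
n     <- c(1, 4, 16, 64)

# Standard error of the sample mean for each sample size
se <- sigma / sqrt(n)

# Height of the normal density at its center when the spread is se:
# the smaller the spread, the higher the peak
peak <- dnorm(0, mean = 0, sd = se)

data.frame(n = n, se = se, peak = peak)
```

Each fourfold increase in n halves the standard error and doubles the height of the peak.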

## Problems that follow the normal distribution

The normal distribution is a continuous probability distribution, so the random variable X is measured rather than counted.

Examples of measured data that can follow a normal distribution:

• height
• weight
• length
• mass
• time
• amounts

As described in continuous vs discrete data, continuous variables are the ones that can theoretically take infinitely many values within a range, for example with an infinite number of decimals.

Questions that can be answered using the normal distribution:

• What is the mean height of Danish 2-year-old toddlers?
• What’s the probability of finding a Danish 2-year-old toddler taller than x?
• At what time do Oracle sales staff arrive at their workplace?
• What’s the average price of diesel today in Malaga?
• What is the value of Google stock prices over the last 5 years?

## Calculating probabilities in the normal distribution

Say that we are to calculate a probability for a variable X that follows a normal distribution with known µ and σ. To calculate the probability, we can apply the z-score formula:

$$z = \frac{x - \mu}{\sigma}$$

Usually, the population standard deviation (σ) is unknown. In these cases, the difference between the sample mean and the population mean is divided by the standard error (SE), which is $s/\sqrt{n}$.

Example: Say a normally distributed population has a mean of 50 and a known standard deviation of 8, and we wish to know the probability of randomly selecting a value less than 40.

We would calculate the z-score: (40 − 50)/8 = −1.25. The corresponding cumulative probability for z = −1.25 is 0.1056, which means that there is a 10.56% chance that we will randomly pick a value of 40 or less.
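The same calculation can be sketched in R with pnorm, either by standardizing first or by passing the mean and standard deviation directly. The variable names are chosen here for illustration only.

```r
# Values from the example: mean 50, standard deviation 8, cut-off 40
mu    <- 50
sigma <- 8
x     <- 40

# Standardize to a z-score
z <- (x - mu) / sigma

# Probability of a value at or below the cut-off, from the standard normal
p_from_z <- pnorm(z)

# Same probability without standardizing first
p_direct <- pnorm(x, mean = mu, sd = sigma)

c(z = z, p_from_z = p_from_z, p_direct = p_direct)
```

Both routes give the same probability, since pnorm standardizes internally when a mean and standard deviation are supplied.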

## Normal distribution vs t-distribution

For samples with unknown σ, the statistician might apply Student’s t-distribution for n < 30. For n ≥ 30 with unknown σ, it is commonly taught that the normal distribution can be applied as an approximation of the t-distribution.

The reason, again, is that the greater the n, the more closely the sampling distribution approximates the normal distribution. However, statisticians commonly recommend applying Student’s t-distribution whenever σ is unknown.
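To get a feel for how quickly the t-distribution approaches the normal distribution, the sketch below compares the two-sided 95% critical values. The degrees of freedom listed are assumed values picked for illustration.

```r
# Two-sided 95% critical value from the standard normal distribution
z_crit <- qnorm(0.975)

# Corresponding critical values from the t-distribution
# for a few assumed degrees of freedom (n - 1)
df     <- c(5, 10, 29, 100, 1000)
t_crit <- qt(0.975, df = df)

data.frame(df = df, t_crit = t_crit, z_crit = z_crit)
```

The t critical values are always larger than the normal one and move toward it as the degrees of freedom increase.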

Let’s see an example of the difference between using the t-distribution and the normal distribution. The calculation of a 95% confidence interval for a population mean with known sigma uses a z-value of 1.96 from the normal distribution, whereas the t-table will show that the corresponding value in the t-distribution is greater than 1.96.

## 2 common Excel functions

The following Excel formulas can be used to calculate critical values and p-values in the normal distribution:

Finding critical Z-values with =NORM.S.INV (when sigma is known):

Lower left-tailed tests: =NORM.S.INV(α)

Upper right-tailed tests: =NORM.S.INV(1-α)

Two-tailed tests: =NORM.S.INV(α/2) for the lower critical value and =NORM.S.INV(1-α/2) for the upper critical value

Finding p-values with =NORM.S.DIST (when sigma is known):

Lower left-tailed tests: =NORM.S.DIST(z,TRUE)

Upper right-tailed tests: =1-NORM.S.DIST(z,TRUE)

Two-tailed tests: =2*(1-NORM.S.DIST(ABS(z),TRUE))
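For readers who prefer R, the same critical values and p-values can be sketched with qnorm and pnorm. The significance level of 0.05 and the test statistic of 1.96 are assumed values for illustration.

```r
# Assumed significance level and test statistic
alpha <- 0.05
z     <- 1.96

# Critical values (counterparts of NORM.S.INV)
crit_lower <- qnorm(alpha)          # left-tailed test
crit_upper <- qnorm(1 - alpha)      # right-tailed test
crit_two   <- qnorm(1 - alpha / 2)  # two-tailed test, used as +/- this value

# p-values (counterparts of NORM.S.DIST with cumulative = TRUE)
p_lower <- pnorm(z)                  # left-tailed test
p_upper <- 1 - pnorm(z)              # right-tailed test
p_two   <- 2 * (1 - pnorm(abs(z)))   # two-tailed test

c(crit_lower = crit_lower, crit_upper = crit_upper, crit_two = crit_two,
  p_lower = p_lower, p_upper = p_upper, p_two = p_two)
```

Taking the absolute value of z in the two-tailed case keeps the p-value between 0 and 1 when the test statistic is negative.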

## 4 normal distribution functions in R

Below, you will see a few simple examples of calculating probabilities, percentiles and taking random samples from a normally distributed variable with the R functions:

• pnorm gives the distribution function (calculates probabilities)
• qnorm gives the quantile function (calculates quantiles or percentiles)
• dnorm gives the density (finds and/or plots the probability density)
• rnorm generates random deviates (can generate random samples)

On the page The normal distribution in R, you will find more detailed description and worked examples.

Let’s calculate a few examples based on X following a normal distribution with a mean of 65 and a standard deviation of 4:

### pnorm

The pnorm command can be used to calculate probabilities for a normal random variable:

# P(X <= 60)

# Two equivalent ways to code this calculation

pnorm(q = 60, mean = 65, sd = 4, lower.tail = TRUE)

## [1] 0.1056498

pnorm(60, 65, 4)

## [1] 0.1056498

# P(X >= 75)

# Two equivalent ways to code this calculation

pnorm(q = 75, mean = 65, sd = 4, lower.tail = FALSE)

## [1] 0.006209665

pnorm(75, 65, 4, FALSE)

## [1] 0.006209665

pnorm can also be used to calculate probabilities for Z, the standard normal:

# P(Z >= 1.5)

pnorm(q = 1.5, mean = 0, sd = 1, lower.tail = FALSE)

## [1] 0.0668072

pnorm(1.5, 0, 1, FALSE)

## [1] 0.0668072
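A probability between two values can be found by subtracting two cumulative probabilities. The sketch below uses the same X (mean 65, standard deviation 4); the interval from 60 to 70 is an assumed example.

```r
# P(60 <= X <= 70) for X with mean 65 and standard deviation 4
p_upper_bound <- pnorm(70, mean = 65, sd = 4)  # P(X <= 70)
p_lower_bound <- pnorm(60, mean = 65, sd = 4)  # P(X <= 60)

# Area between the two bounds
p_between <- p_upper_bound - p_lower_bound
p_between
```

Because 60 and 70 lie 1.25 standard deviations on either side of the mean, the result equals 1 minus twice the lower-tail probability calculated earlier.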

### qnorm

The qnorm function can be used to calculate quantiles or percentiles for a normal random variable.

# Find the first quartile (Q1), here for a mean of 75 and a standard deviation of 5

qnorm(p = 0.25, mean = 75, sd = 5, lower.tail = TRUE)

## [1] 71.62755
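Since qnorm is the inverse of pnorm, a percentile can be checked by feeding it back into pnorm. A minimal sketch, using the X with mean 65 and standard deviation 4 and an assumed 95th percentile:

```r
# 95th percentile of X with mean 65 and standard deviation 4
q95 <- qnorm(p = 0.95, mean = 65, sd = 4)

# Round trip: the probability below the 95th percentile should be 0.95
p_back <- pnorm(q95, mean = 65, sd = 4)

# Interquartile range: distance between the third and first quartiles
iqr <- qnorm(0.75, mean = 65, sd = 4) - qnorm(0.25, mean = 65, sd = 4)

c(q95 = q95, p_back = p_back, iqr = iqr)
```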

### dnorm

The dnorm function can be used to find and/or plot the probability density function.

# First, we create a sequence and assign it to x

x <- seq(from = 50, to = 80, by = 0.25)

# Find the value of the probability density function for each of these x-values

dens <- dnorm(x, mean = 65, sd = 4)

plot(x, dens)

# Plot with a line, title and labels, turning the y-axis values horizontal and inserting a vertical line at mu=65:

plot(x, dens, type = "l", main = "Normal dist for X: Mean=65, sd=4", xlab = "x", ylab = "Probability density", las = 1)

abline(v = 65)

### rnorm

The rnorm function can be used to draw a random sample from a normally distributed population.

rand <- rnorm(n=40, mean=65, sd=5)

rand

##  [1] 56.08986 60.49351 67.56576 60.25123 64.75519 66.06198 65.65582

##  [8] 64.85890 78.45019 67.03261 70.61278 74.86143 68.81081 57.53910

## [15] 65.41073 64.67119 64.99964 71.95428 67.04310 63.48618 61.87884

## [22] 63.30553 64.21462 67.63769 62.24213 64.91980 68.55189 70.08164

## [29] 68.54631 56.43885 63.14660 65.24606 65.72192 65.55157 61.47583

## [36] 61.12970 68.43353 63.41661 67.44550 55.34121

We recall that, even though the sample is drawn from a normally distributed population, the sample observations might not seem normally distributed. This can be visualized in a histogram:

hist(rand)

As mentioned, more details can be found on the page The normal distribution in R.
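Because rnorm draws at random, the numbers change on every run. The sketch below fixes a seed (123 is an arbitrary, assumed choice) and compares a small and a large sample from the same population; the large sample size of 10,000 is also an assumption for illustration.

```r
# Fix the random seed so the draws are reproducible (seed value is arbitrary)
set.seed(123)

# A small and a large sample from the same normal population
small <- rnorm(n = 40, mean = 65, sd = 5)
large <- rnorm(n = 10000, mean = 65, sd = 5)

# Sample estimates of the mean and the standard deviation
c(mean_small = mean(small), sd_small = sd(small))
c(mean_large = mean(large), sd_large = sd(large))

# Uncomment to compare the shapes of the two histograms
# hist(small)
# hist(large)
```

The estimates from the large sample would typically sit closer to the population values of 65 and 5, and its histogram would look more like the bell curve.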

## Learning resources:

Jeremy Balka’s (jbstatistics’) video: An introduction to the Normal Distribution

#### Carsten Grube

Freelance Data Analyst
