Comparing two means
Comparing two means is common in tests of, for example, a new drug or a diet. Say we are testing how effective Diet A is compared to Diet B. We have a test group of people following Diet A and a control group of people following Diet B.
To make inference on whether Diet A leads to a greater weight loss than Diet B, we compare the mean weight losses of the test group and the control group. Is there evidence to support that Diet A leads to a greater weight loss than Diet B? A hypothesis test for the means of the two groups answers this question. Further, a confidence interval for the difference between the two means shows what range of differences we can expect.
σ² rarely known
The population variance σ² is rarely known in practice. Statistical procedures that can help with this problem are the pooled variance t-procedure and the Welch t-procedure. Further alternatives are the Wilcoxon rank-sum test, permutation tests and bootstrapping.
The sample distribution of X and Y
Say we have two random variables, X and Y, and that they are normally distributed:
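In symbols, this could be written as follows. The subscripted μ and σ² for each population's mean and variance are my own notation here, so read it as a sketch:

```latex
X \sim N(\mu_X, \sigma_X^2), \qquad Y \sim N(\mu_Y, \sigma_Y^2)
```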
The sampling distribution of the means
The sampling distribution of the sample means of X (X̄) and Y (Ȳ) approximates a normal distribution when the sample size (n) is relatively large. This is explained by the Central Limit Theorem. The variance of a sample mean is the population variance divided by n, which is why the density curves get narrower the larger n is:
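Written out, with n_X and n_Y as the two sample sizes (notation assumed for this sketch), the two sampling distributions would be:

```latex
\bar{X} \sim N\!\left(\mu_X, \frac{\sigma_X^2}{n_X}\right), \qquad
\bar{Y} \sim N\!\left(\mu_Y, \frac{\sigma_Y^2}{n_Y}\right)
```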
The sampling distribution of the difference
The sampling distribution of the difference between X̄ and Ȳ is defined by:
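Assuming the two samples are independent of each other, a sketch of this distribution in the same assumed notation is:

```latex
\bar{X} - \bar{Y} \sim N\!\left(\mu_X - \mu_Y,\; \frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}\right)
```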
Another way of writing this is to denote the two sample means as X̄1 and X̄2:
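With numbered subscripts (again assumed notation, and again assuming independent samples), the same statement would read:

```latex
\bar{X}_1 - \bar{X}_2 \sim N\!\left(\mu_1 - \mu_2,\; \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)
```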
The variance of the sampling distribution of the difference of means is, for independent samples, the variance of X̄ plus the variance of Ȳ. Each of these equals the population variance divided by the respective sample size.
Since we usually do not know the population variances, we approximate them with the sample variances. The general guideline for accepting this approximation is that the sample size of each group is greater than 30.
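The claim that the variances add can be checked with a small simulation. This is only a sketch: the population parameters below are made up for illustration and are not taken from the Diet example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

# Hypothetical population parameters (made up for illustration)
mu_x, sigma_x, n_x = 5.0, 3.0, 40
mu_y, sigma_y, n_y = 2.0, 1.5, 50


def sample_mean(mu, sigma, n):
    """Draw n normal observations and return their mean."""
    return statistics.fmean(random.gauss(mu, sigma) for _ in range(n))


# Repeat the "experiment" many times and record X-bar minus Y-bar each time
diffs = [
    sample_mean(mu_x, sigma_x, n_x) - sample_mean(mu_y, sigma_y, n_y)
    for _ in range(5000)
]

simulated_var = statistics.variance(diffs)
theoretical_var = sigma_x**2 / n_x + sigma_y**2 / n_y

print(f"simulated variance of the difference:   {simulated_var:.3f}")
print(f"theoretical variance (sum of the two):  {theoretical_var:.3f}")
```

The two printed numbers should be close to each other, and the mean of the simulated differences should be close to mu_x - mu_y.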
Confidence interval of difference of means
Building on the concept of confidence intervals, say we want a 95% confidence interval for the true mean difference.
Say we wish to test whether a ‘high fat/low carb’ diet (Diet A) results in a greater weight loss than a ‘low fat/high carb’ diet (Diet B) for persons who are “highly” overweight.
A randomly selected group of 150 “highly” overweight persons joins the test group and follows Diet A for 2 months. Another group of 150 randomly selected “highly” overweight persons volunteers to follow Diet B.
After the two months, the sample outcomes are:
- Group A: Mean weight loss (X̄A) = 5.87 kg. Std. dev. (sA) = 2.94 kg.
- Group B: Mean weight loss (X̄B) = 2.36 kg. Std. dev. (sB) = 1.32 kg.
So, Diet A had a mean loss that was 3.51 kg (5.87 - 2.36) greater than that of Diet B.
Can we conclude from this sample estimate that Diet A results in a greater weight loss than Diet B for ‘highly’ overweight persons? A hypothesis test for the two means will answer this question.
But first, let’s get a 95% confidence interval for the difference between these two means. This is an interval that we can be 95% confident contains the true mean difference.
The confidence interval formula follows the “usual” confidence interval structure, where we add the margin of error (ME) to, and subtract it from, the estimated mean difference:
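As a sketch, using the group labels A and B from the Diet example:

```latex
(\bar{X}_A - \bar{X}_B) \pm ME
```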
Explained in more detail:
The critical z-value for a 95% confidence interval is 1.96, because the interval is two-sided with a lower and an upper limit, which leaves 2.5% in each tail. And we use the sample variances as estimates, as we don’t know the true population variances:
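Put together, the margin of error would look like this (a sketch where z* denotes the critical z-value and s², n are the sample variance and sample size of each group):

```latex
ME = z^{*} \cdot \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}
   = 1.96 \cdot \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}
```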
First, let’s recall our sample outcome:
- Diet A: Mean weight loss (X̄A) = 5.87 kg. Std. dev. (sA) = 2.94 kg.
- Diet B: Mean weight loss (X̄B) = 2.36 kg. Std. dev. (sB) = 1.32 kg.
- Difference in weight loss = d = (X̄A) – (X̄B) = 5.87-2.36 = 3.51
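The calculation with these numbers can be sketched in a few lines of Python. The function name diff_ci and its parameters are my own naming for illustration. The critical value is taken from Python's standard library rather than a z-table.

```python
import math
from statistics import NormalDist


def diff_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, confidence=0.95):
    """z-based confidence interval for the difference of two means."""
    diff = mean_a - mean_b
    # Standard error of the difference, using sample variances as estimates
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    # Two-sided critical value, about 1.96 for 95% confidence
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    me = z_star * se
    return diff - me, diff + me, me


lower, upper, me = diff_ci(5.87, 2.94, 150, 2.36, 1.32, 150)
print(f"{lower:.2f} to {upper:.2f} (ME = {me:.2f})")
# prints: 2.99 to 4.03 (ME = 0.52)
```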
Hence, we can be 95% confident that the true mean difference lies in the interval from 2.99 to 4.03 kilograms. As we recall from Confidence intervals, this can be expressed in different ways:
- We are 95% confident that the interval [2.99; 4.03] contains the true mean difference
- We are 95% confident that the true mean difference is 3.51 with a margin of error of 0.52
- With repeated sampling, approximately 95% of the intervals constructed this way will capture the true mean difference
If our margin of error (ME) had implied a confidence interval that included zero, it would indicate that there could be 0 difference between the means. Loosely speaking, zero would be among the plausible values for the difference. Therefore, we would not have had support for Diet A being more effective than Diet B.
Our 95% confidence interval does not include zero. Therefore, there is sufficient evidence of a difference between the two means. Because the whole interval lies above zero, there is evidence that Diet A results in a greater weight loss for these persons than Diet B. We can be 95% confident that highly overweight persons following Diet A for 2 months will, on average, lose more weight than they would on Diet B.
Let’s conduct a hypothesis test to formulate this in mathematical terms:
Hypothesis test for difference of means
In a hypothesis test for the difference of means, the null hypothesis usually states that there is no difference between the means. With an alternative hypothesis of a difference in either direction, this leads to a two-tailed test.
However, I already know that in the sample Diet A returned a greater weight loss than Diet B. So, my question is: Is Diet A really more effective than Diet B? This leads to a one-tailed test, as I will be looking in the right tail of the distribution:
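A sketch of the one-tailed hypotheses, with μ_A and μ_B as the true mean weight losses for the two diets (assumed notation):

```latex
H_0: \mu_A - \mu_B \le 0 \qquad\qquad H_a: \mu_A - \mu_B > 0
```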
The test statistic has the same structure as other test statistics we know, for example the z-score: we take the sample mean difference, subtract the hypothesized difference of 0, and divide by the standard error of the difference:
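In the notation of the Diet example, and with the sample variances standing in for the unknown population variances, the statistic would be:

```latex
z = \frac{(\bar{X}_A - \bar{X}_B) - 0}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}
```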
Our test statistic is 13.34, which is ‘far beyond’ our critical z-value of 1.65 at a significance level (α) of 0.05. We can therefore reject the null hypothesis and say that we have strong evidence that there is a difference between the two means. In other words, our sample gives strong evidence that Diet A results in greater weight loss than Diet B.
Expressed in kilograms, the critical value of 1.65 standard errors above the hypothesized difference of 0 corresponds to 0.43 kg. This means that, at a significance level (α) of 5%, we would reject H0 for any observed difference equal to or greater than 0.43 kg. Our observed difference of 3.51 kg is ‘far beyond’ the 0.43 kg, and hence we reject H0.
Running the test in statistical software, we find a ‘microscopic’, almost non-existent p-value. It expresses that there is practically 0 probability of getting a result as extreme as the one we got if there were no difference, or if the mean for B were greater than the mean for A.
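To see what such software would be doing, here is a minimal sketch of the one-tailed z-test in Python. The function name one_sided_z_test is made up for this illustration. The upper-tail probability is computed with math.erfc, because for a z this large "1 minus the cumulative probability" would just round to zero.

```python
import math
from statistics import NormalDist


def one_sided_z_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Right-tailed z-test of H0: mu_A - mu_B <= 0 from summary statistics."""
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = (mean_a - mean_b) / se
    # Upper-tail probability P(Z >= z), stable for very large z
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, se, p_value


z, se, p_value = one_sided_z_test(5.87, 2.94, 150, 2.36, 1.32, 150)

# Critical value on the z scale (alpha = 0.05, one tail) and in kilograms
z_critical = NormalDist().inv_cdf(0.95)
critical_diff = z_critical * se

print(f"z = {z:.2f}, critical z = {z_critical:.3f}")
print(f"critical difference = {critical_diff:.2f} kg")
print(f"p-value = {p_value:.3g}")
```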
Confidence interval & hypothesis test together
Our 95% confidence interval did not include 0, so we knew that our hypothesis test would reject H0. Furthermore, the lower end of the confidence interval was 2.99, which could be described as ‘pretty far’ from 0.
Why would we then run a hypothesis test if we already knew beforehand that H0 would be rejected? Because the hypothesis test is the formal way of concluding whether or not we can reject that the difference is 0.
Test & interval for smaller difference
In the example above, we get a very large test statistic, and we can even see from the difference between the means that the weight loss seems to be clearly greater for Diet A.
Now, say that after 4 months of running the test group on the diet in parallel with the control group, we get these sample outcomes:
- Group A: Mean weight loss (X̄A) = 9.31 kg. Std. dev. (sA) = 4.66 kg.
- Group B: Mean weight loss (X̄B) = 7.70 kg. Std. dev. (sB) = 4.31 kg.
The difference here is only 1.61 kg. With this smaller difference, we might be more likely to suspect that there is no difference, so in this context the test and the interval calculation become powerful.
Our 95% confidence interval would be [0.59; 2.63], and we would reject our null hypothesis: the z-score is 3.11, and the observed difference of 1.61 kg exceeds the critical value of 0.85 kg.
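The numbers for this second example can be recomputed with a short sketch. It assumes, as before, 150 persons per group, a two-sided 95% interval and a one-tailed test at α = 0.05. The variable names are my own.

```python
import math
from statistics import NormalDist

# Sample outcomes after 4 months (n = 150 per group, as in the example)
mean_a, sd_a, n_a = 9.31, 4.66, 150
mean_b, sd_b, n_b = 7.70, 4.31, 150

diff = mean_a - mean_b
se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)

# Two-sided 95% confidence interval for the difference
z_two_sided = NormalDist().inv_cdf(0.975)
ci_95 = (diff - z_two_sided * se, diff + z_two_sided * se)

# One-tailed test: z-score, critical difference in kg, and p-value
z = diff / se
critical_diff = NormalDist().inv_cdf(0.95) * se
p_value = 1 - NormalDist().cdf(z)

print(f"95% CI: {ci_95[0]:.2f} to {ci_95[1]:.2f}")
print(f"z = {z:.2f}, critical difference = {critical_diff:.2f} kg")
print(f"one-tailed p-value = {p_value:.4f}")
```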
The example above uses sample sizes of 150, which, according to many statisticians, allows us to apply the z-table. Now, let’s run a short example with n < 30 using t-statistics.
In our Diet test example above, we worked with sample sizes of 150 each. The p-value for the difference between the means was nearly zero, so our sample gave very strong evidence that Diet A gives greater weight loss than Diet B.
Let’s use Excel and its Random Number Generation tool (in Data >> Data Analysis) to generate two groups of 150 random values each. The values return slightly different means and variances compared to our Diet example above, but to continue our story, say that these are two new samples for our Diet A and Diet B:
We see that the z-test, based on all 2 times 150 observations, returns a p-value of 0.00. Extending with more decimals (which I forgot to do in the screenshot), we would see that this p-value, like in our Diet example above, is ‘very close’ to zero.
t-test with n1=11 & n2=13
Running a t-test on the first 11 observations of Group 1 and the first 13 of Group 2 returns a p-value of 0.01622. Testing at a significance level (α) of 0.01, we would fail to reject H0, and we could not claim any difference in weight loss between Diet A and Diet B.
t-test with n1=8 & n2=10
In the second t-test (the table to the right), the observations are reduced to the first 8 of Group 1 and the first 10 of Group 2. This leads to an even greater p-value of 0.07603. So, any hypothesis test with α < 0.076 would make us fail to reject H0, meaning that our samples do not give evidence that Diet A is more effective than Diet B.
Other options for comparing two means
As mentioned, the population variance σ² is almost never known, and the z-based formulas that we use above are therefore, strictly speaking, almost never applicable in real-life cases. Two options that we can choose when the population variance is unknown:
- The pooled variance t-procedure
- The Welch t-procedure (unpooled variance)
The pooled t-procedure assumes equal population variances and, for normally distributed populations, results in an exact t-distribution.
The Welch unpooled t-procedure can feel more comfortable to begin with because it does not assume equal population variances, but its test statistic only approximately follows a t-distribution. I run through these procedures in Pooled variance t-procedure and Welch t-procedure.
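To make the difference between the two procedures concrete, here is a sketch that computes the t statistic and degrees of freedom for both from summary statistics. The function names are my own. The small-sample call at the end reuses the means and standard deviations from the first Diet example with hypothetical sample sizes of 11 and 13, so it is not a real data set. The p-value lookup in a t-table or in software is left out.

```python
import math


def pooled_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Pooled-variance t statistic and degrees of freedom (equal variances assumed)."""
    df = n_1 + n_2 - 2
    # Weighted average of the two sample variances
    pooled_var = ((n_1 - 1) * sd_1**2 + (n_2 - 1) * sd_2**2) / df
    se = math.sqrt(pooled_var * (1 / n_1 + 1 / n_2))
    return (mean_1 - mean_2) / se, df


def welch_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Welch t statistic with Welch-Satterthwaite approximate degrees of freedom."""
    v_1 = sd_1**2 / n_1
    v_2 = sd_2**2 / n_2
    se = math.sqrt(v_1 + v_2)
    df = (v_1 + v_2) ** 2 / (v_1**2 / (n_1 - 1) + v_2**2 / (n_2 - 1))
    return (mean_1 - mean_2) / se, df


# Hypothetical small samples: Diet example summary values with n = 11 and n = 13
t_pooled, df_pooled = pooled_t(5.87, 2.94, 11, 2.36, 1.32, 13)
t_welch, df_welch = welch_t(5.87, 2.94, 11, 2.36, 1.32, 13)

print(f"pooled: t = {t_pooled:.2f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.2f}, df = {df_welch:.1f}")
```

With unequal standard deviations, as here, the Welch version ends up with noticeably fewer degrees of freedom than the pooled version. With equal sample sizes the two t statistics coincide and only the degrees of freedom differ.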
Other options when we do not assume normality are:
- Mann-Whitney U, or Wilcoxon rank-sum test
- Bootstrap methods and permutation tests
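As an illustration of the last option, a permutation test can be sketched in a few lines. The weight losses below are invented for the example and the function name permutation_test is my own. The idea is to shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(group_a) > mean(group_b)."""
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    combined = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        # Reassign the observations to the two groups at random
        rng.shuffle(combined)
        diff = statistics.fmean(combined[:n_a]) - statistics.fmean(combined[n_a:])
        if diff >= observed:
            count += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0
    return observed, (count + 1) / (n_permutations + 1)


# Hypothetical weight losses in kg (made-up data, not from the Diet example)
diet_a = [6.1, 4.8, 7.3, 5.5, 8.0, 3.9, 6.6, 5.2]
diet_b = [2.9, 3.4, 1.8, 4.1, 2.2, 3.0, 2.7, 1.5, 3.8, 2.4]

observed, p_value = permutation_test(diet_a, diet_b)
print(f"observed difference = {observed:.3f} kg, p-value = {p_value:.4f}")
```

No normality assumption is needed here. The only assumption is that, under H0, the group labels are exchangeable.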
Using Excel for comparing two means
We can do z-tests and t-tests for comparing two means in Excel via Data >> Data Analysis, choosing the tool that matches whether we work with known or unknown variance.
For example, in this chapter we are working with z-tests, which treat the variances as known, and t-tests assuming equal variances, so we would be working with the functions:
- “z-Test: Two Sample for Means”
- “t-Test: Two-Sample Assuming Equal Variances”
Below, we have two groups, each with a sample size of 150, and we therefore apply a z-test:
Another example is the one listed above in the t-statistics exercise, where we do one test for sample sizes 11 and 13 and another for sample sizes of 8 and 10:
Alternatively, we can work out our own Excel setup:
Some of my preferred material for learning about comparing two means:
- Video (6:20): Inference for two means
- Video (10:07): The sampling distribution of the difference in sample means
- Khan Academy: