Comparing two proportions
Comparing two proportions is common during election periods, for example when comparing how two groups of people vote for a party. Are women more likely than men to vote for a certain party?
Except for a few twists, I will use the example that Salman Khan gives us in this video: Comparing population proportions 1:
Say we are in the election period and we wish to know if men are more likely to vote for a certain party than women.
Out of the 900 men that we sample, 584 (=65%) express that they will vote for the party.
Out of the 1100 women that we sample, 651 (=59%) express that they will vote for the party.
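As a quick check on the percentages above, the sample proportions can be computed directly from the counts. This is a minimal Python sketch; the variable names are illustrative and not part of the original example.

```python
# Counts from the example: respondents who say they will vote for the party.
men_yes, men_n = 584, 900
women_yes, women_n = 651, 1100

p_men = men_yes / men_n        # sample proportion for men
p_women = women_yes / women_n  # sample proportion for women

print(f"men: {p_men:.3f}, women: {p_women:.3f}")  # men: 0.649, women: 0.592
```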
The distributions of sampled men and women
Each sampled answer follows a Bernoulli distribution, where the mean equals the sample proportion. The variance equals the success rate (the sample proportion) times the failure rate (the proportion not voting for the party). This is p(1-p), which can loosely be expressed as the ‘yes-proportion’ times the ‘no-proportion’.
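To make the ‘yes-proportion’ times ‘no-proportion’ rule concrete, here is a small Python sketch applied to the two samples. The helper name `bernoulli_mean_var` is an assumption made for illustration only.

```python
def bernoulli_mean_var(p):
    """Return the mean and variance of a Bernoulli variable with success rate p."""
    return p, p * (1 - p)

mean_men, var_men = bernoulli_mean_var(584 / 900)
mean_women, var_women = bernoulli_mean_var(651 / 1100)

print(f"men:   mean={mean_men:.3f}, variance={var_men:.3f}")      # 0.649, 0.228
print(f"women: mean={mean_women:.3f}, variance={var_women:.3f}")  # 0.592, 0.242
```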
The sampling distribution of proportion means
As we have two relatively large sample sizes and proportions that are relatively far from 0 and from 1, the sampling distributions become approximately normally distributed:
The mean of the sampling distribution of the sample proportion (µp̄) equals the population proportion (p), which we estimate with the sample proportion (p̄).
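The sketch below checks the "large sample, proportion not too close to 0 or 1" condition and computes the standard deviation of each sampling distribution as the square root of p(1-p)/n. The threshold of at least 10 successes and 10 failures is a common rule of thumb that I assume here, not something stated in the example, and the function name is illustrative.

```python
from math import sqrt

def describe_sampling_distribution(successes, n, min_count=10):
    """Return (proportion, standard error, normal approximation OK?).

    min_count is an assumed rule-of-thumb threshold for both the number
    of successes and the number of failures.
    """
    p = successes / n
    failures = n - successes
    normal_ok = successes >= min_count and failures >= min_count
    std_error = sqrt(p * (1 - p) / n)
    return p, std_error, normal_ok

p_men, se_men, ok_men = describe_sampling_distribution(584, 900)
p_women, se_women, ok_women = describe_sampling_distribution(651, 1100)

print(f"men:   p={p_men:.3f}, se={se_men:.4f}, normal ok={ok_men}")
print(f"women: p={p_women:.3f}, se={se_women:.4f}, normal ok={ok_women}")
```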
The sampling distribution of the difference
The sampling distribution of the difference between the two sample proportions has a mean equal to the difference between the two population proportions, which we estimate with the difference between our sample proportions.
The variance of the sampling distribution of the difference = the variance of the sampling distribution for p1 + the variance of the sampling distribution for p2.
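As a hedged illustration of adding the two variances, the following Python sketch computes the variance of each sampling distribution as p(1-p)/n and sums them. The function name `proportion_variance` is hypothetical.

```python
def proportion_variance(successes, n):
    """Variance of the sampling distribution of a sample proportion: p(1-p)/n."""
    p = successes / n
    return p * (1 - p) / n

var_men = proportion_variance(584, 900)
var_women = proportion_variance(651, 1100)
var_diff = var_men + var_women  # variances add for independent samples

print(f"variance of the difference: {var_diff:.6f}")  # 0.000473
```

Note that the variances add even though we subtract the proportions, because the two samples are assumed to be independent.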
Using a confidence interval to compare
Now that we have identified and described the sampling distributions and their means, variances and standard deviations, we can start working on our confidence interval for the difference between the sample proportions.
A confidence interval attaches a degree of uncertainty to our point estimate which, in our case, is 0.057. Often, one of our main interests when constructing a confidence interval for the difference is to see if zero is included in the interval.
If zero is included in our 95% confidence interval, we will not have evidence, in the corresponding two-sided hypothesis test at the 5% significance level, to reject H0, and therefore no evidence that there really is a difference. The difference might be zero.
The confidence interval formula follows the “usual” confidence interval structure, where we add and subtract the margin of error (ME) to and from the estimated difference: estimated difference ± critical value × standard deviation of the difference.
Above we saw that the variance of the difference is the sum of the two variances, so the standard deviation of the difference is the square root of that sum: σd = √(p̄1(1-p̄1)/n1 + p̄2(1-p̄2)/n2) = √(0.649 × 0.351 / 900 + 0.592 × 0.408 / 1100) ≈ 0.022.
Sorry for using different notations! Here I denote the standard deviation ‘σd’. The ‘d’ is the difference, which above, I denoted p1-p2.
The critical value for a 95% confidence interval is 1.96, and having calculated the estimated standard deviation of the difference, we can now plug this 0.022 into the confidence interval formula: 0.057 ± 1.96 × 0.022.
We get a confidence interval that spans from 0.014 to 0.100. This means that we can be 95% confident that the true difference between the population parameters P1 and P2 lies between 0.014 and 0.100, so it is at least 0.014. Zero is not included in the interval, so there is sufficient evidence at the 5% level to support our alternative hypothesis that there is a difference between the two population proportions. P1 minus P2 does not seem to be zero.
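The interval can be reproduced with a few lines of Python. This is a sketch under the assumptions used above (independent samples, normal approximation, unpooled standard deviation); the function name is illustrative, and the critical value is taken from the standard library's `statistics.NormalDist` rather than hard-coded as 1.96.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(x1, n1, x2, n2, level=0.95):
    """Confidence interval for p1 - p2 using the normal approximation."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Unpooled standard deviation of the difference.
    sd_diff = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Two-sided critical value, about 1.96 for a 95% level.
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z_crit * sd_diff
    return diff - margin, diff + margin

low, high = diff_confidence_interval(584, 900, 651, 1100)
print(f"95% CI: {low:.3f} to {high:.3f}")  # 95% CI: 0.014 to 0.100
```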
Hypothesis test for the sample mean difference
The test answers the following question: How likely are we to get a result as extreme as the one we got in our sample, assuming that there really is no difference between the proportions? Can we reject the null hypothesis stating that there is no difference between the two population proportions?
We have large sample sizes and sample proportions relatively far from 0 and from 1, so we will assume normality and apply the z-statistic: z = ((p̄1 - p̄2) - (p1 - p2)) / √(p̂ × q̂ × (1/n1 + 1/n2)).
Here p1 and p2 are assumed to be equal under the null hypothesis, so that parenthesis is zero. p̂ is the pooled proportion of all sampled persons who say ‘Yes’, and q̂ = 1 - p̂ is the proportion of the persons who say ‘No’. It is the proportion that does not say ‘Yes’.
The estimated p-hat and q-hat are: p̂ = (584 + 651) / (900 + 1100) = 0.6175 and q̂ = 1 - 0.6175 = 0.3825.
We can now calculate our z-statistic: z = 0.05707 / √(0.6175 × 0.3825 × (1/900 + 1/1100)) = 0.05707 / 0.02184 ≈ 2.613.
The critical z-value for a two-sided test at a significance level (α) of 0.05 is 1.96, so with our test statistic of 2.613 we reject the null hypothesis. The data do not support the claim that the two proportions are equal.
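The whole test can be sketched in Python as follows. The function name is illustrative, and the two-sided p-value is an addition that does not appear in the example; it is computed from the standard normal distribution in `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 = p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled 'yes' proportion (p-hat)
    q_pool = 1 - p_pool             # pooled 'no' proportion (q-hat)
    se = sqrt(p_pool * q_pool * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se              # (p1 - p2) under H0 is zero
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = two_proportion_z_test(584, 900, 651, 1100)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")  # z = 2.613, p-value = 0.009
```

A p-value of about 0.009 is below 0.05, which agrees with the comparison of 2.613 against the critical value of 1.96.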
Comparing two proportions with MS Excel
Below is a screenshot of how two proportions can be compared in Excel. The z-test function in Data >> Data Analysis is another option, although it needs the ranges of the individual observations.
Learnings on comparing two proportions
My preferred material for learning theory on comparing two proportions:
- Khan Academy
- Video 15:09 min.: Including confidence interval and hypothesis test: Introduction to inference for two proportions
- Video 13:22 min.: Inference for two proportions: An example of a confidence interval and a hypothesis test