+34 616 71 29 85 carsten@dataz4s.com

Chi-square test

The Chi-square test compares observed sample data with the data we would expect under a given hypothesis. In its “goodness of fit” form, it measures how well the observed counts fit the expected counts. In its test-of-independence form, it analyzes whether two categorical variables depend on each other. Example: Are women more likely than men to vote for a certain political party?

 

 

Key points for chi-square test

  • The Chi-square distribution is applied when testing goodness of fit or dependence between categorical variables
  • It works with count data in discrete, mutually exclusive categories

 

Chi-square test worked example

Say that an HR department, as part of an ongoing training program, runs a periodic test of the employees’ product know-how. A multiple-choice test is used for the purpose. Each question has four possible answers: A, B, C and D. The test producers claim that the correct answer is equally likely to be any of these four.

Some of the HR staff get curious and wish to test the truth of this claim: Is the probability of a correct answer really equal between A, B, C and D?

We would define the following chi-square hypotheses:

H0: The correct choices are equally distributed (A: 25%, B: 25%, C: 25%, D: 25%)

H1: The correct choices are not equally distributed

Let’s set a significance level (α) of 0.05.

 

Purpose of the chi-square hypothesis test

The Chi-square test is a hypothesis test and follows the same procedure and concepts as other hypothesis tests, so we reject the null hypothesis if our p-value is lower than our significance level (α). Rejecting the null hypothesis, in our case, means rejecting the claim that the correct answers are equally distributed between the four options A, B, C and D.

 

Expected vs observed data

HR takes a sample of 100 test questions, randomly selected from the past years of testing with this method. As the null hypothesis says that the correct answers are equally distributed, we expect each of the four options to be the correct answer 25 times. These expected values are compared to the ones observed in the sample, and we can make the following contingency table:

Chi-square test example
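The expected column of the table follows directly from the null hypothesis. A minimal Python sketch, assuming only the sample size of 100 and the four equally likely options stated in the text (the variable names are ours):

```python
# Expected counts under H0: every option is equally likely to be correct.
sample_size = 100                      # number of sampled questions (from the text)
options = ["A", "B", "C", "D"]

# Under H0 each option has probability 1/4.
h0_probabilities = {option: 1 / len(options) for option in options}

# Expected count = sample size x probability under H0.
expected = {option: sample_size * p for option, p in h0_probabilities.items()}

print(expected)  # {'A': 25.0, 'B': 25.0, 'C': 25.0, 'D': 25.0}
```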

Now, how can we tell whether the result from our sample is more extreme than our significance level allows? How can we know if our sample result is “statistically significant”? We apply the Chi-square distribution:

 

Chi-square distribution

With the Chi-square test we can test whether the observed counts in different categories, in our case A, B, C and D, deviate from the expected counts by more than chance would explain. It works with mutually exclusive data, meaning that if the correct choice for a question is A, then it cannot be D at the same time.

The data used in the Chi-square test are counts and therefore discrete: we can count each question and choice as a whole number.

The Chi-square statistic is denoted with the Greek letter chi, squared: χ². To calculate it, we take the squared difference between the observed and the expected value in each category, divide it by the expected value, and sum the results over all categories. Thus, we get the formula for the Chi-square statistic:

Chi-square test statistic formula: χ² = ∑ (O − E)² / E, where O is the observed count and E is the expected count in each category

 

The Chi-square formula explained

To find the distance between the observed and the expected, we subtract the expected value from the observed value. This difference is also called the “residual”.

The differences are squared in order to obtain only positive values and are divided by the expected value in order to normalize them relative to the size of the counts. Otherwise, the Chi-square statistic would increase with the size of the counts, so large datasets would give large statistics. This is the same idea as normalizing or standardizing (see the Z-score chapter).

The operation is carried out for each category, or in our situation, for each row, and the results are then added up. The adding up is expressed by the large sigma (∑) in front of the formula:

Chi-square test statistic calculation
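The same calculation can be sketched in Python. Note that the observed counts below are hypothetical: the original table is not reproduced in the text, so they were chosen only to agree with the two figures the text does state, namely D = 20 and a test statistic of 6.0.

```python
# Hypothetical observed counts (NOT the article's table): chosen to match
# D = 20 and a test statistic of 6.0, both of which the text states.
observed = {"A": 35, "B": 25, "C": 20, "D": 20}
expected = {"A": 25, "B": 25, "C": 25, "D": 25}


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)


statistic = chi_square_statistic(observed, expected)
print(statistic)  # 6.0
```

Each row contributes its own term (here 4.0, 0.0, 1.0 and 1.0), which is what the sigma in the formula adds up.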

Our calculated value, or test statistic, is 6.0. This is now compared with the critical value from the Chi-square table, which we find by looking up the degrees of freedom:

 

Degrees of freedom

The degrees of freedom (df) is the number of category values, or cells in our table, that are free to vary independently. If we know the total and have four values that add up to it, filling in three cells tells us what the fourth value must be.

For example, in our table of observed data, knowing the values for A, B and C and that A+B+C+D = 100 tells us that D must be 20. So D, in this case, is not free to vary. A degree of freedom is lost.

Degrees of freedom

So, the values for A, B and C can be any values, but D must be the missing puzzle piece that makes all four add up to 100. In mathematical terms, A, B and C are free to vary, and the number of such free values is the degrees of freedom.

For a goodness-of-fit test with one set of categories, like the table in our example, the degrees of freedom are the number of categories minus one: 4 − 1 = 3. For a contingency table used to test independence between two variables, the degrees of freedom are (rows − 1) × (columns − 1), often written (r − 1)(c − 1). Reading our table as four rows and two columns gives the same number, (4 − 1) × (2 − 1) = 3.
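The “free to vary” idea can be shown in a few lines of Python. This is only a sketch: the counts for A, B and C are hypothetical values that happen to leave D = 20, as in the text.

```python
observed = {"A": 35, "B": 25, "C": 20}   # hypothetical counts for A, B and C
total = 100                               # fixed sample size from the text

# Once the total is fixed, the last category is determined by the others.
observed["D"] = total - sum(observed.values())

# With k categories and a fixed total, k - 1 counts are free to vary.
degrees_of_freedom = len(observed) - 1

print(observed["D"], degrees_of_freedom)  # 20 3
```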

With the degrees of freedom and the significance level, we can look up the critical value in the Chi-square table.

 

Looking up in the Chi-square table

To look up the critical value in the Chi-square distribution table, we go to the row for degrees of freedom (df) = 3 and follow it to the column that corresponds to our significance level (α) of 5%.

 

Lookup in Chi-square table

  

At df=3 and α=0.05, we find a critical value of 7.81. The Chi-square probability density curve for df=3 shows this critical value next to our test statistic of 6.0:

 

Chi-square test statistics visualized in density curve
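The table value can also be reproduced numerically. The sketch below is one possible approach, not the method behind the printed table: it uses the closed-form cumulative distribution that exists for the special case df = 3 and a simple bisection search. The function names are ours.

```python
import math


def chi2_cdf_df3(x):
    """Left-tail probability of the chi-square distribution with df = 3."""
    if x <= 0:
        return 0.0
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def critical_value_df3(alpha, low=0.0, high=100.0, tolerance=1e-10):
    """Bisection search for the x whose right-tail probability equals alpha."""
    target = 1 - alpha
    while high - low > tolerance:
        mid = (low + high) / 2
        if chi2_cdf_df3(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


critical = critical_value_df3(0.05)
statistic = 6.0                      # test statistic from the worked example

print(round(critical, 2))            # 7.81
print(statistic > critical)          # False: we fail to reject H0
```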

 

Chi-square test conclusion

We find a critical value of 7.81, which is greater than our test statistic of 6.00. So, we fail to reject the null hypothesis: based on our sample results, we cannot reject that the correct choices are equally distributed.

We recall that we do not conclude that H0 is true. Failing to reject H0 only means that we cannot rule out that it could be true. In fact, our 6.00 is “pretty” close to the critical value of 7.81. And from the table, we can read that the p-value for 6.00 is a little more than 10%, because the 0.10 column at df=3 gives 6.25, which is slightly above our 6.00.

So, we get a p-value of a little more than 10%. This means that, if the null hypothesis is true, there is “a little” more than a 10% probability of getting a result at least as extreme as our 6.0. Or expressed as: “There is a 10%+ chance of getting 6.00 or more”.
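One way to get a feel for what this p-value means is to simulate it. The sketch below is a Monte Carlo illustration, not the table lookup used above: it repeatedly draws 100 answers with all four options equally likely (H0 true) and counts how often the Chi-square statistic reaches 6.0 or more. The seed and the number of simulations are arbitrary choices of ours.

```python
import random
from collections import Counter

random.seed(0)                       # fixed seed so the run is repeatable
options = "ABCD"
sample_size = 100                    # questions per simulated sample
expected = sample_size / len(options)
observed_statistic = 6.0             # test statistic from the worked example
n_simulations = 10_000


def simulated_statistic():
    """Chi-square statistic for one sample drawn with H0 true."""
    counts = Counter(random.choices(options, k=sample_size))
    squared_deviations = sum((counts[o] - expected) ** 2 for o in options)
    # All expected counts are equal, so we can divide once at the end.
    return squared_deviations / expected


hits = sum(simulated_statistic() >= observed_statistic for _ in range(n_simulations))
p_value_estimate = hits / n_simulations
print(p_value_estimate)
```

The estimate should land in the neighbourhood of the “little more than 10%” read from the table, with some random variation from run to run if the seed is changed.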

 

Visualizing multiple chi-square distributions

The following graph shows multiple chi-square distributions, each with a different number of degrees of freedom:

 

Chi-square probability density curves
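The curves can be reproduced from the standard Chi-square density formula. The sketch below only prints a few density values per curve rather than plotting them. The degrees of freedom and x values are arbitrary picks for illustration, not necessarily those shown in the graph.

```python
import math


def chi2_pdf(x, df):
    """Chi-square probability density at x > 0 for df degrees of freedom."""
    half = df / 2
    return x ** (half - 1) * math.exp(-x / 2) / (2 ** half * math.gamma(half))


for df in (1, 2, 3, 5, 10):
    densities = [round(chi2_pdf(x, df), 3) for x in (1, 3, 6, 10)]
    print(f"df={df:>2}: {densities}")
```

As the degrees of freedom increase, the peak of the curve moves to the right and the curve becomes flatter and more symmetric.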

 

 

Chi-square test with MS Excel

The CHISQ.TEST, CHISQ.INV and CHISQ.DIST functions in Excel return values based on the Chi-square distribution and are available from Excel 2010 onwards.

 

CHISQ.TEST

The Excel function CHISQ.TEST conducts a Chi-square test on the array of observed values and the array of expected frequencies. It returns the p-value: the probability of getting a result at least as extreme as ours through chance or sampling error alone, assuming the null hypothesis is true.

CHISQ.TEST function in Excel
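For readers without Excel, the same kind of calculation can be sketched in plain Python. This is our own illustration and not Excel's implementation. It assumes a single row or column of counts, so that df is the number of categories minus one, and it computes the tail probability from a power series for the regularized incomplete gamma function. The observed counts are hypothetical, chosen to give the statistic of 6.0 from the worked example.

```python
import math


def regularized_lower_gamma(a, x):
    """P(a, x) via its power series; adequate for the moderate x used here."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi_square_test(observed, expected):
    """Return (statistic, p_value) for a single row or column of counts."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    # Right-tail probability = 1 - CDF, and the CDF is P(df/2, statistic/2).
    p_value = 1.0 - regularized_lower_gamma(df / 2, statistic / 2)
    return statistic, p_value


# Hypothetical observed counts (not the article's table).
statistic, p_value = chi_square_test([35, 25, 20, 20], [25, 25, 25, 25])
print(statistic, round(p_value, 4))  # 6.0 0.1116
```

The p-value of about 0.11 agrees with the “little more than 10%” read from the table earlier.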

 

CHISQ.INV

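The CHISQ.INV function returns the inverse of the left-tailed probability. To get the critical value for a right-tailed significance level α, we enter 1 − α as the probability, for example 0.95 for α = 0.05: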

CHISQ.INV function in Excel

 

CHISQ.DIST(x,df,TRUE)

The Excel function CHISQ.DIST with the arguments (x,df,cumulative=TRUE) returns the cumulative distribution function, that is, the left-tailed probability up to x.

CHISQ.DIST function in Excel

 

CHISQ.DIST(x,df,FALSE)

When cumulative is set to ‘FALSE’ (x,df,cumulative=FALSE), it returns the probability density function. ‘x’ is the calculated test statistic, which for Chi-square statistics is ∑(O−E)²/E.

CHISQ.DIST function in Excel

 

 

Learning statistics

 

Carsten Grube

Freelance Data Analyst
