Chi-square Analysis

INTRODUCTION

Chi-square tests have a variety of purposes, but most commonly they are used to analyze counts (frequencies) of categorical data – data that can be assigned to a small number of discrete categories. Sometimes there are just two categories (male vs. female; dead vs. alive), sometimes several (British, French, or German; red, yellow, blue, or green). Such data are typically analyzed in one of two ways:

 

Goodness-of-fit tests are used to compare the frequencies in a data sample with the frequencies expected based on some prior expectation – either empirical or theoretical. For example:

  1. You might want to compare the number of males vs. females in your sample with the known sex-ratio in the population as a whole (normally very close to 1:1). Here your prior expectation would be empirical.
  2. You might want to compare the number of Drosophila fruit flies with a dominant vs. recessive phenotype. If your sample was taken from the F2 generation of a cross between two pure-breeding strains in the parental generation (a monohybrid cross), you would expect the dominant:recessive ratio to be 3:1. Here your prior expectation would be theoretical, being derived from an understanding of Mendelian genetic inheritance.

 

Contingency table analysis is used to compare the frequencies in two or more data samples to one another, rather than to a prior expectation. For example:

  1. You might have randomly collected 100 plants from each of two meadows and counted the number of red vs. pink flowered individuals, with the goal of determining whether the frequencies of the two colors differed between meadows.
  2. You might have experimentally infected 40 mice with a bacterial pathogen, treated 20 mice with one antibiotic and the other 20 with a different antibiotic, with the goal of determining whether the number of mice that recovered differed between the two antibiotics.

Both types of analysis involve calculating a statistic called a chi-square value. This is normally given the symbol χ² (χ is the Greek letter chi, pronounced “ki” to rhyme with “eye”, although in the font on this website it doesn’t look much different from a capital X). The term is written out in English in this document for ease of reading, but when describing the results of a chi-square test in a report or paper, you would normally use the following format:

χ² = 1.354, d.f. = 1, p = 0.245

where d.f. is the degrees of freedom for the test and p is the p-value. See below for more details.

A word on sample sizes

Chi-square tests are generally considered to be unreliable if the expected values in any of your data categories are less than five. Alternative tests are available (exact binomial tests for goodness-of-fit and Fisher’s exact tests for contingency tables), but it would be better to have decent sample sizes.
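The rule of thumb above can be checked before running a test. The following Python sketch is only an illustration, not part of the original procedure: the sample of 12 flies is hypothetical, and the function names are invented for this example. It flags expected values below five and, for a two-category goodness-of-fit problem, falls back on a two-sided exact binomial test (summing the probabilities of every outcome that is no more likely than the observed one, which is one common definition of "two-sided").

```python
import math


def expected_counts(total, proportions):
    """Expected count in each category under the null hypothesis."""
    return [total * p for p in proportions]


def exact_binomial_p(successes, trials, p_null):
    """Two-sided exact binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed outcome under the null proportion p_null.
    """
    def pmf(k):
        return math.comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed_prob = pmf(successes)
    tolerance = 1 + 1e-7  # guards against floating-point ties
    total = sum(pmf(k) for k in range(trials + 1)
                if pmf(k) <= observed_prob * tolerance)
    return min(1.0, total)


# Hypothetical small sample: 12 flies, 7 dominant, 3:1 ratio expected.
expected = expected_counts(12, [0.75, 0.25])
too_small = any(e < 5 for e in expected)

if too_small:
    p_value = exact_binomial_p(7, 12, 0.75)
    print(f"Expected counts {expected} include a value below 5")
    print(f"Exact binomial p-value = {p_value:.3f}")
```

For contingency tables the corresponding alternative is Fisher's exact test, which is not sketched here.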


PERFORMING A GOODNESS-OF-FIT TEST

Goodness-of-fit tests are easy to do with a calculator or in an Excel spreadsheet. In fact, they are often used to teach the basic principles of statistical analysis, and you may have already done them in an introductory biology course. They can also be done very quickly in JMP. The hand calculation method is shown below, using Drosophila phenotypes as an example. If you want to know how to use JMP, scroll down.

Hand calculation

In a goodness-of-fit test, the chi-square value is an indicator of how much the observed frequencies deviate from expected frequencies. It is calculated as follows:

  1. For each category of data, subtract the expected value from the observed value (O-E). This gives you a measure of the size of the deviation from expectation.
  2. Square each of these deviations, to calculate (O-E)². The purpose of this is to convert negative deviations into positive values (because the square of a negative number is always positive). This allows deviations to be summed at the end of the calculation, rather than cancelling each other out.
  3. Divide each squared deviation by the expected value for that category, to calculate (O-E)²/E. The purpose of this is to obtain a measure of the size of the deviation relative to what is expected. For example, if the expected value is 20, a deviation of 4 is large (20%), whereas if the expected value is 200, a deviation of 4 is small (2%).
  4. Finally, sum the (O-E)²/E values across all categories of data. This sum is the chi-square value for your data set.
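The four steps above can also be sketched in a few lines of Python. This is just an illustration of the hand calculation, using the Drosophila counts from this page (148 dominant, 59 recessive) and a 3:1 expected ratio; the variable names are chosen for this example only.

```python
# Observed counts and the expected 3:1 ratio for a monohybrid F2 generation.
observed = {"Dominant": 148, "Recessive": 59}
ratio = {"Dominant": 3, "Recessive": 1}

total = sum(observed.values())
ratio_total = sum(ratio.values())

# Expected counts: the total number of flies split according to the ratio.
expected = {cat: total * ratio[cat] / ratio_total for cat in observed}

chi_square = 0.0
for cat in observed:
    deviation = observed[cat] - expected[cat]   # step 1: O-E
    squared = deviation ** 2                    # step 2: (O-E) squared
    contribution = squared / expected[cat]      # step 3: divide by E
    chi_square += contribution                  # step 4: sum over categories
    print(f"{cat:10s} O={observed[cat]:4d}  E={expected[cat]:8.3f}  "
          f"(O-E)^2/E={contribution:.3f}")

print(f"chi-square = {chi_square:.3f}")
```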

This sounds a little complicated, but is actually very easy, especially if you use Excel to do the calculations for you, as the following table of Drosophila phenotype frequencies shows.

[Table: GoF calculations]

Several things are worth pointing out in this table:

  1. The expected values are based on a total of 207 flies in a 3:1 ratio, which is the ratio expected in the F2 generation of a monohybrid cross: 207 x 0.75 = 155.250; 207 x 0.25 = 51.750. Expected values will obviously differ from test to test depending on the data you are analyzing.
  2. The sum of expected values should always equal the sum of observed values.
  3. It is normally best to work to three decimal places for precision.
  4. d.f. stands for degrees of freedom, which is an indicator of the amount of data you have. (See Foundational Material for a more extensive discussion of degrees of freedom). For a goodness-of-fit test, the degrees of freedom is equal to the number of categories of data (dominant, recessive) minus one (2-1 = 1). If there had been four categories of data, there would have been 3 d.f.
  5. The p-value indicates whether the observed deviation from expectation (as measured by the chi-square value) is statistically significant or not (see Foundational Material for a more extensive discussion of p-values). P-values < 0.05 are conventionally regarded as being statistically significant, indicating that a deviation this large would be unlikely to arise just from chance events when you sampled your flies. In this case the p-value is > 0.05, indicating that the deviation could plausibly be due to chance, and therefore that your observed data are consistent with what you would expect in the F2 generation of a monohybrid cross.
  6. How was the p-value obtained? You can either look it up in a statistical table (see Foundational Material for details) or ask Excel to calculate the value for you. To do this, use the following Excel formula:

=chidist(F4,F5)

where F4 is the cell containing the chi-square value and F5 is the cell containing the degrees of freedom. These cell references will differ from spreadsheet to spreadsheet, depending on how you organize your data. In recent versions of Excel the same calculation is also available as =CHISQ.DIST.RT(F4,F5).
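For readers working outside Excel, the same upper-tail probability can be computed with Python's standard library alone. The function below is a sketch written for this page, not a library routine: its name is invented, and it handles only whole-number degrees of freedom, which is all that chi-square tests on counts require. It relies on the fact that for 1 d.f. the p-value equals erfc(√(χ²/2)), and builds higher degrees of freedom up from there.

```python
import math


def chi_square_p_value(chi_square, df):
    """Upper-tail probability of a chi-square value for integer df >= 1."""
    if df < 1 or chi_square < 0:
        raise ValueError("need df >= 1 and chi_square >= 0")
    z = chi_square / 2.0
    if df % 2 == 0:
        # Even df: exp(-z) times a finite sum of z^j / j!
        term = 1.0
        total = 1.0
        for j in range(1, df // 2):
            term *= z / j
            total += term
        return math.exp(-z) * total
    # Odd df: start from the 1 d.f. result and add one term per extra 2 d.f.
    p = math.erfc(math.sqrt(z))
    for j in range((df - 1) // 2):
        p += math.exp(-z) * z ** (j + 0.5) / math.gamma(j + 1.5)
    return p


# The Drosophila example: chi-square = 1.354 with 1 degree of freedom.
p_value = chi_square_p_value(1.354, 1)
print(f"p = {p_value:.3f}")
```

As a sanity check, the familiar 5% critical values (3.841 for 1 d.f., 5.991 for 2 d.f., 7.815 for 3 d.f.) should each give a p-value very close to 0.05.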


USING JMP FOR A GOODNESS-OF-FIT TEST (MAC AND PC)

The simplest method of performing a goodness-of-fit test in JMP is to create two columns in your data table - one with the category names and one with the frequencies. Using the Drosophila data described above, your table would look like this, with the first row being the column names. JMP should automatically make Phenotype a categorical variable (indicated by a red histogram in the column list on the left) and Frequency a continuous variable (indicated by a blue triangle). A second way of organizing your data, which might be more appropriate for some data sets, is described at the end of this section. 

Phenotype    Frequency
Dominant     148
Recessive    59

In the Analyze menu, select Distribution, and in the window that appears drag Phenotype into the Y, Columns box and Frequency into the Freq box, then hit OK. A new window showing the frequencies in histogram form will appear. If you want, you can turn this histogram sideways by clicking on the red arrow beside Distributions and selecting Stack.

Now click on the red arrow beside Phenotype and select Test Probabilities. The window will expand. You can now enter your expected proportions into the boxes below the column called Hypoth Prob – in this case 0.75 for the dominant phenotype and 0.25 for the recessive (a 3:1 ratio, as expected for the F2 generation of this type of cross). This is your statistical null hypothesis.

Now select your alternative hypothesis from the list below. The default is the first one, which is the one you want: probabilities not equal to the hypothesized value (two-sided chi-square test). Then hit Done.

The entire window will now look like this:

[Screenshot: JMP goodness-of-fit output]

 

The middle table shows the data, expressed as both numbers (Count) and proportions (Prob). Somewhat confusingly, JMP uses Prob for proportions as well as for the p-values for the statistical tests: be sure not to confuse these in your own mind. (JMP uses Prob for proportions because if 71.498% of your flies have the dominant phenotype, a randomly chosen fly from your entire sample has a 0.71498 probability of being dominant.)

If you wish, you can click on the red arrow next to Phenotype, choose Histogram Options, and add a vertical axis scale (either counts or probabilities) as well as show the counts (or percentages) above each bar.

The table on the right compares the observed (Estim Prob) and expected (Hypoth Prob) proportions of the two phenotypes at the top. Below that is an output table, listing two statistical tests (Likelihood Ratio, Pearson), along with the chi-square value, degrees of freedom (DF) and probability value (Prob>Chisq) for each.

When people use the term chi-square test, they normally mean a particular version of the test called the Pearson chi-square test. This is what your hand calculation method calculates. So, in this particular example

  • the chi-square value = 1.354
  • the degrees of freedom = 1
  • the p-value = 0.245.

Since the p-value is greater than 0.05, we can conclude that the difference between the observed data and the theoretical expectation is not statistically significant, and so could plausibly be due to chance events that occurred when you sampled your flies. If the p-value had been < 0.05, JMP would have highlighted this value in red.
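To see why the output lists two slightly different chi-square values, it can help to compute both statistics side by side. The sketch below assumes that the Likelihood Ratio row is the usual likelihood-ratio (G) statistic, 2 × Σ O × ln(O/E); that is the standard definition, but it is an assumption here rather than something confirmed from JMP's documentation.

```python
import math

observed = [148, 59]          # dominant, recessive
expected = [155.25, 51.75]    # 207 flies in a 3:1 ratio

# Pearson chi-square: sum of (O-E)^2 / E, as in the hand calculation.
pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Likelihood-ratio statistic (assumed definition): 2 * sum of O * ln(O/E).
likelihood_ratio = 2 * sum(o * math.log(o / e)
                           for o, e in zip(observed, expected))

print(f"Pearson chi-square      = {pearson:.3f}")
print(f"Likelihood-ratio value  = {likelihood_ratio:.3f}")
```

The two values are usually close when sample sizes are reasonable, and both are compared against the same chi-square distribution.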

 

A second way to organize your data

In some cases, it might be more appropriate to organize your data not as a summary table but with each individual on a separate line. For example, if you had weighed as well as recorded the phenotype of each fly, your data would look like this. 

Fly    Phenotype    Mass (mg)
1      Dominant     0.264
2      Dominant     0.218
3      Recessive    0.226
4      Dominant     0.217
5      Recessive    0.283
6      Dominant     0.255
7      Dominant     0.229
8      Dominant     0.237

In this case, you should select Distribution as before, then simply drag Phenotype into the Y, Columns box and proceed as described above.
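What JMP does with this layout is simply count the rows in each category. As a rough illustration (not JMP's own code), the eight example rows above can be tallied into a summary table in Python like this:

```python
from collections import Counter

# One record per fly: (fly number, phenotype, mass in mg), as in the table above.
records = [
    (1, "Dominant", 0.264),
    (2, "Dominant", 0.218),
    (3, "Recessive", 0.226),
    (4, "Dominant", 0.217),
    (5, "Recessive", 0.283),
    (6, "Dominant", 0.255),
    (7, "Dominant", 0.229),
    (8, "Dominant", 0.237),
]

# Count how many flies fall into each phenotype category.
counts = Counter(phenotype for _, phenotype, _ in records)

for phenotype, frequency in counts.items():
    print(f"{phenotype:10s} {frequency}")
```

The resulting frequencies play the same role as the Frequency column in the summary-table layout.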


PERFORMING A CONTINGENCY TABLE ANALYSIS

Contingency table analysis can also be done by hand, but it can be time-consuming to do so, especially if the number of categories of data is large.  It is much faster to carry out in JMP. The term “contingency table analysis” is used because the data can be organized in the form of a table and the analysis is designed to determine whether the proportions of one variable in the table are contingent on (depend on) another variable. Using the flower color example described earlier, the table would look like this, with column proportions in parentheses after the frequencies. 

 

                        Meadow A     Meadow B
Red flowered plants     74 (0.74)    55 (0.55)
Pink flowered plants    26 (0.26)    45 (0.45)

Here the question of interest is whether the proportion of red vs. pink flowered plants is contingent on which meadow they were sampled from.

USING JMP FOR A CONTINGENCY TABLE ANALYSIS (MAC AND PC)

As with goodness-of-fit tests, the data can either be organized as a summary table, or with each individual on a separate line. The summary table approach is shown first, using the example above. In this example, the data consist of a 2x2 table with two categories for each of two variables (red vs pink flowers; meadow A vs. meadow B). Contingency table analysis can be performed on tables of any size (2x3; 2x4; 2x5; 3x3; etc.) but interpretation of the analysis becomes increasingly difficult as the tables get larger, particularly when there are three or more different variables and/or three or more categories for one or more of the variables.

Set up the data in an Excel or JMP spreadsheet as follows.

Flower color    Meadow    Frequency
Red             A         74
Pink            A         26
Red             B         55
Pink            B         45

One important thing to note about this table is that the Flower color and Meadow variables should be categorical, not numeric. If you had numbered your meadows 1 and 2 rather than called them A and B, JMP would define the Meadow variable as continuous (and indicate this with a blue triangle like the frequency variable). It would then not be able to perform a contingency table analysis.

With the data organized as shown above, in the Analyze menu choose Fit Y by X. In the window that appears, drag Flower color into the Y, Response box, Meadow into the X, Factor box, and Frequency into the Freq box.

Notice the image in the bottom left of this window. The Fit Y by X option can perform four basic types of analysis, depending on whether your X and Y variables are continuous (as shown by a blue triangle) or categorical (as shown by the red and green histograms). In this case, both variables are categorical, so Fit Y by X automatically carries out a Contingency analysis.

Hit OK. An analysis window will appear. Click on the red arrow next to Contingency Table and select Expected. The window will now look like this:

[Screenshot: JMP contingency output]

 

The main features of this window are as follows: 

  • The Mosaic Plot at the top is a colored, visual presentation of the data, showing the proportions of red and pink flowers in each of the two meadows. The width of the meadow columns reflects the relative sample sizes in each meadow. In this case sample sizes are both 100, so the columns are of equal width. If sample sizes in the two meadows had been different, the column widths would have been different too. The narrow bar on the right shows the overall proportions of red vs. pink flowers in the entire data set.
  • The Contingency Table shows the observed data, expressed either as numbers (top row), as a percentage of the total sample of 200 (second row), as a percentage of each column (third row) or as a percentage of each row (fourth row). The final row shows the expected numbers of flowers of each color if the proportions of red and pink flowers were equal in the two samples, given the total number of flowers of each color (71 pink and 129 red).
  • The Tests list shows several different statistical analyses of the data. The degrees of freedom (DF) is given in the first row. For a contingency table the degrees of freedom is calculated as

(# rows - 1) x (# columns - 1)

In this case there are two rows and two columns, so DF = (2 - 1) x (2 - 1) = 1.

  • As with the goodness-of-fit test described earlier, the test called Pearson is the one that is normally reported as a chi-square test. In this case the chi-square value is 7.883 and the associated p-value (Prob>ChiSq) is 0.0050. It is highlighted in orange because it is less than 0.05, indicating statistical significance. This means that the difference in the proportions of red vs. pink flowers in your two samples probably reflects a real difference between the two meadows, rather than being just due to chance events that occurred during your sampling procedure.
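The expected values, degrees of freedom and Pearson chi-square value described above can be reproduced by hand or with a short script. The Python sketch below is an illustration only; it uses cell counts consistent with the totals quoted above (129 red and 71 pink, 100 plants per meadow), and the shortcut used for the p-value is valid only for 1 degree of freedom.

```python
import math

# Observed counts, keyed by (flower color, meadow).
observed = {
    ("Red", "A"): 74, ("Pink", "A"): 26,
    ("Red", "B"): 55, ("Pink", "B"): 45,
}
colors = ["Red", "Pink"]
meadows = ["A", "B"]

row_totals = {c: sum(observed[(c, m)] for m in meadows) for c in colors}
col_totals = {m: sum(observed[(c, m)] for c in colors) for m in meadows}
grand_total = sum(observed.values())

# Expected count for each cell = row total x column total / grand total.
expected = {}
chi_square = 0.0
for c in colors:
    for m in meadows:
        e = row_totals[c] * col_totals[m] / grand_total
        expected[(c, m)] = e
        chi_square += (observed[(c, m)] - e) ** 2 / e

df = (len(colors) - 1) * (len(meadows) - 1)

# For 1 degree of freedom only, the p-value is erfc(sqrt(chi-square / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, d.f. = {df}, p = {p_value:.4f}")
```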

 

 

The other way to organize your data

If you organized your data with one line per individual, it would look like this 

Flower    Meadow    Color
1         A         Red
2         A         Red
3         A         Red
4         A         Pink
5         B         Red
6         B         Red
7         B         Pink
8         B         Pink

In this case, you would select Fit Y by X as before, drag Color into the Y, Response box, Meadow into the X, Factor box (with no variable in the Freq box) and then Hit OK.
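With one line per individual, the analysis starts by cross-tabulating the two categorical columns. As a hedged illustration of that step (using only the eight example rows shown above, which are far too few for a real test), the cross-tabulation could be done in Python as follows:

```python
from collections import Counter

# One record per flower: (flower number, meadow, color), as in the table above.
records = [
    (1, "A", "Red"), (2, "A", "Red"), (3, "A", "Red"), (4, "A", "Pink"),
    (5, "B", "Red"), (6, "B", "Red"), (7, "B", "Pink"), (8, "B", "Pink"),
]

# Count the number of flowers in each (color, meadow) combination.
table = Counter((color, meadow) for _, meadow, color in records)

print("Color   Meadow A   Meadow B")
for color in ("Red", "Pink"):
    print(f"{color:7s} {table[(color, 'A')]:8d} {table[(color, 'B')]:10d}")
```

The counts in this cross-tabulation correspond to the Frequency column of the summary-table layout.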
