Correlation and Linear Regression
Correlation and linear regression are related statistical techniques that examine the association between two numeric variables. (If your variables are both categorical, see Contingency table analysis on the Chi-square tests page.) The data can be either continuous or discrete, as long as they have a normal distribution.
- Continuously distributed numeric variables are ones that, in principle, can take an infinite number of values if measured precisely enough - for example: body mass, height, nitrogen concentration in a water sample, or cholesterol level in the bloodstream.
- Discrete numeric variables are ones that can take only a certain set of values - for example: the number of leaves on a tree; the number of bacterial colonies on a petri dish. Both these variables can take only integer values, although the number of possible values is very large.
To learn how to determine whether your data have a normal distribution, see Foundational Material. If your data are not normally distributed, alternative tests are available: see Nonparametric Statistics.
In what ways do the techniques differ?
- Correlation analysis is used to describe the strength of a linear association between two numeric variables and whether the association is positive or negative. Linear means that the data show a straight line relationship in a scatterplot. Correlation analysis cannot accurately describe the strength of non-linear (curved) relationships. A positive association means that as one variable increases, the other also increases. A negative association means that as one variable increases, the other decreases.
- Linear regression also describes the strength and direction of a linear relationship between two numeric variables, but is also used to predict the value of one of the variables (the Y-variable) from the other (the X-variable). Correlation analysis is not capable of doing the latter.
- If you are certain that one of the variables has a causal influence on the other (as might be the case in a well-controlled experiment), regression is the better technique. Under these circumstances, the X-variable is often called the independent (or predictor) variable, and the Y-variable the dependent (or response) variable.
- Regression is also appropriate if you want to make predictions about the value of one variable from the value of the other even if you have no reason to believe that the relationship is causal.
- Correlation analysis should be used if you have no good reason to believe that the relationship between the variables is causal and if you are only interested in determining the strength and direction of the association. Correlation is often the more appropriate technique for non-experimental data when cause-effect relationships are uncertain.
Below is an Excel spreadsheet containing data on the total number of flowers, and the mean number of stamens, mean number of ovules, and mean petal mass per flower, of 38 prairie larkspur (Delphinium virescens) plants. Stamens are the male reproductive structures of a flower, responsible for pollen grain production. Ovules are female structures that, when fertilized by a pollen grain, mature into seeds. The botanist who collected these data was interested in knowing whether the size and number of the three types of floral structure (petals, stamens, and ovules) were associated; in other words, whether plants that had flowers with large petals also produced many (or few) ovules, or many (or few) pollen grains, per flower. There was no reason to believe that the relationship between these variables was causal (i.e., that having large petals caused a flower also to produce many ovules), so correlation analysis, rather than linear regression, was the appropriate method for these data.
The null hypothesis for these tests is that there are no correlations between mean petal mass, stamen number and ovule number per flower.
The analysis can be performed in JMP as follows:
- Import the data into JMP and check that the variables of interest (Mean Stamens, Mean Ovules, and Mean Petal Mass) are all coded as continuous variables, as indicated by the blue triangles in the column list to the left.
- From the Analyze menu select Multivariate Methods then Multivariate. In the window that appears click or drag the three variables of interest (Mean Stamens, Mean Ovules, and Mean Petal Mass) into the Y, Columns box and hit OK.
- In the new window, click on the red arrow next to Multivariate and select Pairwise Correlations and also Nonparametric Correlations – Spearman’s ρ.
- Click on the red arrow next to Scatterplot Matrix and select Show Correlations and Show Histogram-Horizontal.
- The window should now look like this:
[Figure: JMP output for the larkspur correlation analysis (Larkspur flowers 2005 JMP output)]
- The Correlations table shows a set of pairwise correlations between all three variables. Each number in the table is a Pearson product-moment correlation coefficient (often just called Pearson’s r), which is an appropriate measure of the correlation between numeric variables when both have a normal distribution.
- Pearson’s r ranges from -1 (a perfect negative correlation, with all the data points falling on a straight line) to 0 (no correlation at all between the variables) to +1 (a perfect positive correlation).
- Note that the table is symmetrical about the diagonal (i.e., the same information is given twice) and that the diagonal values are all 1.000 (i.e., every variable is perfectly correlated with itself!)
- The Scatterplot Matrix shows scatterplots of all the data for each pairwise correlation. Pearson’s r values are shown in each scatterplot. The red ellipses show the 95% bivariate normal density ellipses. If both variables are normally distributed, 95% of the data should fall within these ellipses. As with the Correlations table, the Matrix is symmetrical about the diagonal and each scatterplot is shown twice but with the X and Y variables reversed above and below the diagonal. Which variable is the X and which the Y for each scatterplot is indicated by the variable names in the diagonal boxes, which also show histograms of the data for each variable. In all three cases the histograms are symmetrical and bell shaped, indicating that the data are approximately normally distributed. (To learn how to carry out formal statistical tests for normality, see Foundational Material.)
- The Pairwise Correlations table shows a list of the three Pearson’s r values (the same as in the Correlations table), along with the sample size (Count) for each analysis, the 95% confidence limits for each correlation, and the p-value (Signif Prob) for the test of each null hypothesis. In this case the correlations between (1) ovule number and stamen number and (2) petal mass and stamen number are not statistically significant (p = 0.46 and p = 0.30 respectively; both above the conventional threshold of 0.05) whereas the correlation between petal mass and ovule number is statistically significant (p = 0.0075). The figure on the right of the table shows each of the correlation coefficients in visual form relative to the possible range of -1 to +1.
- Based on these analyses we cannot reject the null hypothesis of no association between variables for (1) ovule number and stamen number and (2) petal mass and stamen number. However, we are justified in rejecting the null hypothesis and accepting the alternative hypothesis of a correlation between (3) petal mass and ovule number.
- When reporting the results of correlation analyses in a paper, you would normally report the Pearson’s r value, the sample size, and the p-value for each test. For example, in the Results section, you might say “There was a significant correlation between petal mass and ovule number (Pearson’s r = 0.427, n = 38, p = 0.0075)”. If you had performed multiple analyses these would normally be reported in a table.
- The Nonparametric: Spearman’s ρ table shows a list of alternative correlation coefficients (Spearman’s rank correlation or Spearman’s ρ) and associated p-values that can be reported if the variables do not both have a normal distribution. In this case you can see that there is broad agreement between the Pearson’s r and Spearman’s ρ coefficients and p-values.
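The p-value that accompanies a Pearson’s r can be reproduced by hand, which helps when checking that a reported r, n, and p are mutually consistent. The sketch below assumes the usual test, in which t = r√(n − 2) / √(1 − r²) is compared with a t-distribution on n − 2 degrees of freedom; whether JMP uses exactly this calculation internally is an assumption. The helper names (`t_pdf`, `two_sided_t_p_value`, `pearson_r_p_value`) are illustrative, and the tail area is obtained by simple numerical integration so that only the Python standard library is needed. A statistics package would normally do this for you.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_t_p_value(t, df, steps=2000):
    """Two-sided p-value, 2 * P(T > |t|), using Simpson's rule.

    The density is integrated from 0 to |t|; doubling that area gives the
    probability of a value between -|t| and +|t|, and the p-value is the rest.
    `steps` must be even.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    return max(0.0, 1.0 - 2.0 * central_area)

def pearson_r_p_value(r, n):
    """Return (t, p) for the null hypothesis of zero correlation."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    return t, two_sided_t_p_value(t, df)

# Values reported above for petal mass versus ovule number.
t_stat, p = pearson_r_p_value(0.427, 38)
print(f"t = {t_stat:.2f} on 36 degrees of freedom, p = {p:.4f}")
```

With r = 0.427 and n = 38 this gives a p-value of roughly 0.0075, in line with the Signif Prob value quoted from the Pairwise Correlations table.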
Below is an Excel spreadsheet containing data on the sizes of 65 oak seedlings planted in 2015 as part of a habitat restoration. In addition to information on the transect that the seedlings were planted along and a numeric identifier for each seedling, there are three size measures: the stem diameter at 10 cm above the ground, the height of the stem (old wood height) in the previous year (2016), and the number of leaves in the measurement year (2017). The ecologists studying these seedlings were interested in knowing whether the height of the plant in 2016 influenced leaf production in 2017. Linear regression is a more appropriate method than correlation analysis here because the cause-effect relationship, if it exists, can only be in one direction: height in 2016 could, in principle, influence leaf production in 2017, but the reverse is not possible, even in principle. Moreover, even if there is no cause-effect relationship, it is still worth knowing whether leaf production in a particular year can be predicted from the size of the seedling the previous year.
The null hypothesis for this test is that there is no linear relationship between the number of leaves produced in 2017 and old wood height in 2016. If the null hypothesis is true, the data will form a scatterplot with no positive (upwards) or negative (downwards) trend: i.e., the fitted regression line will have a slope of zero.
The analysis can be performed in JMP as follows:
- Import the data into JMP and check that the variables of interest (Old wood height (cm) and # Leaves) are both coded as continuous variables, as indicated by the blue triangles in the column list to the left. Note that old wood height really is a continuous variable (because it can, in principle, take an infinite number of values), whereas the number of leaves is a numeric, but not continuous, variable (because it can take only integer values). However, as long as the data are numeric and have a normal distribution, linear regression can be performed.
- From the Analyze menu, select Fit Y by X and click or drag Old wood height into the X, Factor box and # Leaves into the Y, Response box.
- Notice the image in the bottom left of this window. The Fit Y by X option can perform four basic types of analysis, depending on whether your X and Y variables are continuous (as shown by a blue triangle) or categorical (as shown by the red and green histograms). In this case, both the X variable (Old wood height) and the Y variable (# Leaves) are continuous, so Fit Y by X automatically carries out a Bivariate analysis, of which linear regression is one particular type.
- Hit OK. A scatterplot of the data will appear with Old wood height on the X-axis and # Leaves on the Y-axis.
- Click on the red arrow next to Bivariate Fit… and select Histogram Borders, Fit Mean, and Fit Line. Then click on the red arrow next to Linear Fit below the scatterplot and select Confid Curves Fit. The window will now look like this:
[Figure: JMP output for the oak seedling linear regression (SW Oaks Leaves 2017 JMP Output)]
The statistical output provides the following information:
- The histograms along the edges of the scatterplot show the distributions of the X and Y variables. Here, both variables show a symmetrical bell-shaped curve which appears to be approximately normal, so we will proceed with the linear regression analysis. (To learn how to carry out formal statistical tests for normality, see Foundational Material.)
- In the scatterplot, the horizontal green line (the color may differ in your output) shows the overall mean value of the Y-variable (# Leaves). This is the null hypothesis for the statistical test (i.e., a line of no trend in the data).
- The solid red line (again the color may differ) shows the best fit linear regression to the data; the dashed red lines indicate the 95% confidence limits for the fitted line. The fact that these 95% confidence limits do not enclose the green line of zero slope is a visual indicator that the null hypothesis can be rejected.
- The Linear Fit table shows the mathematical equation for the fitted regression line. Remember that a straight line can be described by the equation
Y = a + bX
- Where a is the intercept on the Y-axis and b is the slope of the line. So here, the intercept is estimated to be 11.329 and the slope is estimated to be 0.697 (both estimated values are positive).
- In the Summary of Fit table the most useful pieces of information are
- RSquare = 0.281
- Observations = 65
- RSquare (more properly written as R²) indicates the proportion of the variation in the Y-variable that is explained by variation in the X-variable. As a proportion it ranges from 0 to 1, with larger values indicating a better fit between the data and the regression line. Here the R² value is 0.28 (28%), indicating a moderately good fit of the data to the line. See Foundational Material for more detail on R² values.
- Observations is simply the sample size (number of pairs of data).
- The Analysis of Variance table shows a statistical test of the null hypothesis, called an F-test. Here we will simply note that the p-value (Prob > F) is < 0.0001, much less than the normal threshold for significance (0.05). This means that we can reject the null hypothesis and accept the alternative hypothesis that there is a relationship between the two variables. The trend is far stronger than can be explained by chance (sampling variation) alone. See One Factor Analysis of Variance for a detailed explanation of how to interpret an Analysis of Variance table.
- The Parameter Estimates table shows the estimated values for the two parameters of the linear regression equation (the intercept and the slope) and the standard errors of these estimates, along with t-tests (t-Ratio) and p-values (Prob>|t|) for the two null hypotheses that the intercept is zero and that the slope is zero. You can see that both p-values are less than the normal threshold for significance (0.05), so we can reject both null hypotheses and accept the alternative hypotheses that the parameters are different from zero. The deviations from zero are far greater than can be explained by chance (sampling variation) alone.
- When reporting the results of a linear regression analysis you would normally provide the following information, either in the text of the Results section or (if you had performed multiple analyses) in a table:
- The regression line equation: Y = 11.329 + 0.697 X
- The sample size: n = 65
- The t-value and associated p-value for the null hypothesis that the slope is zero: t = 4.97, p < 0.0001 (or equivalent values for the F-test)
- Optionally, depending on whether it is of interest or not, the t-value and associated p-value for the null hypothesis that the intercept is zero: t = 2.29, p = 0.025
- Note that the t-test and the F-test are alternative ways of testing the null hypothesis that the slope of the regression line is zero. You can report either, but there is no need to report both.
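The quantities in the regression output are linked by simple formulas, and a minimal sketch can make those links concrete. The function below fits Y = a + bX by ordinary least squares using only the Python standard library; its name, the dictionary keys, and the synthetic data are all illustrative assumptions and not the oak seedling data. It shows that, for a regression with one X-variable, the F-ratio equals the square of the slope's t-ratio (which is why only one needs to be reported), and it then uses the values quoted above to check that t = 4.97 with n = 65 agrees with RSquare = 0.281, and to make a prediction from the fitted equation.

```python
import math
import random

def simple_linear_regression(x, y):
    """Ordinary least-squares fit of Y = a + bX with basic test statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)  # total sum of squares

    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept

    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_model = syy - ss_resid
    ms_resid = ss_resid / (n - 2)  # residual mean square on n - 2 df

    se_b = math.sqrt(ms_resid / sxx)
    se_a = math.sqrt(ms_resid * (1 / n + mean_x ** 2 / sxx))
    return {
        "n": n,
        "intercept": a,
        "slope": b,
        "r_squared": ss_model / syy,
        "t_slope": b / se_b,
        "t_intercept": a / se_a,
        "f_ratio": ss_model / ms_resid,
    }

# Synthetic, illustrative data only (NOT the oak seedling measurements).
random.seed(7)
x = [random.gauss(25, 6) for _ in range(65)]
y = [10 + 0.7 * xi + random.gauss(0, 7) for xi in x]
fit = simple_linear_regression(x, y)
print(f"Y = {fit['intercept']:.3f} + {fit['slope']:.3f} X, R^2 = {fit['r_squared']:.3f}")
print(f"t(slope)^2 = {fit['t_slope'] ** 2:.3f}, F = {fit['f_ratio']:.3f}")

# Consistency check on the reported values: R^2 = t^2 / (t^2 + n - 2).
t_reported, n_reported = 4.97, 65
r2_from_t = t_reported ** 2 / (t_reported ** 2 + n_reported - 2)
print(f"R^2 implied by t = 4.97 and n = 65: {r2_from_t:.3f}")

# Prediction from the reported equation for a hypothetical seedling with
# 20 cm of old wood height (a value chosen here only for illustration).
predicted_leaves = 11.329 + 0.697 * 20
print(f"Predicted # Leaves at 20 cm: {predicted_leaves:.1f}")
```

Predictions of this kind are most trustworthy for X-values inside the range of the data used to fit the line.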