TYPES OF CORRELATION.
MEASUREMENT OF CORRELATION.
Essence of correlation
Correlation is a measure of mutual correspondence between two variables and is denoted by the coefficient of correlation.
Applications and characteristics
a) The simple correlation coefficient, also called the Pearson's product-moment correlation coefficient, is used to indicate the extent to which two variables change with one another in a linear fashion.
b) The correlation coefficient can range from -1 to +1 and is unitless (Fig. A, B, C).
c) When the correlation coefficient approaches - 1, a change in one variable is more highly, or strongly, associated with an inverse linear change (i.e., a change in the opposite direction) in the other variable (Fig.A).
d) When the correlation coefficient equals zero, there is no association between the changes of the two variables (Fig.B).
e) When the correlation coefficient approaches +1, a change in one variable is more highly, or strongly, associated with a direct linear change in the other variable (Fig.C).
A correlation coefficient can be calculated validly only when both variables are subject to random sampling and each is chosen independently.
Although useful as one of the determinants of scientific causality, correlation by itself is not equivalent to causation.
For example, two correlated variables may be associated with another factor that causes them to appear correlated with each other.
A correlation may appear strong but be insignificant because of a small sample size.
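A short sketch of the three situations described in a)–e), using made-up data and NumPy (the data and the library choice are illustrative, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)

inverse = -2.0 * x + rng.normal(scale=0.3, size=200)   # strong inverse linear change (Fig. A)
unrelated = rng.normal(size=200)                        # no association (Fig. B)
direct = 2.0 * x + rng.normal(scale=0.3, size=200)      # strong direct linear change (Fig. C)

for label, y in [("A", inverse), ("B", unrelated), ("C", direct)]:
    r = np.corrcoef(x, y)[0, 1]
    print(label, round(r, 2))    # roughly -1, 0, and +1 respectively
```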
Table 2.12
Correlation connection
The following types of relation exist between phenomena and signs in nature:
a) the cause-and-effect connection: the connection between factors and phenomena, between factor signs and result signs;
b) the dependence of parallel changes of several signs on some third quantity.
The quantitative types of connection are: the functional connection, in which a strictly defined value of the second sign corresponds to any value of the first sign (for example, a definite area of a circle corresponds to each radius of the circle); and the correlation connection, in which several values of one sign correspond to each value of the other sign, varying around their average (for example, it is known that height and body mass are linked: in a group of persons of identical height there are different values of body mass, but these values vary within certain limits around their average).
Correlation is a concept that denotes the interconnection between signs.
A correlative connection implies a dependence between phenomena that does not have a strictly functional character.
A correlative connection reveals itself only in a mass of observations, that is, in an aggregate. Establishing a correlative connection presupposes identifying the causal connection that confirms the dependence of one phenomenon on the other.
By direction (character), a correlative connection can be direct or reverse. The correlation coefficient characterizing a direct connection carries the plus sign (+), and the coefficient characterizing a reverse connection carries the minus sign (−).
By strength, a correlative connection can be complete, strong, middle, or weak, or it can be absent.
THE SCHEME OF THE ESTIMATION OF CORRELATIVE CONNECTION BY THE COEFFICIENT OF CORRELATION
The force of connection | Direct (+) | Reverse (−)
Complete | +1 | −1
Strong | from +1 to +0.7 | from −1 to −0.7
Middle | from +0.7 to +0.3 | from −0.7 to −0.3
Weak | from +0.3 to 0 | from −0.3 to 0
The connection is absent | 0 | 0
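As a worked illustration of the scheme above, here is a small helper that names the direction and strength of a given coefficient; the function and its handling of the boundary values 0.7 and 0.3 are our own sketch:

```python
def describe_correlation(r):
    """Classify a correlation coefficient according to the scheme above."""
    direction = "direct (+)" if r > 0 else "reverse (-)" if r < 0 else "none"
    strength = abs(r)
    if strength == 0:
        return "the connection is absent"
    if strength == 1:
        return f"complete, {direction}"
    if strength >= 0.7:          # boundary 0.7 assigned to "strong" here
        return f"strong, {direction}"
    if strength >= 0.3:          # boundary 0.3 assigned to "middle" here
        return f"middle, {direction}"
    return f"weak, {direction}"

print(describe_correlation(-0.82))   # strong, reverse (-)
print(describe_correlation(0.45))    # middle, direct (+)
```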
The correlative connection can be:
1. By the direction
- direct (+) – as one sign increases, the mean value of the other sign increases;
- reverse (−) – as one sign increases, the mean value of the other sign decreases;
2. By the character
- rectilinear – relatively even changes of the mean values of one sign are accompanied by even changes of the other (for example, minimal and maximal arterial pressure);
- curvilinear – with an even change of one sign, the mean values of the other sign may increase or decrease.
Methods of determination of the coefficient of correlation:
The correlation coefficient (rxy) gives, in a single number, a picture of the direction and strength of the connection between the phenomena under study.
The method of squares (Pearson's method) is frequently used for the determination of the correlation coefficient:
rxy = Σ(dx · dy) / √(Σdx² · Σdy²), where:
x and y are the signs between which the connection is determined;
dx and dy are the deviations of each variant from the arithmetic means calculated for sign x and sign y (Mx and My);
Σ is the summation sign.
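A minimal sketch of the method of squares in plain Python; the paired height and body-mass values are invented purely for illustration:

```python
import math

# Hypothetical paired observations (e.g., height x in cm and body mass y in kg)
x = [160, 165, 170, 175, 180, 185]
y = [55, 60, 63, 66, 72, 78]

n = len(x)
mx = sum(x) / n          # Mx, arithmetic mean of sign x
my = sum(y) / n          # My, arithmetic mean of sign y

dx = [xi - mx for xi in x]   # deviations from Mx
dy = [yi - my for yi in y]   # deviations from My

# Method of squares (Pearson): rxy = Σ(dx·dy) / √(Σdx² · Σdy²)
rxy = sum(a * b for a, b in zip(dx, dy)) / math.sqrt(
    sum(a * a for a in dx) * sum(b * b for b in dy)
)
print(round(rxy, 3))   # a value close to +1 indicates a strong direct connection
```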
The second method of determining the correlation coefficient is the rank method (Spearman's method). It is used when n < 30 and when approximate information is sufficient to estimate the character (direction) and strength of the connection:
rxy = 1 − (6 · Σd²) / (n · (n² − 1)), where:
x and y are the signs between which the connection is determined;
6 is a constant coefficient;
d is the difference of ranks;
n is the number of observations.
Determination of the error of the rank correlation coefficient (obtained by Spearman's method) and of the criterion t:
mrxy = √((1 − rxy²) / (n − 2))  and  t = rxy / mrxy
The criterion t must be 2 or more, so that P = 95.5% or more.
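The same calculations can be sketched in Python; the paired observations below are invented, and ties between ranks are ignored for simplicity:

```python
import math

# Hypothetical paired observations
x = [12, 18, 25, 33, 40, 47, 55, 61]
y = [30, 28, 31, 26, 22, 24, 18, 15]

def ranks(values):
    """Return the rank (1 = smallest) of each value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

rx, ry = ranks(x), ranks(y)
d2 = [(a - b) ** 2 for a, b in zip(rx, ry)]        # squared rank differences
n = len(x)

rho = 1 - (6 * sum(d2)) / (n * (n ** 2 - 1))       # Spearman's coefficient
m = math.sqrt((1 - rho ** 2) / (n - 2))            # its error, as in the text
t = abs(rho) / m                                   # criterion t

print(round(rho, 3), round(m, 3), round(t, 2))     # t >= 2 corresponds to P >= 95.5%
```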
Confidence of the correlation coefficient
The criterion t should be 3 or more, which corresponds to a probability of faultless prognosis (p) ≥ 99.7%.
Student's tests (t), which are based on the t distribution and reflect greater variation due to chance than the normal distribution, are used to analyze small samples.
The (t) distribution is a continuous, symmetrical, unimodal distribution of infinite range, which is bell-shaped, similar to the shape of the normal distribution, but more spread out.
As the sample size increases, the t distribution closely resembles the normal distribution. At infinite degrees of freedom, the t and normal distributions are identical, and the t values equal the critical ratio values.
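A brief check of this convergence, assuming SciPy is available (the chosen degrees of freedom are illustrative):

```python
from scipy import stats

# Two-sided critical values for p = 0.05
for df in (2, 10, 30, 1000):
    t_crit = stats.t.ppf(0.975, df)    # t distribution with df degrees of freedom
    print(df, round(t_crit, 2))        # shrinks toward 1.96 as df grows

print("normal:", round(stats.norm.ppf(0.975), 2))   # critical ratio 1.96
```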
Table 2.13. Table of the critical ratio (abbreviated)
Probability that a value lies:
Critical ratio | between the mean and the critical ratio | within ± the critical ratio | outside ± the critical ratio
1.0 | .341 | .683 | .317
1.645 | .450 | .900 | .100
1.96 | .475 | .950 | .050
2.0 | .477 | .954 | .046
2.576 | .495 | .990 | .010
3.0 | .499 | .997 | .003
Fig. 2. The standardized normal distribution shown with the percentage of values included between critical ratios from the mean.
A. Student's test for a single small sample
Student's t test for a single small sample compares a single sample with a population.
Student's t tests are used to evaluate the null hypothesis for continuous variables for sample sizes less than 30.
The t table.
Probability values are derived from the t value and the number of degrees of freedom by using the t table. For each degree of freedom, a row of increasing t values corresponds to decreasing probabilities for accepting the null hypothesis.
Confidence intervals.
In small samples, especially sample sizes less than 30, the t distribution is used to calculate confidence intervals around the sample mean.
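A sketch of both steps (the single-sample t test and the small-sample confidence interval), assuming SciPy is available; the sample readings and the hypothesized population mean are invented for illustration:

```python
from scipy import stats

# Hypothetical small sample (n < 30), e.g., systolic blood pressure readings
sample = [122, 118, 130, 125, 119, 128, 124, 121, 127, 123]
population_mean = 120   # value stated by the null hypothesis (illustrative)

# Student's t test for a single small sample
t_stat, p_value = stats.ttest_1samp(sample, population_mean)
print(round(t_stat, 2), round(p_value, 3))

# 95% confidence interval around the sample mean, based on the t distribution
n = len(sample)
mean = sum(sample) / n
sem = stats.sem(sample)                      # standard error of the mean
t_crit = stats.t.ppf(0.975, n - 1)           # two-sided critical t value, df = n - 1
print(round(mean - t_crit * sem, 1), round(mean + t_crit * sem, 1))
```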
The t table (abbreviated)
Degrees of freedom (df) | Probability .10 | Probability .05 | Probability .01
1 | 6.31 | 12.71 | 63.66
2 | 2.92 | 4.30 | 9.93
8 | 1.86 | 2.31 | 3.36
9 | 1.83 | 2.26 | 3.25
10 | 1.81 | 2.23 | 3.17
∞ | 1.64 | 1.96 | 2.58
The rank correlation method is used when: the number of observations is small; exact calculations are not needed; the variation series are open-ended; or the signs are expressed verbally (for example, the diagnosis of a disease).
The order of determination of the rank correlation coefficient:
1) form the variation series from the paired signs
2) replace every value of a variant by its rank (index) number
3) determine the difference of ranks: d = x − y
4) square the differences of ranks: d²
5) obtain the sum of the squared rank differences: Σd²
6) calculate rxy by the formula
7) determine the direction and strength of the connection
8) determine the error mrxy and the criterion t, and estimate the reliability of the faultless prognosis, p
Standard error of a mean
1. The standard error of a measure is based on a sample of a population and is the estimate of the standard deviation of the measure for the population.
2. The standard error of a mean, one of the most commonly used types of standard error, is a measure of the accuracy of the sample mean as an estimate of the population mean. In comparison, the standard deviation is a measure of the variability of the observations.
Applications
a) The standard error of the mean is used to construct confidence limits around a sample mean.
b) Standard errors are used in Student's t test.
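A minimal sketch of the distinction between the standard deviation and the standard error of the mean, using invented observations:

```python
import math

# Hypothetical sample of observations (illustrative values)
sample = [4.2, 4.8, 5.1, 4.6, 5.4, 4.9, 5.0, 4.7]
n = len(sample)

mean = sum(sample) / n
# Sample standard deviation: variability of the observations
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
# Standard error of the mean: accuracy of the sample mean as an estimate
# of the population mean
sem = sd / math.sqrt(n)

print(round(mean, 2), round(sd, 2), round(sem, 2))
```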
Confidence limits of a mean
The upper and lower confidence limits define the range of probability, that is, the confidence interval, for a measure of the population based on a measure of a sample and the measure's standard error.
1. Confidence intervals are expressed in terms of probability based on the error.
2. The confidence limits of a mean define the confidence interval for the population mean based on a sample mean.
For large samples, confidence limits are based on the critical ratio for the associated probability.
For a 95% confidence interval, the estimated sampling error is multiplied by 1.96; the chances are 95% (19 out of 20) that the interval includes the average result of all possible samples of the same size.
For small samples (less than 30), confidence limits are based on the t value for the number of degrees of freedom and the associated probability.
Applications and characteristics
(a) Confidence limits of a mean are used to estimate a population mean based on a sample from the population. The confidence interval equals the point estimate plus or minus its margin of error.
(b) A repeated random sample from the population will yield another point estimate similar to, but not necessarily the same as, the first sample. The 95% confidence interval probably will cluster in the same area.
(c) The most commonly used confidence limits are 95% confidence limits, which indicate that there is a 95% probability that the population mean lies within the upper and lower confidence limits and a 5% probability that it lies outside these limits (p = 0.05).
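A small sketch of these confidence limits; the means, standard deviations, and sample sizes are illustrative, and the small-sample t value is taken from the abbreviated t table above:

```python
import math

def confidence_limits(mean, sd, n, critical=1.96):
    """Confidence limits of a mean: sample mean ± critical value × standard error.
    For large samples the critical ratio (1.96 for 95%) is used; for small
    samples (n < 30) the t value for n - 1 degrees of freedom should be
    substituted, e.g. 2.26 for df = 9 at p = 0.05."""
    sem = sd / math.sqrt(n)
    return mean - critical * sem, mean + critical * sem

# Large sample: 95% limits with the critical ratio 1.96
print(confidence_limits(mean=120.0, sd=12.0, n=100))

# Small sample (n = 10): use the t value 2.26 for 9 degrees of freedom
print(confidence_limits(mean=120.0, sd=12.0, n=10, critical=2.26))
```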
Screening
Screening is the initial examination of an individual to detect disease not yet under medical care. Screening may be concerned with a single disease or with many diseases (called multiphase screening).
1. Purpose. Screening separates apparently healthy individuals into groups with either a high or a low probability of developing the disease for which the screening test is used.
2. Types of diseases. Screening may be concerned with many different types of diseases, including:
a) Acute communicable diseases (e.g., rubella)
b) Chronic communicable diseases (e.g., tuberculosis)
c) Acute noncommunicable diseases (e.g., lead toxicity)
d) Chronic noncommunicable diseases (e.g., glaucoma)
THE COEFFICIENT OF DETERMINATION
In earlier chapters we have been concerned with the statistical analysis of observations on a single variable. In some problems data were divided into two groups, and the dichotomy could, admittedly, have been regarded as defining a second variable. These two-sample problems are, however, rather artificial examples of the relationship between two variables.
In this chapter we examine more generally the association between two quantitative variables. We shall concentrate on situations in which the general trend is linear; that is, as one variable changes the other variable follows on the average a trend which can be represented approximately by a straight line.
The basic graphical technique for the two-variable situation is the scatter diagram, and it is good practice to plot the data in this form before attempting any numerical analysis. An example is shown in Fig. 7.1. In general the data refer to a number of individuals, each of which provides observations on two variables. In the scatter diagram each variable is allotted one of the two coordinate axes, and each individual thus defines a point, of which the coordinates are the observed values of the two variables. In Fig. 7.1 the individuals are towns and the two variables are the infant mortality rate and a certain index of overcrowding.
The scatter diagram gives a compact illustration of the distribution of each variable and of the relationship between the two variables. Further statistical analysis serves a number of purposes. It provides, first, numerical measures of some of the basic features of the relationship, rather as the mean and standard deviation provide concise measures of the most important features of the distribution of a single variable. Secondly, the investigator may wish to make a prediction of the value of one variable when the value of the other variable is known. It will normally be impossible to predict with complete certainty, but we may hope to say something about the mean value and the variability of the predicted variable. From Fig. 7.1, for instance, it appears roughly that a town with 0.6 persons per room was in 1961 likely to have an infant mortality rate of about 20 per 1000 live births on average, with a likely range of about 14 to 26. A proper analysis might be expected to give more reliable figures than these rough guesses.
Fig. 7.1 Scatter diagram showing the mean number of persons per room and the infant mortality per 1000 live births for the 83 county boroughs in England and Wales in 1961.
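A minimal plotting sketch, assuming matplotlib is available; the overcrowding and mortality figures are invented to illustrate the layout of a scatter diagram and are not the values behind Fig. 7.1:

```python
import matplotlib.pyplot as plt

# Invented (overcrowding index, infant mortality rate) pairs for illustration
persons_per_room = [0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85]
infant_mortality = [18, 20, 21, 23, 22, 25, 27]

plt.scatter(persons_per_room, infant_mortality)          # one point per "individual" (town)
plt.xlabel("Mean number of persons per room")
plt.ylabel("Infant mortality per 1000 live births")
plt.title("Scatter diagram of overcrowding and infant mortality")
plt.show()
```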
Thirdly, the investigator may wish to assess the significance of the direction of an apparent trend. From the data of Fig. 7.1, for instance, could it safely be asserted that infant mortality increases on the average as the overcrowding index increases, or could the apparent trend in this direction have arisen easily by chance?
Yet another aim may be to correct the measurements of one variable for the effect of another variable. In a study of the forced expiratory volume (FEV) of workers in the cadmium industry who had been exposed for more than a certain number of years to cadmium fumes, a comparison was made with the FEV of other workers who had not been exposed. The mean FEV of the first group was lower than that of the second. However, the men in the first group tended to be older than those in the second, and FEV tends to decrease with age. The question therefore arises whether the difference in mean FEV could be explained purely by the age difference. To answer this question the relationship between FEV and age must be studied in some detail.
We must be careful to distinguish between association and causation. Two variables are associated if the distribution of one is affected by a knowledge of the value of the other. This does not mean that one variable causes the other. There is a strong association between the number of divorces made absolute in the United Kingdom during the first half of this century and the amount of tobacco imported (the ‘individuals’ in the scatter diagram here being the individual years). It does not follow either that tobacco is a serious cause of marital discontent, or that those whose marriages have broken down turn to tobacco for solace. Association does not imply causation.
A further distinction is between situations in which both variables can be thought of as random variables, the individuals being selected randomly or at least without reference to the values of either variable, and situations in which the values of one variable are deliberately selected by the investigator. An example of the first situation would be a study of the relationship between the height and the blood pressure of schoolchildren, the individuals being restricted to one sex and one age group. Here, the sample may not have been chosen strictly at random, but it can be thought of as roughly representative of a population of children of this age and sex from the same area and type of school. An example of the second situation would arise in a study of the growth of children between certain ages. The nature of the relationship between height and age, as illustrated by a scatter diagram, would depend very much on the age range chosen and the distribution of ages within this range.
The Regression Model; Analysis of Residuals
Before we can perform statistical inferences in regression and correlation, we must know whether the variables under consideration satisfy certain conditions. In this section, we discuss those conditions and examine methods for deciding whether they hold.
The Regression Model
Let’s return to the Orion illustration used throughout Chapter 4. In Table 14.1, we reproduce the data on age and price for a sample of 11 Orions.
With age as the predictor variable and price as the response variable, the regression equation for these data is ˆy = 195.47 − 20.26x, as we found in Chapter 4 on page 153. Recall that the regression equation can be used to predict the price of an Orion from its age. However, we cannot expect such predictions to be completely accurate because prices vary even for Orions of the same age.
For instance, the sample data in Table 14.1 include four 5-year-old Orions. Their prices are $8500, $8200, $8900, and $9800. We expect this variation in price for 5-year-old Orions because such cars generally have different mileages, interior conditions, paint quality, and so forth.
We use the population of all 5-year-old Orions to introduce some important regression terminology. The distribution of their prices is called the conditional distribution of the response variable “price” corresponding to the value 5 of the predictor variable “age.” Likewise, their mean price is called the conditional mean of the response variable “price” corresponding to the value 5 of the predictor variable “age.” Similar terminology applies to the standard deviation and other parameters.
Of course, there is a population of Orions for each age. The distribution, mean, and standard deviation of prices for that population are called the conditional distribution, conditional mean, and conditional standard deviation, respectively, of the response variable “price” corresponding to the value of the predictor variable “age.”
The terminology of conditional distributions, means, and standard deviations is used in general for any predictor variable and response variable. Using that terminology, we now state the conditions required for applying inferential methods in regression analysis.
Note: We refer to the line y = β0 + β1x—on which the conditional means of the response variable lie—as the population regression line and to its equation as the population regression equation.
The inferential procedures in regression are robust to moderate violations of Assumptions 1–3 for regression inferences. In other words, the inferential procedures work reasonably well provided the variables under consideration don’t violate any of those assumptions too badly.
Assumptions for Regression Inferences
Age and Price of Orions
For Orions, with age as the predictor variable and price as the response variable, what would it mean for the regression-inference Assumptions 1–3 to be satisfied? Display those assumptions graphically.
Solution
Satisfying regression-inference Assumptions 1–3 requires that there are constants β0, β1, and σ so that for each age, x, the prices of all Orions of that age are normally distributed with mean β0 + β1x and standard deviation σ. Thus the prices of all 2-year-old Orions must be normally distributed with mean β0 + β1 · 2 and standard deviation σ, the prices of all 3-year-old Orions must be normally distributed with mean β0 + β1 · 3 and standard deviation σ, and so on.
To display the assumptions for regression inferences graphically, let’s first consider Assumption 1. This assumption requires that for each age, the mean price of all Orions of that age lies on the line y = β0 + β1x, as shown in Fig. 14.1.
Assumptions 2 and 3 require that the price distributions for the various ages of Orions are all normally distributed with the same standard deviation, σ. Figure 14.2 illustrates those two assumptions for the price distributions of 2-, 5-, and 7-year-old Orions. The shapes of the three normal curves in Fig. 14.2 are identical because normal distributions that have the same standard deviation have the same shape.
Assumptions 1–3 for regression inferences, as they pertain to the variables age and price of Orions, can be portrayed graphically by combining Figs. 14.1 and 14.2 into a three-dimensional graph, as shown in Fig. 14.3. Whether those assumptions actually hold remains to be seen.
Estimating the Regression Parameters
Suppose that we are considering two variables, x and y, for which the assumptions for regression inferences are met. Then there are constants β0, β1, and σ so that, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean β0 + β1x and standard deviation σ.
Because the parameters β0, β1, and σ are usually unknown, we must estimate them from sample data.We use the y-intercept and slope of a sample regression line as point estimates of the y-intercept and slope, respectively, of the population regression line; that is, we use b0 and b1 to estimate β0 and β1, respectively. We note that b0 is an unbiased estimator of β0 and that b1 is an unbiased estimator of β1.
Equivalently, we use a sample regression line to estimate the unknown population regression line. Of course, a sample regression line ordinarily will not be the same as the population regression line, just as a sample mean generally will not equal the population mean. In Fig. 14.4, we illustrate this situation for the Orion example. Although the population regression line is unknown, we have drawn it to illustrate the difference between the population regression line and a sample regression line.
In Fig. 14.4, the sample regression line (the dashed line) is the best approximation that can be made to the population regression line (the solid line) by using the sample data in Table 14.1 on page 551. A different sample of Orions would almost certainly yield a different sample regression line.
The statistic used to obtain a point estimate for the common conditional standard deviation σ is called the standard error of the estimate.
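A short sketch of these point estimates, assuming NumPy is available; the age and price pairs below are invented for illustration and are not the Table 14.1 data:

```python
import numpy as np

# Illustrative (age in years, price in hundreds of dollars) pairs
x = np.array([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7], dtype=float)
y = np.array([85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48], dtype=float)

n = len(x)
# Least-squares point estimates b1 and b0 of the population slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Standard error of the estimate: point estimate of the common
# conditional standard deviation sigma
residuals = y - (b0 + b1 * x)
se = np.sqrt(np.sum(residuals ** 2) / (n - 2))

print(round(b0, 2), round(b1, 2), round(se, 2))
```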
Linear regression
Suppose that observations are made on variables x and y for each of a large number of individuals, and that we are interested in the way in which y changes on the average as x assumes different values. If it is appropriate to think of y as a random variable for any given value of x, we can enquire how the expectation of y changes with x. The probability distribution of y when x is known is referred to as a conditional distribution, and the conditional expectation is denoted by E(y|x). We make no assumption at this stage as to whether x is a random variable or not. In a study of heights and blood pressures of randomly chosen individuals both variables would be random; if x and y were respectively the age and height of children selected according to age, then only y would be random.
The conditional expectation, E(y|x), depends in general on x. It is called the regression function of y on x. If E(y|x) is drawn as a function of x it forms the regression curve. Two examples are shown in Fig. 7.2. The regression in Fig. 7.2(b) differs in two ways from that in Fig. 7.2(a). First, the curve in Fig. 7.2(b) is a straight line, the regression line of y on x. Secondly, the variation of y for fixed x is constant in Fig. 7.2(b), whereas in Fig. 7.2(a) the variation changes as x increases. The regression in (b) is called homoscedastic, that in (a) heteroscedastic.
a) b)
Fig. 7.2 Two regression curves of y on x: (a) non-linear and heteroscedastic; (b) linear and homoscedastic. The distributions shown are those of values of y at certain values of x
The situation represented by Fig. 7.2(b) is important not only because of its simplicity, but also because regressions which are approximately linear and homoscedastic occur frequently in scientific work. In the present discussion we shall make one further simplifying assumption—that the distribution of y for given x is normal.
The model may, then, be described by saying that, for a given x, y follows a normal distribution with mean
E(y | x) = α + βx
(the general equation of a straight line) and variance σ² (a constant). A set of data consists of n pairs of observations, denoted by (x1, y1), (x2, y2), ..., (xn, yn), each yi being an independent observation from the distribution N(α + βxi, σ²). How can we estimate the parameters α, β and σ², which characterize the model?
An intuitively attractive proposal is to draw the regression line through the n points on the scatter diagram so as to minimize the sum of squares of the distances, yi − Yi, of the points from the line, these distances being measured parallel to the y-axis (Fig. 7.3). This proposal is in accord with theoretical arguments leading to the least squares estimators of α and β, namely a and b, the values which minimize the residual sum of squares Σ(yi − Yi)², where Yi is given by the estimated regression equation
Yi = a + bxi.
Fig. 7.4 Scatter diagram showing the birth weight, x, and the increase of weight between 70 and 100 days as a percentage of x, for 32 babies, with the two regression lines
Inferences in Correlation
Frequently, we want to decide whether two variables are linearly correlated, that is, whether there is a linear relationship between the two variables. In the context of regression, we can make that decision by performing a hypothesis test for the slope of the population regression line.
Alternatively, we can perform a hypothesis test for the population linear correlation coefficient, ρ (rho). This parameter measures the linear correlation of all possible pairs of observations of two variables in the same way that a sample linear correlation coefficient, r, measures the linear correlation of a sample of pairs. Thus, ρ actually describes the strength of the linear relationship between two variables; r is only an estimate of ρ obtained from sample data.
The population linear correlation coefficient of two variables x and y always lies between −1 and 1. Values of ρ near −1 or 1 indicate a strong linear relationship between the variables, whereas values of ρ near 0 indicate a weak linear relationship between the variables. Note the following:
- If ρ = 0, the variables are linearly uncorrelated, meaning that there is no linear relationship between the variables.
- If ρ > 0, the variables are positively linearly correlated, meaning that y tends to increase linearly as x increases (and vice versa), with the tendency being greater the closer ρ is to 1.
- If ρ < 0, the variables are negatively linearly correlated, meaning that y tends to decrease linearly as x increases (and vice versa), with the tendency being greater the closer ρ is to −1.
- If ρ ≠ 0, the variables are linearly correlated. Linearly correlated variables are either positively linearly correlated or negatively linearly correlated.
As we mentioned, a sample linear correlation coefficient, r, is an estimate of the population linear correlation coefficient, ρ. Consequently, we can use r as a basis for performing a hypothesis test for ρ. To do so, we require the following fact.
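The fact alluded to is usually the t statistic for r with n − 2 degrees of freedom; a sketch under that assumption, assuming SciPy is available and using invented paired data:

```python
import math
from scipy import stats

# Illustrative paired sample
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.5, 7.0, 7.4]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)          # sample linear correlation coefficient

# Test H0: rho = 0 against Ha: rho != 0 with t = r * sqrt((n-2)/(1-r^2)), df = n - 2
t = r * math.sqrt((n - 2) / (1 - r ** 2))
p_value = 2 * stats.t.sf(abs(t), n - 2)

print(round(r, 3), round(t, 2), round(p_value, 4))
```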
Correlation and Causality
A major goal of many statistical studies is to determine whether one factor causes another. For example, does smoking cause lung cancer? In this unit, we will discuss how statistics can be used to search for correlations that might suggest a cause-and-effect relationship. Then we’ll explore the more difficult task of establishing causality.
Seeking Correlation
What does it mean when we say that smoking causes lung cancer? It certainly does not mean that you’ll get lung cancer if you smoke a single cigarette. It does not even mean that you’ll definitely get lung cancer if you smoke heavily for many years, since some heavy smokers do not get lung cancer. Rather, it is a statistical statement meaning that you are much more likely to get lung cancer if you smoke than if you don’t smoke.
Let’s try to understand how researchers learned that smoking causes lung cancer. Before they could investigate cause, researchers first needed to establish correlations between smoking and cancer. The process of establishing correlations began with observations. The early observations were informal. Doctors noticed that smokers made up a surprisingly high proportion of their patients with lung cancer. This suggestion of a linkage led to carefully conducted studies in which researchers compared lung cancer rates among smokers and nonsmokers. These studies showed clearly that heavier smokers were more likely to get lung cancer. In more formal terms, we say that there is a correlation between the variables amount of smoking and incidence of lung cancer. A correlation is a special type of relationship between variables, in which a rise or fall in one goes along with a corresponding rise or fall in the other.
Establishing a correlation between two variables does not mean that a change in one variable causes a change in the other. Thus, finding the correlation between smoking and lung cancer did not by itself prove that smoking causes lung cancer. We could imagine, for example, that some gene predisposes a person both to smoking and to lung cancer. Nevertheless, identifying the correlation was the crucial first step in learning that smoking causes lung cancer.
Time out to think
Suppose there really were a gene that made people prone to both smoking and lung cancer. Explain why we would still find a strong correlation between smoking and lung cancer in that case, but would not be able to say that smoking caused lung cancer.
Scatter Diagrams
Table 5.6 shows the production cost and gross receipts (total revenue from ticket sales) for the 15 biggest-budget science fiction and fantasy movies of all time (through mid-2006). Movie executives presumably hope there is a favorable correlation between the production budget and the receipts. That is, they hope that spending more to produce a movie will result in higher box office receipts. But is there such a correlation? We can look for a correlation by making a scatter diagram showing the relationship between the variables production cost and gross receipts.
The following procedure describes how we make the scatter diagram, which is shown in Figure 5.40:
1. We assign one variable to each axis, and we label each axis with values that comfortably fit the data. Here, we assign production cost to the horizontal axis and gross receipts to the vertical axis. We choose a range of $50 to $250 million for the production cost axis and $0 to $450 million for the gross receipts axis.
2. For each movie in Table 5.6, we plot a single point at the horizontal position corresponding to its production cost and the vertical position corresponding to its gross receipts. For example, the point for the movie Waterworld goes at a position of $175 million on the horizontal axis and $88 million on the vertical axis. The dashed lines on Figure 5.40 show how we locate this point.
3. (Optional) If we wish, we can label data points, as is done for selected points in Figure 5.40.
Time out to think
By studying Table 5.6, associate each of the unlabeled data points in Figure 5.40 with a particular movie.
Types of Correlation
Look carefully at the scatter diagram for movies in Figure 5.40. The dots seem to be scattered about with no apparent pattern. In other words, at least for these big-budget movies, there appears to be little or no correlation between the amount of money spent producing the movie and the amount of money it earned in gross receipts.
Now consider the scatter diagram in Figure 5.41, which shows the weights (in carats) and retail prices of 23 diamonds. Here, the dots show a clear upward trend, indicating that larger diamonds generally cost more. The correlation is not perfect. For example, the heaviest diamond is not the most expensive. But the overall trend seems fairly clear. Because the prices tend to increase with the weights, we say that Figure 5.41 shows a positive correlation.
In contrast, Figure 5.42 shows a scatter diagram for the variables life expectancy and infant mortality in 16 countries. We again see a clear trend, but this time it is a negative correlation: Countries with higher life expectancy tend to have lower infant mortality.
Besides stating whether a correlation exists, we can also discuss its strength. The more closely the data follow the general trend, the stronger is the correlation.
EXAMPLE 1 Inflation and Unemployment
Prior to the 1990s, most economists assumed that the unemployment rate and the inflation rate were negatively correlated. That is, when unemployment goes down, inflation goes up, and vice versa. Table 5.7 shows unemployment and inflation data for the period 1990–2006. Make a scatter diagram for these data. Based on your diagram, does it appear that the data support the historical claim of a link between the unemployment and inflation rates?
SOLUTION
We make the scatter diagram by plotting the variable unemployment rate on the horizontal axis and the variable inflation rate on the vertical axis. To make the graph easy to read, we use values ranging from 3.5% to 8% for the unemployment rate and from 0 to 6% for the inflation rate. Figure 5.43 shows the result. To the eye, there does not appear to be any obvious correlation between the two variables. (A calculation confirms that there is no appreciable correlation.) Thus, these data do not support the historical claim of a negative correlation between the unemployment and inflation rates.
EXAMPLE 2 Accuracy of Weather Forecasts
The scatter diagrams in Figure 5.44 show two weeks of data comparing the actual high temperature for the day with the same-day forecast (left diagram) and the three day forecast (right diagram). Discuss the types of correlation on each diagram.
SOLUTION
Both scatter diagrams show a general trend in which higher predicted temperatures mean higher actual temperatures. Thus, both show positive correlations. However, the points in the left diagram lie more nearly on a straight line, indicating a stronger correlation than in the right diagram. This makes sense, because we expect weather forecasts to be more accurate on the same day than three days in advance.
Possible Explanations for a Correlation We began by stating that correlations can help us search for cause-and-effect relationships. But we’ve already seen that causality is not the only possible explanation for a correlation. For example, the predicted temperatures on the horizontal axis of Figure 5.44 certainly do not cause the actual temperatures on the vertical axis. The following box summarizes three possible explanations for a correlation.
EXAMPLE 3 Explanation for a Correlation
Consider the correlation between infant mortality and life expectancy in Figure 5.42. Which of the three possible explanations for a correlation applies? Explain.
SOLUTION
The negative correlation between infant mortality and life expectancy is probably an example of common underlying cause. Both variables respond to an underlying variable that we might call quality of health care. In countries where health care is better in general, infant mortality is lower and life expectancy is higher.
EXAMPLE 4 How to Get Rich in the Stock Market (Maybe)
Every financial advisor has a strategy for predicting the direction of the stock market. Most focus on fundamental economic data, such as interest rates and corporate profits. But an alternative strategy relies on a remarkable correlation between the Super Bowl winner in January and the direction of the stock market for the rest of the year: The stock market tends to rise when a team from the old, pre-1970 NFL wins the Super Bowl, and tends to fall otherwise. This correlation successfully matched 28 of the first 32 Super Bowls to the stock market. Suppose that the Super Bowl just ended and the winner was the Detroit Lions, an old NFL team. Should you invest all your spare cash (and maybe even some that you borrow) in the stock market?
SOLUTION
Based on the reported correlation, you might be tempted to invest, since the old-NFL winner suggests a rising stock market over the rest of the year. However, this investment would make sense only if you believed that the Super Bowl result actually causes the stock market to move in a particular direction. This belief is clearly preposterous, and the correlation is undoubtedly a coincidence. If you are going to invest, don’t base your investment on this correlation.
ESTABLISHING CAUSALITY
Suppose you have discovered a correlation and suspect causality. How can you test your suspicion? Let’s return to the issue of smoking and lung cancer. The strong correlation between smoking and lung cancer did not by itself prove that smoking causes lung cancer. In principle, we could have looked for proof with a controlled experiment. But such an experiment would be unethical, since it would require forcing a group of randomly selected people to smoke cigarettes. So how was smoking established as a cause of lung cancer?
The answer involves several lines of evidence. First, researchers found correlations between smoking and lung cancer among many groups of people: women, men, and people of different races and cultures. Second, among groups of people that seemed otherwise identical, lung cancer was found to be rarer in nonsmokers. Third, people who smoked more and for longer periods of time were found to have higher rates of lung cancer. Fourth, when researchers accounted for other potential causes of lung cancer (such as exposure to radon gas or asbestos), they found that almost all the remaining lung cancer cases occurred among smokers.
These four lines of evidence made a strong case, but still did not rule out the possibility that some other factor, such as genetics, predisposes people both to smoking and to lung cancer. However, two additional lines of evidence made this possibility highly unlikely. One line of evidence came from animal experiments. In controlled experiments, animals were divided into randomly chosen treatment and control groups. The experiments still found a correlation between inhalation of cigarette smoke and lung cancer, which seems to rule out a genetic factor, at least in the animals. The final line of evidence came from biologists studying cell cultures (that is, small samples of human lung tissue). The biologists discovered the basic process by which ingredients in cigarette smoke can create cancer-causing mutations. This process does not appear to depend in any way on specific genetic factors, making it all but certain that lung cancer is caused by smoking and not by any preexisting genetic factor.
The following box summarizes these ideas about establishing causality. Generally speaking, the case for causality is stronger when more of these guidelines are met.
Time out to think
There’s a great deal of controversy concerning whether animal experiments are ethical. What is your opinion of animal experiments? Defend your opinion.
CASE STUDY Air Bags and Children
By the mid-1990s, passenger-side air bags had become commonplace in cars. Statistical studies showed that the air bags saved many lives in moderate- to high-speed collisions. But a disturbing pattern also appeared. In at least some cases, young children, especially infants and toddlers in child car seats, were killed by air bags in low-speed collisions.
At first, many safety advocates found it difficult to believe that air bags could be the cause of the deaths. But the observational evidence became stronger, meeting the first four guidelines for establishing causality. For example, the greater risk to infants in child car seats fit Guideline 3, because it indicated that being closer to the air bags increased the risk of death. (A child car seat sits on top of the built-in seat, thereby putting a child closer to the air bags than the child would be otherwise.)
To seal the case, safety experts undertook experiments using dummies. They found that children, because of their small size, often sit where they could be easily hurt by the explosive opening of an air bag. The experiments also showed that an air bag could impact a child car seat hard enough to cause death, thereby revealing the physical mechanism by which the deaths occurred.
CASE STUDY What Is Causing Global Warming?
Statistical measurements show that the global average temperature—the average temperature everywhere on Earth’s surface—has risen about 1.5°F in the past century, with more than half of this warming occurring in just the past 30 years. But what is causing this so-called global warming?
Scientists have for decades suspected that the temperature rise is tied to an increase in the atmospheric concentration of carbon dioxide and other greenhouse gases. Comparative studies of Earth and other planets, particularly Venus and Mars, show that the greenhouse gas concentration is the single most important factor in determining a planet’s average temperature. It is even more important than distance from the Sun. For example, Venus, which is about 30% closer than Earth to the Sun, would be only about 45°F warmer than Earth if it had an Earth-like atmosphere. But because Venus has a thick atmosphere made almost entirely of carbon dioxide, its actual surface temperature is about 880°F—hot enough to melt lead. The reason greenhouse gases cause warming is that they slow the escape of heat from a planet’s surface, thereby raising the surface temperature.
In other words, the physical mechanism by which greenhouse gases cause warming is well understood (satisfying Guideline 6 on our list), and there is no doubt that a large rise in carbon dioxide concentration would eventually cause Earth to become much warmer. Nevertheless, as you’ve surely heard, many people have questioned whether the current period of global warming really is due to humans or whether it might be due to natural variations in the carbon dioxide concentration or other natural factors.
In an attempt to answer these questions, the United States and other nations have devoted billions of dollars over the past two decades to an unprecedented effort to understand Earth’s climate. We still have much more to learn, but the research to date makes a strong case for human input of greenhouse gases as the cause of global warming. Two lines of evidence make the case particularly strong.
The first line of evidence comes from careful measurements of past and present carbon dioxide concentrations in Earth’s atmosphere. Figure 5.45 shows the data. Notice that past changes in the carbon dioxide concentration correlate clearly with temperature changes, confirming that we should expect a rising greenhouse gas concentration to cause rising temperatures. Moreover, while the past data show that the carbon dioxide concentration does indeed vary naturally, it also shows that the recent rise is much greater than any natural increase during the past several hundred thousand years. Human activity is the only viable explanation for the huge recent increase in carbon dioxide concentration.
The second line of evidence comes from experiments. We cannot perform controlled experiments with our entire planet, but we can run experiments with computer models that simulate the way Earth’s climate works. Earth’s climate is incredibly complex, and many uncertainties remain in attempts to model the climate on computers. However, today’s models are the result of decades of work and refinement. Each time a model of the past failed to match real data, scientists sought to understand the missing (or incorrect) ingredients in the model and then tried again with improved models. Today’s models are not perfect, but they match real climate data quite well, giving scientists confidence that the models have predictive value. Figure 5.46 compares model data and real data, showing good agreement and clearly suggesting that human activity is the cause of global warming. If you include the effects of the greenhouse gases put into the atmosphere by humans, the models agree with the data, but if you leave out these effects, the models fail.
Time out to think
Check the idea that human activity causes global warming against each of the six guidelines for establishing causality.
Confidence in Causality
If human activity is causing global warming, we’d be wise to change our activities so as to stop it. But while we have good reason to think that this is the case, not everyone is yet convinced. Moreover, the changes needed to slow global warming might be very expensive. How do we decide when we’ve reached the point where something like global warming requires steps to address it?
In an ideal world, we would continue to study the issue until we could establish for certain that human activity is the cause of global warming. However, we have seen that it is difficult to establish causality and often impossible to prove causality beyond all doubt. We are therefore forced to make decisions about global warming, and many other important issues, despite remaining uncertainty about cause and effect.
In other areas of mathematics, accepted techniques help us deal with uncertainty by allowing us to calculate numerical measures of possible errors. But there are no accepted ways to assign such numbers to the uncertainty that comes with questions of causality. Fortunately, another area of study has dealt with practical problems of causality for hundreds of years: our legal system. You may be familiar with the following three broad ways of expressing a legal level of confidence.
Time out to think
Given what you know about global warming, do you think that human activity is a possible cause, probable cause, or cause beyond reasonable doubt? Defend your opinion. Based on your level of confidence in the causality, how would you recommend setting policies with regard to global warming?
References:
1. David Machin. Medical statistics: a textbook for the health sciences / David Machin, Michael J. Campbell, Stephen J Walters. – John Wiley & Sons, Ltd., 2007. – 346 p.
2. Nathan Tintle. Introduction to statistical investigations / Nathan Tintle, Beth Chance, George Cobb, Allan Rossman, Soma Roy, Todd Swanson, Jill VanderStoep. – UCSD BIEB100, Winter 2013. – 540 p.
3. Armitage P. Statistical Methods in Medical Research / P. Armitage, G. Berry, J. Matthews. – Blackwell Science, 2002. – 826 p.
4. Larry Winner. Introduction to Biostatistics / Larry Winner. – Department of Statistics University of Florida, July 8, 2004. – 204 p.
5. Weiss N. A. (Neil A.) Elementary statistics / Neil A. Weiss; biographies by Carol A. Weiss. – 8th ed., 2012. – 774 p.