CHARACTERISTICS AND ANALYSIS OF STATISTICAL DATA

June 16, 2024
Contents

Characteristics and analysis of statistical data. Averages and variation indices.

Parametric and nonparametric methods of estimating and testing statistical hypotheses.

Analysis of correlation between indices of statistical samples.

 

Average value determination

Averages, which give a generalized quantitative description of a certain characteristic in a statistical totality under given conditions of place and time, are the most widespread form of statistical index. They represent the typical features of the variation of the characteristic in the phenomena under study. Because the quantitative description of a characteristic is tied to its qualitative side, averages should be examined only in the light of qualitative analysis. Besides the summarizing estimation of a certain characteristic, the need to determine average values for a totality also arises when two groups that differ qualitatively from each other are compared.

In health care practice, averages are used widely enough:

– to describe the organization of work of health care institutions (average bed occupancy, average length of stay in hospital, number of visits per inhabitant, and others);

– to describe indices of physical development (body length, body mass, head circumference of the newborn, and others);

– to determine medical-physiological indices of the organism (frequency of pulse, frequency of breathing, level of arterial pressure, and others);

– to estimate the data of medical-social and sanitary-hygienic research (average number of laboratory tests, average norms of the food ration, level of radiation contamination, and others).

Averages are widely used for comparison over time, which allows the main regularities of the development of a phenomenon to be characterized. Thus, for example, the regularity of growth of children of a certain age finds expression in generalized indices of physical development. Regularities in the dynamics (increase or decrease) of pulse rate, breathing, and clinical parameters in certain diseases are reflected in statistical indices that represent the physiological parameters of the organism.

In the study of medical-biological information, the most frequently used averages are:

– the arithmetic mean;

– the harmonic mean;

– the geometric mean.

In addition, the summarizing descriptive characteristics of a variation series known as the mode and the median find practical application.

Averages must be determined on the basis of mass generalization of facts and applied only to qualitatively homogeneous aggregates. This is the basic condition of their practical and scientific use.

An obligatory condition that the statistical material must meet for the calculation of averages is a sufficient number of observations. This criterion can be defined by the formulas presented in the section „Organization and conducting of statistical research”.

Separate elements (values) of an aggregate of objects that are homogeneous in qualitative composition (phenomena, parameters) are called variants, and their whole aggregate can be represented as a variation series, which is the basis for determining averages. A variation series is a series of variants and the frequencies corresponding to them. Variation series make it possible to establish the character of the distribution of the aggregate units over a given quantitative characteristic, and its variation, that is, the variety of individual values of the characteristic among the concrete units of the aggregate.

The mode is the variant that has the greatest frequency. The mode is used in those cases when it is necessary to describe the value of the characteristic that occurs most often in the aggregate under study. It is used only in large aggregates.


In statistics the median is the variant that occupies the middle (central) position in a variation series. The median divides the series in half: there is an identical number of aggregate units on both sides of it.

The arithmetic mean is the most widespread type of average by frequency of use. It can be simple or weighted. For a simple variation series, the simple arithmetic mean is determined, which is calculated as the ratio of the sum of the values of the variants to the total number of observations.

Important properties of the arithmetic mean:

1. The product of the mean and the sum of the frequencies always equals the sum of the products of the variants by their frequencies.

2. If some arbitrary number is subtracted from each variant, the new mean decreases by that same number.

3. If some arbitrary number is added to each variant, the mean increases by that same number. The second and third properties show that decreasing or increasing each variant by the same number decreases or increases the level of the characteristic by that same amount.

4. If each variant is divided by some arbitrary number, the arithmetic mean decreases by the same factor.

5. If each variant is multiplied by some arbitrary number, the arithmetic mean increases by the same factor.

6. If all the frequencies (weights) are divided or multiplied by some number, the arithmetic mean does not change: if we increase or decrease the frequencies of all variants equivalently, we do not change the weight of any separate variant of the series.

7. The sum of the deviations of the variants from the arithmetic mean always equals zero. This means that, with respect to the mean, the deviations of the variants to one side or the other cancel each other out.

8. These properties can be used to simplify the technique of determining the arithmetic mean of a variation series.
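Several of these properties can be checked numerically. The following minimal Python sketch (the series and the helper name are illustrative, not from the text) verifies them for an arbitrary weighted variation series:

```python
# Numerical check of the properties of the arithmetic mean listed above,
# using an arbitrary illustrative variation series.
variants = [3, 5, 5, 7, 9]        # variants (V)
frequencies = [1, 2, 3, 2, 1]     # frequencies (p)

def weighted_mean(vs, ps):
    """Weighted arithmetic mean: sum(V * p) / sum(p)."""
    return sum(v * p for v, p in zip(vs, ps)) / sum(ps)

m = weighted_mean(variants, frequencies)

# Property 1: the product of the mean and the sum of frequencies
# equals the sum of the products V * p.
assert abs(m * sum(frequencies) - sum(v * p for v, p in zip(variants, frequencies))) < 1e-9

# Property 2: subtracting a constant from every variant lowers the mean by that constant.
assert abs(weighted_mean([v - 2 for v in variants], frequencies) - (m - 2)) < 1e-9

# Property 4: dividing every variant by a constant divides the mean by it.
assert abs(weighted_mean([v / 2 for v in variants], frequencies) - m / 2) < 1e-9

# Property 6: scaling all frequencies by the same factor leaves the mean unchanged.
assert abs(weighted_mean(variants, [3 * p for p in frequencies]) - m) < 1e-9

# Property 7: the frequency-weighted deviations from the mean sum to zero.
assert abs(sum((v - m) * p for v, p in zip(variants, frequencies))) < 1e-9
```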

Table. Kinds of average figures

1. Mean (M): a generalized value that characterizes by one number an occurrence which may have many individual manifestations.

2. Mode (Mo): the value of the feature that is found in the totality most frequently.

3. Median (Me): the value of the feature that occupies the middle position in a variation series and divides it into two symmetrical parts.

Methods of determination of the mean

1. Simple arithmetic mean: calculated when the variants occur with equal frequency and the number of observations (n) is ≤ 30.

2. Weighted arithmetic mean: calculated when the variants occur with different frequencies and n is > 30.

3. Method of moments: calculated when the variants are expressed by large numbers and the number of observations runs into hundreds and thousands of cases.

 

Properties of the arithmetic mean

1. It occupies the middle position in a variation series.

2. It has an abstract character.

3. The sum of the deviations of all variants from the mean equals 0.

 

The harmonic mean is calculated in those cases when information about the numerator is known but the corresponding information about the denominator is absent.

The geometric mean is determined for parameters whose values change in geometric progression (change of population size in the period between censuses, results of vaccination, increase of body mass of the newborn during the separate months of life, and others).

The second property of statistical totalities is the average level of the phenomenon being studied.

A variation series is a series of numerical measurements of a certain characteristic, which differ from one another in size and are arranged in a definite order; for example, students arranged according to their height.

Elements of a variation series:

·  variant (V): each numerical value of the characteristic being studied;

·  frequency (p): the absolute count of a separate variant in the totality, which specifies how many times the given variant is observed in the variation series.

Variation series are simple and grouped. A simple variation series is a series in which each variant is registered separately, and the number of observations is less than 30.

Construction of a grouped variation series:

1. Determination of the number of groups in the variation series.

2. Determination of the length of the interval between groups (i):

   i = (Vmax – Vmin) / r, where r is the number of groups.

3. Determination of the beginning, middle and end of each group.

4. Assignment of the cases of observation to the groups.

5. Graphic representation of the variation series.
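The steps above can be sketched in Python as follows (the raw data and the choice of r = 3 groups are illustrative assumptions, not from the text):

```python
# Minimal sketch of building a grouped variation series from raw data.
data = [125, 126, 126, 127, 127, 127, 127, 128,
        128, 128, 128, 128, 129, 129, 129, 130]
r = 3                                  # chosen number of groups
i = (max(data) - min(data)) / r        # interval length: (Vmax - Vmin) / r

# Assign each observation to a group [start, start + i).
groups = {}
for v in data:
    idx = min(int((v - min(data)) / i), r - 1)   # clamp Vmax into the last group
    start = min(data) + idx * i
    groups.setdefault((start, start + i), 0)
    groups[(start, start + i)] += 1

for (lo, hi), freq in sorted(groups.items()):
    print(f"{lo:.1f}-{hi:.1f}: {freq}")
```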

The average level of a characteristic is the second property of a statistical totality.

The average level gives a summarizing description of the different sizes of a characteristic, expressing it by one number.

Examples: average duration of a patient's stay in bed; average weight of the newborn; average frequency of breathing.

An average value is a number that expresses the general measure of the characteristic under study in the totality.

Average values are parameters of the basic quality of the phenomena being studied, or a measure of the central tendency of the distribution of the variants.

In all research concerning the state of health, comparison of the phenomenon with a standard, in whose role the average value acts, is of great value.

Average values can be expressed by absolute or relative values.

The mode is the most frequently occurring value in the data set.

The mode is the size of the characteristic that is observed most frequently in the given totality; it is the variant that occurs in a variation series with the greatest frequency.

The median is the size of the characteristic that occupies the middle position in a variation series.

Ме = the variant with ordinal number (n + 1) / 2, at an odd number of observations

Ме = the half-sum of the variants with ordinal numbers n / 2 and n / 2 + 1, at an even number of observations

The median is the variant that occupies the middle position in the series. If the number of variants is even, the median is the average value of the two variants that occupy the middle positions.
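Both rules can be sketched as one small Python function (the function name and the data are illustrative):

```python
# Median of a variation series: the middle variant for an odd number of
# observations, the mean of the two middle variants for an even number.
def median(values):
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:                           # odd n: the (n + 1) / 2-th variant
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2   # even n: mean of the two middle variants

assert median([7, 1, 5]) == 5            # odd number of observations
assert median([7, 1, 5, 3]) == 4.0       # even number: (3 + 5) / 2
```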

So, phenomena that have a variant character are estimated by an average number. For example, 60 patients had different durations of disease, from one to 45 days.

The list of the duration of each case, that is, of the variants of the disease, gives a certain notion about this disease.

However, it is possible to characterize the duration of the course of this disease by one number, a so-called average parameter.

If we sum up the duration of the course of disease of each of the 60 patients and divide the received sum by the general number of patients, we shall have the average duration of the course of the disease. For example, in our case it is equal to 22 days.

This number can be compared to the average duration of another disease in order to draw conclusions, for example, about their relative complexity.

It is clear that the greater the duration of the course of a disease, the more complex the treatment. So, the average number allows a phenomenon that can have a set of individual displays to be characterized by one number.

Other average parameters include the average term of stay of the patient in hospital, the average number of beds, the average cost of treatment of one patient, etc.

For an estimation of the health of the population and the activity of medical institutions it is possible to use both general (continuous) and selective (sample) statistical sets.

Certainly, the general totality gives an absolutely authentic result. The sample gives an approximate result, but, under the condition of correct selection of observation units, this result does not differ essentially from that given by the general totality.

Taking this circumstance into account, and also the huge economy of forces and means, it is expedient to use mainly the selective statistical set.

Let us consider an example. It is necessary to research the health, in particular the morbidity, of the dwellers of a city where 200 thousand people live. It is certainly possible to take into account all diseases, but the number of such cases will reach several hundreds of thousands.

It is expedient to select from these 200 thousand a certain part of the people, such that it displays the qualitative structure of the population of the whole city.

The morbidity of this part will display the essence of the morbidity of the city dwellers.

For this purpose it is necessary to apply the specific kinds of sampling recommended by statistical science:

Mechanical, if each tenth or twentieth city dweller is selected, or selection follows a so-called table of random numbers.

Thus all the inhabitants have the same probability of getting into the sample, so the sample will display the age, sex, professional and other qualitative composition of the population.

Typological, if the general totality is preliminarily broken into types (age, sex, professional layers, etc.), and then a proportional part of the people is taken from each type.

Regional, if a certain area of the city is under consideration.

Combined, if the above-mentioned types of samples are united.

However, as we have said, the result received during selective research will all the same differ from the result received during continuous research.

It will be influenced by both objective reasons (each unit of observation contains both general and unique individual features) and subjective ones (the influence of those who collect the statistical material).

The arithmetic mean characterizes the totality by one number, summarizing that which can be attributed to all of its variants; therefore it has the same dimension as each of the variants.

М = ΣV / n (simple)

It is calculated in the cases when the variants are observed with identical frequency and their number is less than 30.

For example, the weight of six newborn boys is determined: 3000, 2600, 2800, 3100, 3200, 2700 g.

М = (3000 + 2600 + 2800 + 3100 + 3200 + 2700) / 6 = 2900 g.

The weighted arithmetic mean is determined according to the model:

М = Σ(V · p) / n (weighted), where n = Σp

It is calculated in the cases when the frequency of the variants is different and their number is larger than 30.

Table 2.8. Example: distribution of infants according to their height

Height, cm (X)    Number of infants, n
125               1
126               2
127               4
128               5
129               3
130               1

In a variation series the median and the mode can also be determined.
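For the grouped data of Table 2.8, the weighted mean can be computed directly, as a minimal Python sketch:

```python
# Weighted arithmetic mean for the grouped data of Table 2.8:
# M = sum(X * n) / sum(n).
heights = [125, 126, 127, 128, 129, 130]   # height X, cm
counts  = [1, 2, 4, 5, 3, 1]               # number of infants n

M = sum(x * n for x, n in zip(heights, counts)) / sum(counts)
print(round(M, 3))   # 127.625 cm
```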

М = А+ i           on the method of moments, where:

A – variant, that repeats oneself more frequently than the other in the variation row /conditional middle arithmetic/;

d – is a conditional deviation from conditional middle

                   d = V – A

It is calculated in the cases when the variants are large numbers, and their amount is hundred and thousand. This method supposes the possibility of getting not only the “М”, but also „σ” and „m”.
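Using the data of Table 2.8, with the most frequent variant A = 128 taken as the conditional mean and interval i = 1 (our illustrative choices), the method of moments gives the same result as the directly weighted mean:

```python
# Sketch of the method of moments:
# M = A + i * sum(d * p) / sum(p), with d = (V - A) / i.
heights = [125, 126, 127, 128, 129, 130]
counts  = [1, 2, 4, 5, 3, 1]
A, i = 128, 1   # conditional mean (most frequent variant) and interval

d = [(v - A) / i for v in heights]                 # conditional deviations
M = A + i * sum(dv * p for dv, p in zip(d, counts)) / sum(counts)
print(M)   # 127.625
```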

Properties of average values.

1. The mean occupies the middle position in a variation series:

   М = Мo = Ме (in a strictly symmetrical series)

2. The mean is a summarizing value, and the casual fluctuations and differences in individual data cannot be seen behind it.

3. The sum of the deviations of all variants from the mean equals zero:

   Σ (V – M) = 0

 

The third property of statistical totalities (the variety of a characteristic) characterizes the dispersion of the variants around an average value.

The criteria that determine the level of variety are:

Limit: the values of the extreme variants of a variation series

   lim = Vmin ÷ Vmax

Amplitude: the difference between the extreme variants of a variation series

   Am = Vmax – Vmin

The average quadratic deviation characterizes the dispersion of the variants around an average value (the inner structure of the totality).

σ = √(Σd² / n) (simple arithmetic method)

d = V – M (the genuine deviation of a variant from the true arithmetic mean)

σ = i · √(Σ(d² · p) / n – (Σ(d · p) / n)²) (method of moments)

The average quadratic deviation (σ) is needed for:

1. Estimation of the typicalness of the arithmetic mean (М is typical for the series if σ is less than 1/3 of the average value).

2. Obtaining the error of the average value.

3. Determination of the average norm of the phenomenon being studied (М ± 1σ), subnorm (М ± 2σ) and edge deviations (М ± 3σ).

4. Construction of the sigma net in the estimation of the physical development of an individual.

This dispersion of the variants around the average is characterized by the average quadratic deviation (σ):

σ = √(Σ(d² · p) / Σn),

where d is the deviation of a variant from the average: d = V – M.

If Σn ≤ 30, then 1 is subtracted from Σn.

Standard (quadratic) deviation

Definition.

The standard deviation is the positive square root of the variance.

Applications and characteristics.

The standard deviation is the most useful measure of dispersion.

In certain circumstances, quantitative probability statements that characterize a series, a sample of observations, or a total population can be derived from the standard deviation of the series, sample, or population.

Let’s make calculations for the above-stated examples:

Table 2.10

Hospital №1
Number of variants (n1)   Deviation (d1)   d1²   n1·d1²
1                         -2               4     4
2                         -1               1     2
10                         0               0     0
2                          1               1     2
1                          2               4     4
Σn1 = 16                                         Σn1·d1² = 12

Hospital №2
Number of variants (n2)   Deviation (d2)   d2²   n2·d2²
2                         -2               4     8
3                         -1               1     3
6                          0               0     0
3                          1               1     3
2                          2               4     8
Σn2 = 16                                         Σn2·d2² = 22
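The calculation for Table 2.10 can be completed with σ = √(Σ(n · d²) / Σn) for each hospital (a minimal Python sketch):

```python
# Average quadratic deviation for the grouped data of Table 2.10:
# sigma = sqrt(sum(n * d^2) / sum(n)).
import math

def sigma(freqs, devs):
    return math.sqrt(sum(n * d * d for n, d in zip(freqs, devs)) / sum(freqs))

devs = [-2, -1, 0, 1, 2]
hospital1 = [1, 2, 10, 2, 1]   # frequencies n1: sum = 16, sum(n*d^2) = 12
hospital2 = [2, 3, 6, 3, 2]    # frequencies n2: sum = 16, sum(n*d^2) = 22

print(round(sigma(hospital1, devs), 2))  # 0.87
print(round(sigma(hospital2, devs), 2))  # 1.17
```

The larger σ of Hospital №2 shows that its variants are more widely dispersed around the same mean.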

 

The coefficient of variation is a relative measure of variety; it is the percentage ratio of the standard deviation to the arithmetic mean:

C = σ / M × 100 %

С < 10 %: low variety

С = 10-20 %: middle variety

С > 20 %: high variety
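As a minimal Python sketch (the σ and M values are illustrative), the coefficient and its interpretation:

```python
# Coefficient of variation as defined above: C = sigma / M * 100 %.
def variation_coefficient(sigma, mean):
    return sigma / mean * 100

def variety_level(c):
    if c < 10:
        return "low variety"
    if c <= 20:
        return "middle variety"
    return "high variety"

c = variation_coefficient(sigma=0.87, mean=10.0)   # illustrative values
print(round(c, 1), variety_level(c))   # 8.7 low variety
```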

PERCENTAGES

How important are they?

An understanding of percentages is probably the first and most important concept to understand in statistics!

How easy are they to understand?

Percentages are easy to understand.

When are they used?

Percentages are mainly used in the tabulation of data in order to give the reader a scale on which to assess or compare the data.

What do they mean?

“Per cent” means per hundred, so a percentage describes a proportion of 100. For example 50% is 50 out of 100, or as a fraction 1⁄2. Other common percentages are 25% (25 out of 100 or 1⁄4), 75% (75 out of 100 or 3⁄4). To calculate a percentage, divide the number of items or patients in the category by the total number in the group and multiply by 100.
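The calculation rule above can be written as a one-line helper (a minimal Python sketch; the function name is ours):

```python
# Percentage: the number in the category divided by the total in the
# group, multiplied by 100.
def percentage(part, total):
    return part / total * 100

assert percentage(50, 100) == 50.0   # 50 out of 100 is 50%
assert percentage(1, 4) == 25.0      # a quarter is 25%
```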

Watch out for . . .

Authors can use percentages to hide the true size of the data. To say that 50% of a sample has a certain condition when there are only four people in the sample is clearly not providing the same level of information as 50% of a sample based on 400 people. So, percentages should be used as an additional help for the reader rather than replacing the actual data.

MEAN

Otherwise known as an arithmetic mean, or average.

How important is it?

A mean appeared in 2⁄3 of the papers surveyed, so it is important to have an understanding of how it is calculated.

How easy is it to understand?

One of the simplest statistical concepts to grasp. However, in most groups that we have taught there has been at least one person who admits not knowing how to calculate the mean, so we do not apologize for including it here.

When is it used?

It is used when the spread of the data is fairly similar on each side of the mid point, for example when the data are “normally distributed”. The “normal distribution” is referred to a lot in statistics. It’s the symmetrical, bell-shaped distribution of data shown in Fig. 1.

What does it mean?

The mean is the sum of all the values, divided by the number of values.

Watch out for…

If a value (or a number of values) is a lot smaller or larger than the others, “skewing” the data, the mean will then not give a good picture of the typical value.

For example, if there is a sixth patient aged 92 in the study then the mean age would be 62, even though only one woman is over 60 years old. In this case, the “median” may be a more suitable mid-point to use.
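As a numerical sketch of this example: the five original ages are not given in the text, so the values below are illustrative ones chosen so that the six-patient mean comes out at 62:

```python
# Illustration of an outlier skewing the mean while the median stays typical.
ages = [50, 52, 56, 58, 64, 92]   # illustrative; sixth patient aged 92

mean = sum(ages) / len(ages)
s = sorted(ages)
median = (s[len(s) // 2 - 1] + s[len(s) // 2]) / 2   # even n: mean of middle pair

print(mean)     # 62.0 -- pulled up by the outlier
print(median)   # 57.0 -- a better picture of the typical age
```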

A common multiple choice question is to ask the difference between mean, median and mode – make sure that you do not get confused between them.

MEDIAN

Sometimes known as the mid-point.

How important is it?

It is given in over a third of mainstream papers.

How easy is it to understand?

Even easier than the mean!

When is it used?

It is used to represent the average when the data are not symmetrical, for instance the “skewed” distribution in Fig. 2.

What does it mean?

It is the point which has half the values above, and half below.

Watch out for…

The median may be given with its inter-quartile range (IQR). The 1st quartile point has the 1⁄4 of the data below it, the 3rd quartile has the 3⁄4 of the sample below it, so the IQR contains the middle 1⁄2 of the sample. This can be shown in a “box and whisker” plot.

MODE

How important is it?

Rarely quoted in papers and of limited value.

How easy is it to understand?

An easy concept.

When is it used?

It is used when we need a label for the most frequently occurring event.

What does it mean?

The mode is the most common of a set of events.

You may see reference to a “bi-modal distribution”. Generally when this is mentioned in papers it is as a concept rather than from calculating the actual values, e.g. “The data appear to follow a bi-modal distribution”. See Fig. 5 for an example of where there are two “peaks” to the data, i.e. a bi-modal distribution.

The arrows point to the modes at ages 10–19 and 60–69.

Bi-modal data may suggest that two populations are present that are mixed together, so an average is not a suitable measure for the distribution.

STANDARD DEVIATION

How important is it?

Quoted in half of papers, it is used as the basis of a number of statistical calculations.

How easy is it to understand?

It is not an intuitive concept.

When is it used?

Standard deviation (SD) is used for data which are “normally distributed” (see page 9), to provide information on how much the data vary around their mean.

What does it mean?

SD indicates how much a set of values is spread around the average.

A range of one SD above and below the mean (abbreviated to ±1 SD) includes 68.2% of the values.

±2 SD includes 95.4% of the data.

±3 SD includes 99.7%.

Watch out for…

SD should only be used when the data have a normal distribution. However, means and SDs are often wrongly used for data which are not normally distributed.

A simple check for a normal distribution is to see if 2 SDs away from the mean are still within the possible range for the variable. For example, if we have some length of hospital stay data with a mean stay of 10 days and a SD of 8 days then:

mean – 2 × SD = 10 – 2 × 8 = 10 – 16 = –6 days.

This is clearly an impossible value for length of stay, so the data cannot be normally distributed. The mean and SDs are therefore not appropriate measures to use.
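The plausibility check described above can be sketched in Python (the helper name and the assumption that the variable has a minimum possible value of 0 are ours):

```python
# Quick check for a plausibly normal distribution: is mean - 2 * SD
# still within the possible range of the variable?
def plausibly_normal(mean, sd, minimum=0):
    return mean - 2 * sd >= minimum

# Length-of-stay example from the text: mean 10 days, SD 8 days.
print(10 - 2 * 8)                 # -6 days, an impossible length of stay
print(plausibly_normal(10, 8))    # False -- cannot be normally distributed
```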

Good news – it is not necessary to know how to calculate the SD.

It is worth learning the figures above off by heart, so a reminder –

±1 SD includes 68.2% of the data

±2 SD includes 95.4%,

±3 SD includes 99.7%.

Keeping the “normal distribution” curve in Fig. 6 in mind may help.

Examiners may ask what percentages of subjects are included in 1, 2 or 3 SDs from the mean. Again, try to memorize those percentages.

Measurement Error

The true score theory is a good simple model for measurement, but it may not always be an accurate reflection of reality. In particular, it assumes that any observation is composed of the true value plus some random error value. But is that reasonable? What if all error is not random? Isn’t it possible that some errors are systematic, that they hold across most or all of the members of a group? One way to deal with this notion is to revise the simple true score model by dividing the error component into two subcomponents, random error and systematic error. Here, we’ll look at the differences between these two types of errors and try to diagnose their effects on our research.


What is Random Error?

Random error is caused by any factors that randomly affect measurement of the variable across the sample. For instance, each person’s mood can inflate or deflate their performance on any occasion. In a particular testing, some children may be feeling in a good mood and others may be depressed. If mood affects their performance on the measure, it may artificially inflate the observed scores for some children and artificially deflate them for others. The important thing about random error is that it does not have any consistent effects across the entire sample. Instead, it pushes observed scores up or down randomly. This means that if we could see all of the random errors in a distribution they would have to sum to 0 — there would be as many negative errors as positive ones. The important property of random error is that it adds variability to the data but does not affect average performance for the group. Because of this, random error is sometimes considered noise.


What is Systematic Error?

Systematic error is caused by any factors that systematically affect measurement of the variable across the sample. For instance, if there is loud traffic going by just outside of a classroom where students are taking a test, this noise is liable to affect all of the children’s scores — in this case, systematically lowering them. Unlike random error, systematic errors tend to be consistently either positive or negative — because of this, systematic error is sometimes considered to be bias in measurement.


Reducing Measurement Error

So, how can we reduce measurement errors, random or systematic? One thing you can do is to pilot test your instruments, getting feedback from your respondents regarding how easy or hard the measure was and information about how the testing environment affected their performance. Second, if you are gathering measures using people to collect the data (as interviewers or observers) you should make sure you train them thoroughly so that they aren’t inadvertently introducing error. Third, when you collect the data for your study you should double-check the data thoroughly. All data entry for computer analysis should be “double-punched” and verified. This means that you enter the data twice, the second time having your data entry machine check that you are typing the exact same data you did the first time. Fourth, you can use statistical procedures to adjust for measurement error. These range from rather simple formulas you can apply directly to your data to very complex modeling procedures for modeling the error and its effects. Finally, one of the best things you can do to deal with measurement errors, especially systematic errors, is to use multiple measures of the same construct. Especially if the different measures don’t share the same systematic errors, you will be able to triangulate across the multiple measures and get a more accurate sense of what’s going on.

Theory of Reliability

What is reliability? We hear the term used a lot in research contexts, but what does it really mean? If you think about how we use the word “reliable” in everyday language, you might get a hint. For instance, we often speak about a machine as reliable: “I have a reliable car.” Or, news people talk about a “usually reliable source”. In both cases, the word reliable usually means “dependable” or “trustworthy.” In research, the term “reliable” also means dependable in a general sense, but that’s not a precise enough definition. What does it mean to have a dependable measure or observation in a research context? The reason “dependable” is not a good enough description is that it can be confused too easily with the idea of a valid measure (see Measurement Validity). Certainly, when we speak of a dependable measure, we mean one that is both reliable and valid. So we have to be a little more precise when we try to define reliability.

In research, the term reliability means “repeatability” or “consistency”. A measure is considered reliable if it would give us the same result over and over again (assuming that what we are measuring isn’t changing!).

Let’s explore in more detail what it means to say that a measure is “repeatable” or “consistent”. We’ll begin by defining a measure that we’ll arbitrarily label X. It might be a person’s score on a math achievement test or a measure of severity of illness. It is the value (numerical or otherwise) that we observe in our study. Now, to see how repeatable or consistent an observation is, we can measure it twice. We’ll use subscripts to indicate the first and second observation of the same measure. If we assume that what we’re measuring doesn’t change between the time of our first and second observation, we can begin to understand how we get at reliability. While we observe a score for what we’re measuring, we usually think of that score as consisting of two parts, the ‘true’ score or actual level for the person on that measure, and the ‘error’ in measuring it (see True Score Theory).

It’s important to keep in mind that we observe the X score — we never actually see the true (T) or error (e) scores. For instance, a student may get a score of 85 on a math achievement test. That’s the score we observe, an X of 85. But the reality might be that the student is actually better at math than that score indicates. Let’s say the student’s true math ability is 89 (i.e., T=89). That means that the error for that student is -4. What does this mean? Well, while the student’s true math ability may be 89, he/she may have had a bad day, may not have had breakfast, may have had an argument, or may have been distracted while taking the test. Factors like these can contribute to errors in measurement that make the student’s observed ability appear lower than their true or actual ability.

OK, back to reliability. If our measure, X, is reliable, we should find that if we measure or observe it twice on the same persons that the scores are pretty much the same. But why would they be the same? If you look at the figure you should see that the only thing that the two observations have in common is their true scores, T. How do you know that? Because the error scores (e1 and e2) have different subscripts indicating that they are different values. But the true score symbol T is the same for both observations. What does this mean? That the two observed scores, X1 and X2 are related only to the degree that the observations share true score. You should remember that the error score is assumed to be random. Sometimes errors will lead you to perform better on a test than your true ability (e.g., you had a good day guessing!) while other times it will lead you to score worse. But the true score — your true ability on that measure — would be the same on both observations (assuming, of course, that your true ability didn’t change between the two measurement occasions).

With this in mind, we can now define reliability more precisely. Reliability is a ratio or fraction. In layperson terms we might define this ratio as:

true level on the measure / the entire measure

You might think of reliability as the proportion of “truth” in your measure. Now, we don’t speak of the reliability of a measure for an individual — reliability is a characteristic of a measure that’s taken across individuals. So, to get closer to a more formal definition, let’s restate the definition above in terms of a set of observations. The easiest way to do this is to speak of the variance of the scores. Remember that the variance is a measure of the spread or distribution of a set of scores. So, we can now state the definition as:

the variance of the true score / the variance of the measure

We might put this into slightly more technical terms by using the abbreviated name for the variance and our variable names:

var(T) / var(X)

We’re getting to the critical part now. If you look at the equation above, you should recognize that we can easily determine or calculate the bottom part of the reliability ratio — it’s just the variance of the set of scores we observed (You remember how to calculate the variance, don’t you? It’s just the sum of the squared deviations of the scores from their mean, divided by the number of scores). But how do we calculate the variance of the true scores? We can’t see the true scores (we only see X)! Only God knows the true score for a specific observation. And, if we can’t calculate the variance of the true scores, we can’t compute our ratio, which means we can’t compute reliability! Everybody got that? The bottom line is…

we can’t compute reliability because we can’t calculate the variance of the true scores

Great. So where does that leave us? If we can’t compute reliability, perhaps the best we can do is to estimate it. Maybe we can get an estimate of the variability of the true scores. How do we do that? Remember our two observations, X1 and X2? We assume (using true score theory) that these two observations would be related to each other to the degree that they share true scores. So, let’s calculate the correlation between X1 and X2. Here’s a simple formula for the correlation:

covariance(X1, X2) / (sd(X1) * sd(X2))

where the ‘sd’ stands for the standard deviation (which is the square root of the variance). If we look carefully at this equation, we can see that the covariance, which simply measures the “shared” variance between measures must be an indicator of the variability of the true scores because the true scores in X1 and X2 are the only thing the two observations share! So, the top part is essentially an estimate of var(T) in this context. And, since the bottom part of the equation multiplies the standard deviation of one observation with the standard deviation of the same measure at another time, we would expect that these two values would be the same (it is the same measure we’re taking) and that this is essentially the same thing as squaring the standard deviation for either observation. But, the square of the standard deviation is the same thing as the variance of the measure. So, the bottom part of the equation becomes the variance of the measure (or var(X)). If you read this paragraph carefully, you should see that the correlation between two observations of the same measure is an estimate of reliability.
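This logic can be checked with a quick simulation. The sketch below uses made-up numbers purely for illustration: it generates true scores T, adds independent random errors to form two observations X1 and X2, and verifies that the correlation between the observations lands near the theoretical reliability var(T) / (var(T) + var(e)).

```python
import random
import statistics

random.seed(42)

# Simulate true score theory: X = T + e, with e random and independent of T.
n = 10000
T = [random.gauss(0, 1) for _ in range(n)]        # true scores, var(T) = 1
e1 = [random.gauss(0, 0.5) for _ in range(n)]     # error at occasion 1, var(e) = 0.25
e2 = [random.gauss(0, 0.5) for _ in range(n)]     # error at occasion 2
X1 = [t + e for t, e in zip(T, e1)]               # first observation
X2 = [t + e for t, e in zip(T, e2)]               # second observation

def correlation(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (statistics.pstdev(a) * statistics.pstdev(b))

# Theoretical reliability: var(T) / (var(T) + var(e)) = 1 / 1.25 = 0.8
r = correlation(X1, X2)
print(round(r, 2))  # close to the theoretical 0.8
```

The correlation recovers the share of true-score variance even though T itself is never observed, which is the whole point of the argument above.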

It’s time to reach some conclusions. We know from this discussion that we cannot calculate reliability because we cannot measure the true score component of an observation. But we also know that we can estimate the true score component as the covariance between two observations of the same measure. With that in mind, we can estimate the reliability as the correlation between two observations of the same measure. It turns out that there are several ways we can estimate this reliability correlation. These are discussed in Types of Reliability.

There’s only one other issue I want to address here. How big is an estimate of reliability? To figure this out, let’s go back to the equation given earlier:

var(T) / var(X)

and remember that because X = T + e, we can substitute in the bottom of the ratio:

var(T) / (var(T) + var(e))

With this slight change, we can easily determine the range of a reliability estimate. If a measure is perfectly reliable, there is no error in measurement — everything we observe is true score. Therefore, for a perfectly reliable measure, the equation would reduce to:

var(T) / var(T)

and reliability = 1. Now, if we have a perfectly unreliable measure, there is no true score — the measure is entirely error. In this case, the equation would reduce to:

0 / var(e)

and the reliability = 0. From this we know that reliability will always range between 0 and 1. The value of a reliability estimate tells us the proportion of variability in the measure attributable to the true score. A reliability of .5 means that about half of the variance of the observed score is attributable to truth and half is attributable to error. A reliability of .8 means the variability is about 80% true ability and 20% error. And so on.

 

Types of Reliability

You learned in the Theory of Reliability that it’s not possible to calculate reliability exactly. Instead, we have to estimate reliability, and this is always an imperfect endeavor. Here, I want to introduce the major reliability estimators and talk about their strengths and weaknesses.

There are four general classes of reliability estimates, each of which estimates reliability in a different way. They are:

· Inter-Rater or Inter-Observer Reliability: used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon.

· Test-Retest Reliability: used to assess the consistency of a measure from one time to another.

· Parallel-Forms Reliability: used to assess the consistency of the results of two tests constructed in the same way from the same content domain.

· Internal Consistency Reliability: used to assess the consistency of results across items within a test.

Let’s discuss each of these in turn.

Inter-Rater or Inter-Observer Reliability

Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.

So how do we determine whether two observers are being consistent in their observations? You probably should establish inter-rater reliability outside of the context of the measurement in your study. After all, if you use data from your study to establish reliability, and you find that reliability is low, you’re kind of stuck. Probably it’s best to do this as a side study or pilot study. And, if your study goes on for a long time, you may want to reestablish inter-rater reliability from time to time to assure that your raters aren’t changing.

There are two major ways to actually estimate inter-rater reliability. If your measurement consists of categories — the raters are checking off which category each observation falls in — you can calculate the percent of agreement between the raters. For instance, let’s say you had 100 observations that were being rated by two raters. For each observation, the rater could check one of three categories. Imagine that on 86 of the 100 observations the raters checked the same category. In this case, the percent of agreement would be 86%. OK, it’s a crude measure, but it does give an idea of how much agreement exists, and it works no matter how many categories are used for each observation.
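The percent-agreement calculation is simple enough to sketch directly. The ratings below are hypothetical, standing in for the 100-observation example in the text:

```python
# Hypothetical ratings: two raters assign each of 10 observations to
# category "A", "B", or "C".
rater1 = ["A", "B", "A", "C", "B", "B", "A", "C", "A", "B"]
rater2 = ["A", "B", "C", "C", "B", "A", "A", "C", "A", "B"]

# Count observations where both raters checked the same category.
agreements = sum(r1 == r2 for r1, r2 in zip(rater1, rater2))
percent_agreement = 100 * agreements / len(rater1)
print(percent_agreement)  # → 80.0
```

As the text notes, this works no matter how many categories the raters use, because it only asks whether the two checked the same one.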

The other major way to estimate inter-rater reliability is appropriate when the measure is a continuous one. There, all you need to do is calculate the correlation between the ratings of the two observers. For instance, they might be rating the overall level of activity in a classroom on a 1-to-7 scale. You could have them give their rating at regular time intervals (e.g., every 30 seconds). The correlation between these ratings would give you an estimate of the reliability or consistency between the raters.

You might think of this type of reliability as “calibrating” the observers. There are other things you could do to encourage reliability between observers, even if you don’t estimate it. For instance, I used to work in a psychiatric unit where every morning a nurse had to do a ten-item rating of each patient on the unit. Of course, we couldn’t count on the same nurse being present every day, so we had to find a way to assure that any of the nurses would give comparable ratings. The way we did it was to hold weekly “calibration” meetings where we would review all of the nurses’ ratings for several patients and discuss why they chose the specific values they did. If there were disagreements, the nurses would discuss them and attempt to come up with rules for deciding when they would give a “3” or a “4” for a rating on a specific item. Although this was not an estimate of reliability, it probably went a long way toward improving the reliability between raters.

Test-Retest Reliability

We estimate test-retest reliability when we administer the same test to the same sample on two different occasions. This approach assumes that there is no substantial change in the construct being measured between the two occasions. The amount of time allowed between measures is critical. We know that if we measure the same thing twice, the correlation between the two observations will depend in part on how much time elapses between the two measurement occasions. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because the two observations are related over time — the closer in time we get the more similar the factors that contribute to error. Since this correlation is the test-retest estimate of reliability, you can obtain considerably different estimates depending on the interval.

[Figure: test-retest reliability]

Parallel-Forms Reliability

In parallel forms reliability you first have to create two parallel forms. One way to accomplish this is to create a large set of questions that address the same construct and then randomly divide the questions into two sets. You administer both instruments to the same sample of people. The correlation between the two parallel forms is the estimate of reliability. One major problem with this approach is that you have to be able to generate lots of items that reflect the same construct. This is often no easy feat. Furthermore, this approach makes the assumption that the randomly divided halves are parallel or equivalent. Even by chance this will sometimes not be the case. The parallel forms approach is very similar to the split-half reliability described below. The major difference is that parallel forms are constructed so that the two forms can be used independent of each other and considered equivalent measures. For instance, we might be concerned about a testing threat to internal validity. If we use Form A for the pretest and Form B for the posttest, we minimize that problem. It would be even better if we randomly assign individuals to receive Form A or B on the pretest and then switch them on the posttest. With split-half reliability we have an instrument that we wish to use as a single measurement instrument and only develop randomly split halves for purposes of estimating reliability.

[Figure: parallel-forms reliability]

Internal Consistency Reliability

In internal consistency reliability estimation we use our single measurement instrument administered to a group of people on one occasion to estimate reliability. In effect we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results. We are looking at how consistent the results are for different items for the same construct within the measure. There are a wide variety of internal consistency measures that can be used.

Average Inter-item Correlation

The average inter-item correlation uses all of the items on our instrument that are designed to measure the same construct. We first compute the correlation between each pair of items, as illustrated in the figure. For example, if we have six items we will have 15 different item pairings (i.e., 15 correlations). The average inter-item correlation is simply the average or mean of all these correlations. In the example, we find an average inter-item correlation of .90 with the individual correlations ranging from .84 to .95.

[Figure: average inter-item correlation]
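As a sketch (with made-up scores for five people on six items, not the figure’s data), the average inter-item correlation can be computed like this:

```python
from itertools import combinations
import statistics

# Hypothetical scores: rows are people, columns are the six items.
scores = [
    [4, 5, 4, 5, 4, 5],
    [2, 3, 2, 3, 3, 2],
    [5, 5, 4, 5, 5, 4],
    [1, 2, 1, 1, 2, 2],
    [3, 3, 3, 4, 3, 3],
]
items = list(zip(*scores))  # one tuple of person scores per item

def corr(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (statistics.pstdev(a) * statistics.pstdev(b))

# Six items give 6*5/2 = 15 pairings, as in the text.
pair_corrs = [corr(a, b) for a, b in combinations(items, 2)]
avg_inter_item = statistics.fmean(pair_corrs)
print(len(pair_corrs), round(avg_inter_item, 2))
```

With consistent items the pairwise correlations are all high and positive, so their mean is a sensible single summary of internal consistency.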

Average Item-Total Correlation

This approach also uses the inter-item correlations. In addition, we compute a total score for the six items and use that as a seventh variable in the analysis. The figure shows the six item-to-total correlations at the bottom of the correlation matrix. They range from .82 to .88 in this sample analysis, with the average of these at .85.

[Figure: average item-total correlation]

Split-Half Reliability

In split-half reliability we randomly divide all items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate, as shown in the figure, is simply the correlation between these two total scores. In the example it is .87.

[Figure: split-half reliability]
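A minimal sketch of the split-half computation, using hypothetical scores for eight people on a six-item instrument and one random split:

```python
import random
import statistics

random.seed(0)

# Hypothetical 6-item instrument scored for 8 people (rows = people).
scores = [
    [4, 5, 4, 5, 4, 5],
    [2, 3, 2, 3, 3, 2],
    [5, 5, 4, 5, 5, 4],
    [1, 2, 1, 1, 2, 2],
    [3, 3, 3, 4, 3, 3],
    [5, 4, 5, 4, 5, 5],
    [2, 2, 3, 2, 2, 3],
    [4, 4, 4, 3, 4, 4],
]

# Randomly divide the six items into two halves of three.
idx = list(range(6))
random.shuffle(idx)
half_a, half_b = idx[:3], idx[3:]

# Total score on each half for every person.
tot_a = [sum(person[i] for i in half_a) for person in scores]
tot_b = [sum(person[i] for i in half_b) for person in scores]

def corr(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (statistics.pstdev(a) * statistics.pstdev(b))

split_half = corr(tot_a, tot_b)  # the split-half reliability estimate
print(round(split_half, 2))
```

A different random split would give a slightly different estimate, which is exactly what motivates Cronbach’s Alpha below.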

Cronbach’s Alpha (a)

Imagine that we compute one split-half reliability and then randomly divide the items into another set of split halves and recompute, and keep doing this until we have computed all possible split-half estimates of reliability. Cronbach’s Alpha is mathematically equivalent to the average of all possible split-half estimates, although that’s not how we compute it. Notice that when I say we compute all possible split-half estimates, I don’t mean that each time we go and measure a new sample! That would take forever. Instead, we calculate all split-half estimates from the same sample. Because we measured all of our sample on each of the six items, all we have to do is have the computer analysis do the random subsets of items and compute the resulting correlations. The figure shows several of the split-half estimates for our six-item example and lists them as SH with a subscript. Just keep in mind that although Cronbach’s Alpha is equivalent to the average of all possible split-half correlations we would never actually calculate it that way. Some clever mathematician (Cronbach, I presume!) figured out a way to get the mathematical equivalent a lot more quickly.

[Figure: Cronbach’s Alpha]
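The quick way is usually computed from the item variances and the variance of the total score: alpha = k/(k−1) · (1 − Σvar(item_i)/var(total)). A sketch with hypothetical data (the same made-up 8-person, 6-item scores as above, not the figure’s example):

```python
import statistics

# Hypothetical scores: 8 people (rows) on a 6-item instrument (columns).
scores = [
    [4, 5, 4, 5, 4, 5],
    [2, 3, 2, 3, 3, 2],
    [5, 5, 4, 5, 5, 4],
    [1, 2, 1, 1, 2, 2],
    [3, 3, 3, 4, 3, 3],
    [5, 4, 5, 4, 5, 5],
    [2, 2, 3, 2, 2, 3],
    [4, 4, 4, 3, 4, 4],
]
k = 6  # number of items

item_vars = [statistics.pvariance(col) for col in zip(*scores)]
totals = [sum(row) for row in scores]
total_var = statistics.pvariance(totals)

# alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))  # high for these internally consistent items
```

When the items covary strongly, the total-score variance dwarfs the sum of the item variances and alpha approaches 1.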

Comparison of Reliability Estimators

Each of the reliability estimators has certain advantages and disadvantages. Inter-rater reliability is one of the best ways to estimate reliability when your measure is an observation. However, it requires multiple raters or observers. As an alternative, you could look at the correlation of ratings of the same single observer repeated on two different occasions. For example, let’s say you collected videotapes of child-mother interactions and had a rater code the videos for how often the mother smiled at the child. To establish inter-rater reliability you could take a sample of videos and have two raters code them independently. To estimate test-retest reliability you could have a single rater code the same videos on two different occasions. You might use the inter-rater approach especially if you were interested in using a team of raters and you wanted to establish that they yielded consistent results. If you get a suitably high inter-rater reliability you could then justify allowing them to work independently on coding different videos. You might use the test-retest approach when you only have a single rater and don’t want to train any others. On the other hand, in some studies it is reasonable to do both to help establish the reliability of the raters or observers.

The parallel forms estimator is typically only used in situations where you intend to use the two forms as alternate measures of the same thing. Both the parallel forms and all of the internal consistency estimators have one major constraint — you have to have multiple items designed to measure the same construct. This is relatively easy to achieve in certain contexts like achievement testing (it’s easy, for instance, to construct lots of similar addition problems for a math test), but for more complex or subjective constructs this can be a real challenge. If you do have lots of items, Cronbach’s Alpha tends to be the most frequently used estimate of internal consistency.

The test-retest estimator is especially feasible in most experimental and quasi-experimental designs that use a no-treatment control group. In these designs you always have a control group that is measured on two occasions (pretest and posttest). The main problem with this approach is that you don’t have any information about reliability until you collect the posttest and, if the reliability estimate is low, you’re pretty much sunk.

Each of the reliability estimators will give a different value for reliability. In general, the test-retest and inter-rater reliability estimates will be lower in value than the parallel forms and internal consistency ones because they involve measuring at different times or with different raters. Since reliability estimates are often used in statistical analyses of quasi-experimental designs (e.g., the analysis of the nonequivalent group design), the fact that different estimates can differ considerably makes the analysis even more complex.

 

Reliability & Validity

We often think of reliability and validity as separate ideas but, in fact, they’re related to each other. Here, I want to show you two ways you can think about their relationship.

One of my favorite metaphors for the relationship between reliability and validity is that of the target. Think of the center of the target as the concept that you are trying to measure. Imagine that for each person you are measuring, you are taking a shot at the target. If you measure the concept perfectly for a person, you are hitting the center of the target. If you don’t, you are missing the center. The more you are off for that person, the further you are from the center.

[Figure: reliability and validity as target shooting]

The figure above shows four possible situations. In the first one, you are hitting the target consistently, but you are missing the center of the target. That is, you are consistently and systematically measuring the wrong value for all respondents. This measure is reliable, but not valid (that is, it’s consistent but wrong). The second shows hits that are randomly spread across the target. You seldom hit the center of the target but, on average, you are getting the right answer for the group (but not very well for individuals). In this case, you get a valid group estimate, but you are inconsistent. Here, you can clearly see that reliability is directly related to the variability of your measure. The third scenario shows a case where your hits are spread across the target and you are consistently missing the center. Your measure in this case is neither reliable nor valid. Finally, we see the “Robin Hood” scenario — you consistently hit the center of the target. Your measure is both reliable and valid (I bet you never thought of Robin Hood in those terms before).

Another way we can think about the relationship between reliability and validity is shown in the figure below. Here, we set up a 2×2 table. The columns of the table indicate whether you are trying to measure the same or different concepts. The rows show whether you are using the same or different methods of measurement. Imagine that we have two concepts we would like to measure, student verbal and math ability. Furthermore, imagine that we can measure each of these in two ways. First, we can use a written, paper-and-pencil exam (very much like the SAT or GRE exams). Second, we can ask the student’s classroom teacher to give us a rating of the student’s ability based on their own classroom observation.

[Figure: the 2×2 reliability/validity comparison table]

The first cell on the upper left shows the comparison of the verbal written test score with the verbal written test score. But how can we compare the same measure with itself? We could do this by estimating the reliability of the written test through a test-retest correlation, parallel forms, or an internal consistency measure (See Types of Reliability). What we are estimating in this cell is the reliability of the measure.

The cell on the lower left shows a comparison of the verbal written measure with the verbal teacher observation rating. Because we are trying to measure the same concept, we are looking at convergent validity (See Measurement Validity Types).

The cell on the upper right shows the comparison of the verbal written exam with the math written exam. Here, we are comparing two different concepts (verbal versus math) and so we would expect the relationship to be lower than a comparison of the same concept with itself (e.g., verbal versus verbal or math versus math). Thus, we are trying to discriminate between two concepts and we would consider this discriminant validity.

Finally, we have the cell on the lower right. Here, we are comparing the verbal written exam with the math teacher observation rating. Like the cell on the upper right, we are also trying to compare two different concepts (verbal versus math) and so this is a discriminant validity estimate. But here, we are also trying to compare two different methods of measurement (written exam versus teacher observation rating). So, we’ll call this very discriminant to indicate that we would expect the relationship in this cell to be even lower than in the one above it.

The four cells incorporate the different values that we examine in the multitrait-multimethod approach to estimating construct validity.

When we look at reliability and validity in this way, we see that, rather than being distinct, they actually form a continuum. On one end is the situation where the concepts and methods of measurement are the same (reliability) and on the other is the situation where concepts and methods of measurement are different (very discriminant validity).

 

Estimation of the reliability of statistical research results

The need to estimate the reliability of the results obtained is determined by the scope of the research. In an exhaustive study (of the general population), when all units of observation are examined, only one value of a given index can be obtained. The general population is always reliable, because it includes all units of observation. Official statistics are an example of a general population.

The general population is rarely studied in medical-biological research; most studies are sample-based. The law of large numbers is the basis for forming a reliable sample. It states: with great confidence one can assert that, once a large number of observations is reached, the average of the characteristic studied in the sample will differ little from the average in the general population. A sample always has errors, because not all units of observation are included in the study. The reliability of a sample study depends on the size of this error: the greater the number of observations, the smaller the error and the smaller the random fluctuations of the index. Thus, to decrease the error, the number of observations must be increased.

The reliability (representativeness) of a statistical study is the degree to which the sample corresponds to the general population.

Basic criteria of reliability (representativeness):

1. Error of representativeness (m)

2. Confidence limits

3. The coefficient of reliability (Student’s criterion, t): the significance of the difference between average or relative values

1. The error of representativeness (m) is the degree of reliability of an average or relative value; it shows how much the results of a sample study differ from the results that would be obtained by an exhaustive study of the general population.

а) The error of an average value is determined by the formula:

m = σ/√n, if n > 30;  m = σ/√(n − 1), if n < 30,

where σ is the standard deviation and n is the number of observations in the sample. At a small number of observations (n < 30), n − 1 is used in the denominator in place of n.

Let’s turn to examples. As has already been said, the average value characterizes many variants with one number, but these variants can be distributed differently.

Table 2.9. Average duration of treatment of traumas in hospitals No. 1 and No. 2

Hospital No. 1                           Hospital No. 2
Variant (X1)  Number (n1)  X1·n1         Variant (X2)  Number (n2)  X2·n2
5             1            5             5             2            10
6             2            12            6             3            18
7             10           70            7             6            42
8             2            16            8             3            24
9             1            9             9             2            18
              Σn1 = 16     ΣX1n1 = 112                 Σn2 = 16     ΣX2n2 = 112

Average: M1 = ΣX1n1 / Σn1 = 112/16 = 7 days;  M2 = ΣX2n2 / Σn2 = 112/16 = 7 days.

 

So, in both hospitals the average duration of treatment of traumas is identical: 7 days. But it is clear that the average in hospital No. 1 reflects the essence of the matter better than in hospital No. 2. In the first hospital the variants are less dispersed around the average than in the second: 10 variants equal the average (in the second hospital only 6), while the extreme deviations (5 and 9 days) occur in one variant each in the first hospital and in two each in the second.

Now we shall determine the errors of both averages:

m1 = σ1/√(n1 − 1) ≈ 0.23 days,

m2 = σ2/√(n2 − 1) ≈ 0.31 days.

So, in 68.3 % of cases M will be equal to:

in the first case 7 ± 0.23 days;

in the second case 7 ± 0.31 days.

In 95.5 % of cases M will be equal to:

in the first case 7 ± 0.46 days;

in the second case 7 ± 0.62 days.

In the above-stated examples the average duration of treatment of traumas in the hospitals is identical (7 days), although the standard deviation testifies to how typical that average is for concrete cases. Such coincidences are rare.
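The hospital example can be reproduced in a few lines. Note that rounding, and whether σ itself is computed with n or n − 1 in the denominator, moves the third decimal: the chapter quotes 0.23 and 0.31 days, while this sketch (population σ, then m = σ/√(n − 1)) gives approximately 0.22 and 0.30.

```python
import math

# Frequency tables from Table 2.9: treatment duration (days) -> number of cases.
hosp1 = {5: 1, 6: 2, 7: 10, 8: 2, 9: 1}
hosp2 = {5: 2, 6: 3, 7: 6, 8: 3, 9: 2}

def mean_and_error(freq):
    n = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / n
    sigma = math.sqrt(sum(f * (x - mean) ** 2 for x, f in freq.items()) / n)
    m = sigma / math.sqrt(n - 1)  # n < 30, so n - 1 is used
    return mean, m

mean1, err1 = mean_and_error(hosp1)
mean2, err2 = mean_and_error(hosp2)
print(mean1, round(err1, 2))
print(mean2, round(err2, 2))
```

Both means come out to exactly 7 days, while the error of hospital No. 2 is larger, reflecting its greater dispersion around the average.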

 

б) The error of a relative value is determined by the formula:

m = √(pq/n), if n > 30;  m = √(pq/(n − 1)), if n < 30,

where p is the relative value (in % or ‰);

q = 100 − p (if in %) and q = 1000 − p (if in ‰);

n is the number of observations.

So, the error will be greatest when the alternative is split 0.5 to 0.5 (50 % to 50 %); in all other cases it will be smaller.

Let’s consider the above-stated example further. Instead of all 200 thousand city dwellers we select 200. Among them 150 diseases are revealed. So, the disease rate p will be equal to:

p = (150 / 200) · 100 % = 75 %,  q = 100 − 75 = 25 %.

Let’s determine m under different values of the reliability parameter t.

If t = 1, then according to the law of large numbers, in 68.3 % of cases the result received in the sample study will differ from the result of an exhaustive study by no more than ±m:

m = √(75 · 25 / 200) ≈ 3.1 %.

So, in 68.3 % of cases the rate will be 75 % ± 3.1 %. If t = 2, then in 95.5 % of cases the sample result will differ from p by no more than 2m, that is, it will be 75 % ± 6.2 %. This result is considered acceptable for statistical research in public health services.

So, we have two conclusions:

1. The result of a sample study may be accepted if the quotient of the result divided by its error equals 2 or more;

2. The necessary number of observations for obtaining a reliable result in a sample study can be determined by transforming the above formula:

n = t²pq / Δ², where Δ is the permissible error of the index.
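Both conclusions can be checked numerically for the example above (200 persons, p = 75 %). In this sketch the sample-size formula is applied with the confidence margin Δ = t·m = 6.2 %, which recovers a required n of about 195, close to the 200 actually taken.

```python
import math

p, q, n = 75.0, 25.0, 200          # percent values from the example

# Error of the relative value for n > 30: m = sqrt(p*q/n)
m = math.sqrt(p * q / n)
print(round(m, 1))  # → 3.1

# Quotient of the result divided by its error: acceptable if 2 or more
print(p / m >= 2)   # → True

# Required number of observations: n = t^2 * p * q / delta^2
t, delta = 2, 6.2                  # delta: permissible error at t = 2
n_required = t ** 2 * p * q / delta ** 2
print(round(n_required))  # → 195
```

Halving the permissible error Δ would quadruple the required number of observations, since n grows with 1/Δ².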

2. Confidence limits carry the properties of the sample over to the general population; they show the probable range within which the index lies in the general population, i.e., its minimal and maximal possible values.

а) Confidence limits of averages

Mgen = Msel ± tm

where Mgen is the expected average in the general population,

Msel is the average obtained in the sample study,

m is the error of the average (see the previous formula),

t is the criterion of reliability (confidence criterion).

б) Confidence limits of relative values

Pgen = Psel ± tm

where Pgen is the expected relative value in the general population,

Psel is the relative value obtained in the sample study,

m is the error of the relative value (see the previous formula),

t is the criterion of reliability (confidence criterion).

Value of the criterion of reliability (confidence criterion, t)

The size of the criterion t is chosen depending on the required probability of an error-free prognosis p and the number of observations in the sample:

а) If the number of observations is less than 30, t is determined from special tables.

б) If the number of observations is more than 30 and the planned probability of an error-free prognosis p is 95.5 % (the sample corresponds to the general population with 95.5 % probability), then t = 2. For most medical-biological and social studies, confidence limits set with a probability of an error-free prognosis of 95.5 % (t = 2) are considered reliable.

в) If the number of observations is more than 30 and the planned probability of an error-free prognosis p is 99.7 % (the sample corresponds to the general population with 99.7 % probability), then t = 3.

3. The coefficient of reliability (Student’s criterion, t) assesses the significance of the difference between average or relative values. Student’s criterion shows whether the corresponding indexes in two separate samples differ significantly.

а) Student’s criterion for average values:

t = (M1 − M2) / √(m1² + m2²)

where t is the criterion of reliability,

M1 is the average in the first statistical sample,

M2 is the corresponding average in the second statistical sample,

m1 is the error of the average in the first sample,

m2 is the error of the average in the second sample.

б) Student’s criterion for relative values:

t = (P1 − P2) / √(m1² + m2²)

where t is the criterion of reliability,

P1 is the relative index in the first statistical sample,

P2 is the corresponding relative index in the second statistical sample,

m1 is the error of the relative index in the first sample,

m2 is the error of the relative index in the second sample.

If the number of observations is less than 30, the value of t required for reliability is determined from tables. For example, for 27 observations the value of t must be no less than 2.77.

If the number of observations is more than 30, then for most medical-biological and social studies t must equal or exceed 2. At t < 2 the results are considered unreliable.

As a rule, parameters differ between themselves. The question is whether the difference is essential (that is, caused by objective reasons) or, on the contrary, insignificant.

For example, in hospital No. 1 the average duration of treatment of patients with hypertensive disease was 17 days (error of the parameter 1 day), and in hospital No. 2 it was 15 days (error of the parameter 0.5 days).

Is the difference essential? Do they really treat the same disease longer in hospital No. 1 than in hospital No. 2?

We find the answer according to the formula:

t = (M1 − M2) / √(m1² + m2²)

If t ≥ 2, the difference is, as a rule, essential; if t < 2, insignificant.
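For the hospital example this is a direct application of the formula:

```python
import math

# Hospital No. 1: mean 17 days, error 1 day; No. 2: mean 15 days, error 0.5 day.
M1, m1 = 17.0, 1.0
M2, m2 = 15.0, 0.5

# Student's criterion: t = (M1 - M2) / sqrt(m1^2 + m2^2)
t = (M1 - M2) / math.sqrt(m1 ** 2 + m2 ** 2)
print(round(t, 2))  # → 1.79
```

Since t ≈ 1.79 < 2, the observed two-day difference between the hospitals is statistically insignificant at the 95.5 % level.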

Estimation of the reliability of investigation results

I. Calculation of the errors of average and relative values

II. Calculation of the confidence limits of average and relative values

Calculation of the error of average values (mµ):

mµ = σ/√n (if n > 30)

mµ = σ/√(n − 1) (if n < 30)

Calculation of the error of relative values (mp):

mp = √(pq/n) (if n > 30)

mp = √(pq/(n − 1)) (if n < 30)

Calculation of the confidence limits of average values (µgen):

µgen = µsel ± tm

Calculation of the confidence limits of relative values (Pgen):

Pgen = Psel ± tm

 

III. Calculation of the significance of the difference between average and relative values

Significance of the difference of two average values:

t = (M1 − M2) / √(m1² + m2²)

(if t ≥ 2, the difference is significant)

Significance of the difference of two relative values:

t = (P1 − P2) / √(m1² + m2²)

(if t ≥ 2, the difference is significant)

 

Evaluation of the confidence of statistical investigation results

Confidence of average figures:

for n > 30: m = σ/√n

for n < 30: m = σ/√(n − 1)

Confidence of the difference between average figures: t = (M1 − M2) / √(m1² + m2²)

Confidence of relative figures:

for n > 30: m = √(pq/n), where q = 100 − p

for n < 30: m = √(pq/(n − 1))

Confidence of the difference between relative figures: t = (P1 − P2) / √(m1² + m2²)

The difference is considered unreliable if Student’s criterion t < 2. Probability of an error-free prognosis (p): for t = 2, p = 95.5 %; for t = 3, p = 99.7 %.


The Chi-Square Distribution

The statistical-inference procedures discussed in this chapter rely on a distribution called the chi-square distribution. Chi (pronounced “kī”) is a Greek letter whose lowercase form is χ. A variable has a chi-square distribution if its distribution has the shape of a special type of right-skewed curve, called a chi-square (χ2) curve. Actually, there are infinitely many chi-square distributions, and we identify the chi-square distribution (and χ2-curve) in question by its number of degrees of freedom, just as we did for t-distributions. Figure 12.1 shows three χ2-curves and illustrates some basic properties of χ2-curves.

Using the χ2-Table

Percentages (and probabilities) for a variable that has a chi-square distribution are equal to areas under its associated χ2-curve. To perform a chi-square test, we need to know how to find the χ2-value that has a specified area to its right. Table V in Appendix A provides χ2-values that correspond to several areas.

The χ2-table (Table V) is similar to the t-table (Table IV). The two outside columns of Table V, labeled df, display the number of degrees of freedom. As expected, the symbol χ2 α denotes the χ2-value that has area α to its right under a χ2-curve. Thus the column headed χ2 0.05, for example, contains χ2-values that have area 0.05 to their right.

EXAMPLE 12.1 Finding the χ2-Value Having a Specified Area to Its Right

For a χ2-curve with 12 degrees of freedom, find χ2 0.025; that is, find the χ2-value that has area 0.025 to its right, as shown in Fig. 12.2(a).

Solution

To find this χ2-value, we use Table V. The number of degrees of freedom is 12, so we first go down the outside columns, labeled df, to “12.” Then, going across that row to the column labeled χ2 0.025, we reach 23.337. This number is the χ2-value having area 0.025 to its right, as shown in Fig. 12.2(b). In other words, for a χ2-curve with df = 12, χ2 0.025 = 23.337.
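The table lookup can be verified numerically. For even degrees of freedom, the chi-square right-tail area has a closed form, P(X > x) = e^(−x/2) Σ (x/2)^j / j! summed for j = 0, …, df/2 − 1, which a short pure-Python sketch can exploit:

```python
import math

def chi2_right_tail_even_df(x, df):
    """Right-tail area under a chi-square curve for EVEN df, via the
    closed-form series P(X > x) = exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df % 2 != 0:
        raise ValueError("this closed form holds for even df only")
    half = x / 2
    total = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * total

# Check the lookup of Example 12.1: for df = 12, the value 23.337
# should leave area 0.025 to its right
print(round(chi2_right_tail_even_df(23.337, 12), 3))   # 0.025
```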

 

Chi-Square Goodness-of-Fit Test

Our first chi-square procedure is called the chi-square goodness-of-fit test. We can use this procedure to perform a hypothesis test about the distribution of a qualitative (categorical) variable or a discrete quantitative variable that has only finitely many possible values. We introduce and explain the reasoning behind the chi-square goodness-of-fit test next.

EXAMPLE 12.2 Introduces the Chi-Square Goodness-of-Fit Test

Violent Crimes The Federal Bureau of Investigation (FBI) compiles data on crimes and crime rates and publishes the information in Crime in the United States. A violent crime is classified by the FBI as murder, forcible rape, robbery, or aggravated assault. Table 12.1 gives a relative-frequency distribution for (reported) violent crimes in 2000. For instance, in 2000, 28.6% of violent crimes were robberies.

A simple random sample of 500 violent-crime reports from last year yielded the frequency distribution shown in Table 12.2. Suppose that we want to use the data in Tables 12.1 and 12.2 to decide whether last year’s distribution of violent crimes has changed from the 2000 distribution.

a. Formulate the problem statistically by posing it as a hypothesis test.

b. Explain the basic idea for carrying out the hypothesis test.

c. Discuss the details for making a decision concerning the hypothesis test.

Solution

a.                The population is last year’s (reported) violent crimes. The variable is “type of violent crime,” and its possible values are murder, forcible rape, robbery, and aggravated assault. We want to perform the following hypothesis test.

H0: Last year’s violent-crime distribution is the same as the 2000 distribution.

Ha: Last year’s violent-crime distribution is different from the 2000 distribution.

b.                The idea behind the chi-square goodness-of-fit test is to compare the observed frequencies in the second column of Table 12.2 to the frequencies that would be expected—the expected frequencies—if last year’s violent-crime distribution is the same as the 2000 distribution. If the observed and expected frequencies match fairly well (i.e., each observed frequency is roughly equal to its corresponding expected frequency), we do not reject the null hypothesis; otherwise, we reject the null hypothesis.

c.                 To formulate a precise procedure for carrying out the hypothesis test, we need to answer two questions:

1. What frequencies should we expect from a random sample of 500 violent-crime reports from last year if last year’s violent-crime distribution is the same as the 2000 distribution?

2. How do we decide whether the observed and expected frequencies match fairly well?

The first question is easy to answer, which we illustrate with robberies. If last year’s violent-crime distribution is the same as the 2000 distribution, then, according to Table 12.1, 28.6% of last year’s violent crimes would have been robberies. Therefore, in a random sample of 500 violent-crime reports from last year, we would expect about 28.6% of the 500 to be robberies. In other words, we would expect the number of robberies to be 500·0.286, or 143.

In general, we compute each expected frequency, denoted E, by using the Formula

E = np,

where n is the sample size and p is the appropriate relative frequency from the second column of Table 12.1. Using this formula, we calculated the expected frequencies for all four types of violent crime. The results are displayed in the second column of Table 12.3.

The second column of Table 12.3 answers the first question. It gives the frequencies that we would expect if last year’s violent-crime distribution is the same as the 2000 distribution.

The second question—whether the observed and expected frequencies match fairly well—is harder to answer. We need to calculate a number that measures the goodness of fit.

In Table 12.4, the second column repeats the observed frequencies from the second column of Table 12.2. The third column of Table 12.4 repeats the expected frequencies from the second column of Table 12.3.

To measure the goodness of fit of the observed and expected frequencies, we look at the differences, O − E, shown in the fourth column of Table 12.4. Summing these differences to obtain a measure of goodness of fit isn’t very useful because the sum is 0. Instead, we square each difference (shown in the fifth column) and then divide by the corresponding expected frequency. Doing so gives the values (O − E)²/E, called chi-square subtotals, shown in the sixth column. The sum of the chi-square subtotals,

∑(O − E)²/E = 3.555,

is the statistic used to measure the goodness of fit of the observed and expected frequencies. (Using subscripts alone or both subscripts and indices, we would write ∑(O − E)²/E as

∑(Oi − Ei)²/Ei or ∑i=1..c (Oi − Ei)²/Ei,

where c denotes the number of possible values for the variable, in this case, four (c = 4). However, because no confusion can arise, we use the simpler notation without subscripts or indices.)

If the null hypothesis is true, the observed and expected frequencies should be roughly equal, resulting in a small value of the test statistic, ∑(O − E)²/E. In other words, large values of ∑(O − E)²/E provide evidence against the null hypothesis.

As we have seen, ∑(O − E)²/E = 3.555. Can this value be reasonably attributed to sampling error, or is it large enough to suggest that the null hypothesis is false? To answer this question, we need to know the distribution of the test statistic ∑(O − E)²/E. First we present the formula for expected frequencies in a chi-square goodness-of-fit test, as discussed in the preceding example, and then we provide the distribution of the test statistic for a chi-square goodness-of-fit test.
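As a concrete illustration, the statistic can be computed directly. In the sketch below, only the 28.6% robbery share comes from the text; the remaining relative frequencies and all observed counts are hypothetical placeholders, so the resulting statistic differs from the 3.555 of the example:

```python
# Chi-square goodness-of-fit statistic, sum of (O - E)^2 / E.
# The 0.286 robbery share is from the text; everything else is hypothetical.
rel_freq = {"murder": 0.012, "forcible rape": 0.063,
            "robbery": 0.286, "aggravated assault": 0.639}
observed = {"murder": 9, "forcible rape": 26,
            "robbery": 144, "aggravated assault": 321}
n = sum(observed.values())            # sample size (here 500)

chi_square = 0.0
for crime, p in rel_freq.items():
    expected = n * p                  # expected frequency, E = n p
    chi_square += (observed[crime] - expected) ** 2 / expected

print(round(chi_square, 3))
```

A large value of the statistic, compared against a χ²-curve with c − 1 degrees of freedom, is evidence against the null hypothesis.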

 

 

Essence of correlation

Correlation is a measure of mutual correspondence between two variables and is denoted by the coefficient of correlation.

Applications and characteristics

a) The simple correlation coefficient, also called the Pearson’s product-moment correlation coefficient, is used to indicate the extent that two variables change with one another in a linear fashion.

b) The correlation coefficient can range from −1 to +1 and is unitless (Fig. A, B, C).

c) When the correlation coefficient approaches – 1, a change in one variable is more highly, or strongly, associated with an inverse linear change (i.e., a change in the opposite direction) in the other variable (Fig.A).

d) When the correlation coefficient equals zero, there is no association between the changes of the two variables (Fig. B).

e) When the correlation coefficient approaches +1, a change in one variable is more highly, or strongly, associated with a direct linear change in the other variable (Fig. C).

A correlation coefficient can be calculated validly only when both variables are subject to random sampling and each is chosen independently.

Although useful as one of the determinants of scientific causality, correlation by itself is not equivalent to causation.

For example, two correlated variables may be associated with another factor that causes them to appear correlated with each other.

A correlation may appear strong but be insignificant because of a small sample size.


Table  2.12

 

Correlation connection

There are the following types of connection (relation) between phenomena and signs in nature:

a) the cause-and-effect connection, i.e., the connection between factors and phenomena, between factor and result signs;

b) the dependence of parallel changes of several signs on some third quantity.

The quantitative types of connection are: the functional connection, in which a strictly defined value of the second sign corresponds to any value of the first sign (for example, a definite area of a circle corresponds to each radius of the circle); and the correlation connection, in which several values of one sign correspond, on average, to each value of the other sign (for example, it is known that the height and body mass of a person are linked: in a group of persons of identical height there are different values of body mass, yet these values vary within certain limits around their average).

Correlation is a concept that means the interconnection between signs.

A correlative connection presupposes a dependence between phenomena that does not have a clear functional character.

A correlative connection shows up only in a mass of observations, i.e., in a totality. Establishing a correlative connection presupposes revealing the causal connection that confirms the dependence of one phenomenon on the other.

By direction (character), a correlative connection can be direct or reverse. A correlation coefficient characterizing a direct connection is marked with a plus sign (+), and one characterizing a reverse connection with a minus sign (−).

By force, a correlative connection can be strong, middle, or weak; it can be complete, or it can be absent.


THE SCHEME OF THE ESTIMATION OF CORRELATIVE CONNECTION BY THE COEFFICIENT OF CORRELATION

The force of connection      Direct (+)             Reverse (−)
Complete                     +1                     −1
Strong                       from +1 to +0.7        from −1 to −0.7
Middle                       from +0.7 to +0.3      from −0.7 to −0.3
Weak                         from +0.3 to 0         from −0.3 to 0
The connection is absent     0                      0
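The scheme above can be expressed as a small helper function; the thresholds follow the table directly:

```python
def classify_correlation(r):
    """Classify a correlation coefficient by direction and force,
    following the estimation scheme in the table above."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie in [-1, +1]")
    direction = "direct" if r > 0 else "reverse" if r < 0 else "none"
    a = abs(r)
    if a == 1:
        force = "complete"
    elif a >= 0.7:
        force = "strong"
    elif a >= 0.3:
        force = "middle"
    elif a > 0:
        force = "weak"
    else:
        force = "absent"
    return direction, force

print(classify_correlation(-0.8))   # ('reverse', 'strong')
print(classify_correlation(0.25))   # ('direct', 'weak')
```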

 

The correlative connection can be:

1. By direction:

– direct (+): as one sign increases, the average value of the other increases;

– reverse (−): as one sign increases, the average value of the other decreases.

2. By character:

– rectilinear: relatively even changes of the average values of one sign are accompanied by equal changes of the other (for example, minimal and maximal arterial pressure);

– curvilinear: with an even change of one sign, the average values of the other sign may increase or decrease.

Methods of determination of the coefficient of correlation

The coefficient of correlation (rxy) by a single number gives a picture of the direction and force of the connection between the explored phenomena.

The method of squares (Pearson's method) is most frequently used for the determination of the coefficient of correlation:

rxy = ∑(dx · dy) / √(∑dx² · ∑dy²), where:

x and y are the signs between which the connection is determined;

dx and dy are the deviations of each variant from the arithmetic means calculated in the series of sign x and of sign y (Mx and My);

∑ is the summation sign.
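A minimal sketch of the method of squares, using hypothetical height/mass data:

```python
import math

def pearson_r(x, y):
    """Method of squares (Pearson): r = sum(dx*dy) / sqrt(sum(dx^2)*sum(dy^2))."""
    mx = sum(x) / len(x)              # Mx, arithmetic mean of sign x
    my = sum(y) / len(y)              # My, arithmetic mean of sign y
    dx = [v - mx for v in x]          # deviations from the mean
    dy = [v - my for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den

# Hypothetical data: height (cm) and body mass (kg) of five persons
height = [160, 165, 170, 175, 180]
mass = [55, 60, 63, 70, 74]
print(round(pearson_r(height, mass), 3))   # ≈ 0.994, a strong direct connection
```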


The second method of determination of the coefficient of correlation is the method of grades (ranks), or Spearman's method. It is used when n < 30 and when approximate information about the character (direction) and force of the connection is sufficient.

rxy = 1 − 6∑d² / (n(n² − 1)), where:

x and y are the signs between which the connection is determined,

6 is a constant coefficient,

d is the difference of grades (ranks),

n is the number of observations.
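A minimal sketch of the rank method (without correction for tied ranks), on the same kind of hypothetical data:

```python
def spearman_r(x, y):
    """Method of grades (Spearman): r = 1 - 6*sum(d^2) / (n*(n^2 - 1)).
    Simple ranking without tie correction, for illustration only."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical height/mass pairs: both rise together, so the ranks agree
print(spearman_r([160, 165, 170, 175, 180], [55, 60, 63, 70, 74]))   # 1.0
```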

The determination of the error of the coefficient of rank correlation (determined by Spearman's method) and of the criterion t:

mr = √((1 − r²)/(n − 2))  and  t = r / mr

The criterion t must be 2 or more, so that P = 95.5% or more.
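Assuming the standard formulas for the error of the rank correlation coefficient, mr = √((1 − r²)/(n − 2)) and t = r/mr, a quick sketch (r = 0.8 and n = 12 are hypothetical values):

```python
import math

def spearman_error_and_t(r, n):
    """Error of the rank correlation coefficient and Student's criterion t,
    assuming m_r = sqrt((1 - r^2)/(n - 2)) and t = r / m_r."""
    m_r = math.sqrt((1 - r ** 2) / (n - 2))
    return m_r, r / m_r

m_r, t = spearman_error_and_t(0.8, 12)
print(round(m_r, 3), round(t, 2))   # t >= 2 means P >= 95.5%
```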

Confidence of the correlation coefficient

The criterion t should be ≥ 3, which corresponds to a probability of error-free prognosis (p) ≥ 99.7%.

Student’s tests (t), based on the t distribution, which reflects greater variation due to chance than the normal distribution, are used to analyze small samples.

The t distribution is a continuous, symmetrical, unimodal distribution of infinite range; it is bell-shaped, similar to the normal distribution, but more spread out.

As the sample size increases, the t distribution closely resembles the normal distribution. At infinite degrees of freedom, the t and normal distributions are identical, and the t values equal the critical ratio values.

Table 2.13. Table of critical ratios (abbreviated)

                     Probability that the value lies
Critical ratio   between the mean and     within ± the       outside ± the
                 the critical ratio       critical ratio     critical ratio
1.0              .341                     .683               .317
1.645            .450                     .900               .100
1.96             .475                     .950               .050
2.0              .477                     .954               .046
2.567            .495                     .990               .010
3.0              .499                     .997               .003

Fig.  2.   The standardized normal distribution shown with the percentage of values included be­tween critical ratios from the mean.

A. Student’s test for a single small sample

Student’s t test for a single small sample compares a single sample with a population.

Student’s t tests are used to evaluate the null hypothesis for continuous variables for sample sizes less than 30.
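A sketch of the single-sample t computation, on hypothetical systolic blood-pressure data:

```python
import math

def one_sample_t(sample, population_mean):
    """Student's t for a single small sample vs. a population mean:
    t = (M - mu) / (s / sqrt(n)), with df = n - 1 degrees of freedom."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((v - m) ** 2 for v in sample) / (n - 1))  # sample SD
    t = (m - population_mean) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical data: systolic pressures of 9 patients vs. population mean 120
t, df = one_sample_t([126, 130, 122, 118, 125, 128, 124, 121, 127], 120)
print(round(t, 2), df)
```

The resulting t is then compared against the t table row for the computed degrees of freedom.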

The t table.

Probability values are derived from the t value and the number of degrees of freedom by using the t table. For each degree of freedom, a row of increasing t values corresponds to a row of decreasing probabilities of accepting the null hypothesis.

Confidence intervals.

In small samples, especially sample sizes less than 30, the t distribution is used to calculate confidence intervals around the sample mean.
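For example, a 95% confidence interval around a hypothetical small-sample mean, using a t value taken from a t table (df = 9, two-sided .05 column gives t = 2.26):

```python
import math

# Hypothetical sample: mean M = 124.5, sample SD s = 3.7, n = 10 (df = 9)
m, s, n, t = 124.5, 3.7, 10, 2.26
sem = s / math.sqrt(n)               # standard error of the mean
low, high = m - t * sem, m + t * sem
print(round(low, 2), round(high, 2))
```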

 

The t table (abbreviated)

                          Probability
Degrees of freedom (df)   .10      .05      .01
1                         6.31     12.71    63.66
2                         2.92     4.30     9.93
8                         1.86     2.31     3.36
9                         1.83     2.26     3.25
10                        1.81     2.23     3.17
∞                         1.64     1.96     2.58

 

The method of grade (rank) correlation is used when: there is a small number of observations; exact calculations are not needed; the variation series are open-ended; or the signs are expressed verbally (for example, the diagnosis of a disease).

The order of determination of the grade correlation coefficient:

1) make the variation series from the paired signs;

2) replace every value of the variants by a grade (rank) number;

3) define the difference of grades: d = x − y;

4) square the difference of grades: d²;

5) get the sum of the squared differences of grades: ∑d²;

6) define rxy by the formula;

7) define the direction and force of the connection;

8) define the error mrxy and the criterion t, and estimate the authenticity of the error-free prognosis p.

 

References:

1.     David Machin. Medical statistics: a textbook for the health sciences / David Machin, Michael J. Campbell, Stephen J Walters. – John Wiley & Sons, Ltd., 2007. – 346 p.

2.     Nathan Tintle. Introduction to statistical investigations / Nathan Tintle, Beth Chance, George Cobb, Allan Rossman, Soma Roy, Todd Swanson, Jill VanderStoep. – UCSD BIEB100, Winter 2013. – 540 p.

3.     Armitage P. Statistical Methods in Medical Research / P. Armitage, G. Berry, J. Matthews. – Blackwell Science, 2002. – 826 p.

4.     Larry Winner. Introduction to Biostatistics / Larry Winner. – Department of Statistics University of Florida, July 8, 2004. – 204 p.

5.     Weiss N. A. (Neil A.) Elementary statistics / Neil A. Weiss; biographies by Carol A. Weiss. – 8th ed., 2012. – 774 p.

 

 
