DIRECT METHOD OF STANDARDIZATION OF INDICES.
Standardization
Comparison of indices in totalities that differ in their structure requires standardization, that is, an adjustment in which the structure of the totalities is reduced to a single standard.
The following quantities are used in medical statistics:
· absolute – represent the absolute size of the phenomenon or of the environment
· average – represent the distribution of a characteristic of the variant type
· relative – represent the distribution of a characteristic of the alternative type
Intensive index – shows the level, or spread, of a phenomenon; it is used to compare two or more statistical totalities that differ in size.
II = (absolute size of the phenomenon / absolute size of the environment) × 1000
Example: environment – 11 students (statistical totality)
phenomenon (morbidity): caries – 5 students
goitre – 3 students
gastritis – 4 students
II (caries) = 5 / 11 × 1000 ≈ 454.5 ‰; II (goitre) = 3 / 11 × 1000 ≈ 272.7 ‰; II (gastritis) = 4 / 11 × 1000 ≈ 363.6 ‰
environment – 40 schoolboys
morbidity – 15 schoolboys; II = 15 / 40 × 1000 = 375 ‰
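The intensive index formula above can be sketched in a few lines of Python; the data are the two worked examples from the text, and the function name is ours, chosen for illustration.

```python
# Intensive index: (absolute size of the phenomenon / absolute size of
# the environment) x base, conventionally expressed per 1000 (per mille).
def intensive_index(phenomenon, environment, base=1000):
    """Return the intensive index of `phenomenon` per `base` of `environment`."""
    return phenomenon / environment * base

# The worked examples from the text: 11 students, then 40 schoolboys.
caries = intensive_index(5, 11)      # ~454.5 per mille
goitre = intensive_index(3, 11)      # ~272.7 per mille
morbidity = intensive_index(15, 40)  # 375.0 per mille
```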
The method of standardization is used when the environments are heterogeneous (by age, sex, etc.).
Standardization is the method of calculation of conditional (standardized) indices.
The essence of the method consists in calculating conditional (standardized) indices, which substitute for intensive or other quantities in those cases when direct comparison of the indices is complicated by differences in group structure.
The standardized indices are conditional because they indicate what the indices would be if the factor that interferes with their comparison were absent; the method thereby removes the influence of that factor from the true (real) indices. Standardized indices can be used only for comparison, because they give no idea of the real size of the phenomenon.
There are different methods of calculating standardized indices. The most widespread is the direct one.
The direct method of standardization is used in cases of:
a) considerable divergence of the group-level indices (for example, different levels of lethality in hospitals or departments, different levels of morbidity in men and women, and others);
b) considerable heterogeneity of the totalities being compared.
The standardized indices show what the true indices would be if the influence of a certain factor were absent. They make it possible to remove that influence from the comparison.
Name the stages of the direct method of standardization.
Stage I – calculation of the general intensive indices (or averages) for the pair of totalities being compared;
Stage II – choice and calculation of the standard; most frequently the half-sum of the two groups (totalities) being compared is taken as the standard;
Stage III – calculation of the "expected quantities" in every group of the standard;
Stage IV – determination of the standardized indices;
Stage V – comparison of the groups by their intensive and standardized indices.
Conclusions.
In the conclusions it must be noted that the standardized index is a conditional index; it answers only the question of what the level of the studied phenomenon would be if the conditions of its origin were standard.
The ordinary intensive indices characterize the true level and frequency of the phenomenon, whereas the standardized indices are conditional and may change depending on the standard chosen.
The stages of direct method of standardization:
1. Calculation of general intensive (or average) indices in compared groups.
2. Choice and calculation of the standard.
3. Calculation of “expected” figures in every group of the standard.
4. Determination of standardized indices.
5. Comparison of the ordinary intensive and the standardized indices. Conclusions.
Usage of standardized indices:
1. Comparative evaluation of demographic indices in different age and social groups.
2. Comparative analysis of morbidity in different age and social groups.
3. Comparative evaluation of treatment quality in hospitals with different content of patients in departments.
Table 2.14
Example. Average duration of treatment in the hospitals

Department    | Hospital №1                                  | Hospital №2
              | Patients | Bed-days | Treatment period (days)| Patients | Bed-days | Treatment period (days)
Therapeutic   |   2100   |  33180   |  15.8                  |    970   |  16296   |  16.8
Surgical      |    560   |   5320   |   9.5                  |    990   |   9702   |   9.8
Gynecologic   |    580   |   4060   |   7.0                  |   1020   |   7650   |   7.5
Total         |   3240   |  42560   |  13.1                  |   2980   |  33648   |  11.3
As we see, the average period of treatment in hospital №2 is much lower than in hospital №1. But analysis of these indices in the separate departments shows that this conclusion is inaccurate.
In hospital №1 therapeutic patients prevail, while in hospital №2 gynecologic patients do, and the treatment periods of these groups differ essentially.
Definition of the standard

Department    | Hospital №1         | Hospital №2         | Standard
              | Patients |    %     | Patients |    %     | Patients |    %
Therapeutic   |   2100   |   64.8   |    970   |   32.6   |   3070   |   49.4
Surgical      |    560   |   17.3   |    990   |   33.2   |   1550   |   24.9
Gynecologic   |    580   |   17.9   |   1020   |   34.2   |   1600   |   25.7
Total         |   3240   |  100.0   |   2980   |  100.0   |   6220   |  100.0
Let’s determine the average duration of treatment in both hospitals, assuming the structure of hospitalized patients were identical.
Department    | Standard distribution | "Expected" figures (standard share × treatment period)
              | of patients (%)       | Hospital №1                | Hospital №2
Therapeutic   |  49.4                 | 49.4 × 15.8 / 100 = 7.8    | 49.4 × 16.8 / 100 = 8.3
Surgical      |  24.9                 | 24.9 × 9.5 / 100 = 2.4     | 24.9 × 9.8 / 100 = 2.4
Gynecologic   |  25.7                 | 25.7 × 7.0 / 100 = 1.8     | 25.7 × 7.5 / 100 = 1.9
Total         | 100.0                 | standardized index: 12.0   | standardized index: 12.6
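The five stages of the direct method can be collected into a minimal Python sketch. The figures are those of hospitals №1 and №2 from the tables; intermediate values are rounded to one decimal place, as in the hand calculation, so the sketch reproduces the standardized indices 12.0 and 12.6.

```python
# Direct standardization of the average treatment duration.
groups = ["Therapeutic", "Surgical", "Gynecologic"]

patients_1 = {"Therapeutic": 2100, "Surgical": 560, "Gynecologic": 580}
patients_2 = {"Therapeutic": 970, "Surgical": 990, "Gynecologic": 1020}
duration_1 = {"Therapeutic": 15.8, "Surgical": 9.5, "Gynecologic": 7.0}   # days
duration_2 = {"Therapeutic": 16.8, "Surgical": 9.8, "Gynecologic": 7.5}

# Stage II: the standard is the combined structure of both hospitals, in %.
total = sum(patients_1.values()) + sum(patients_2.values())               # 6220
standard = {g: round((patients_1[g] + patients_2[g]) / total * 100, 1)
            for g in groups}

# Stage III: "expected" figures; Stage IV: their sum is the standardized index.
std_index_1 = sum(round(standard[g] * duration_1[g] / 100, 1) for g in groups)
std_index_2 = sum(round(standard[g] * duration_2[g] / 100, 1) for g in groups)
print(round(std_index_1, 1), round(std_index_2, 1))  # 12.0 12.6
```

Stage V is the comparison: by the crude averages hospital №2 looks better (11.3 vs. 13.1 days), but with an identical patient structure its standardized index (12.6) is higher than that of hospital №1 (12.0).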
To understand norms and statistical assessment one first needs to understand standardization. Standardization is the process of testing a group of people to see the scores that are typically attained. With a standardized test, a participant can see where his or her score falls relative to the standardization group’s performance. With standardization, the normative group must reflect the population for which the test was designed. The group’s performance is the basis for the test’s norms.
Many major psychological measures are norm-based, meaning that the score for an individual is interpreted by comparing his/her score with the scores of a group of people who define the norms for the test. Sir Francis Galton developed the logic for norm-based testing in the mid 1800s.
To organize and summarize data for normative purposes, begin by grouping the data into a frequency distribution. The information provided by frequency distributions can be presented graphically in the form of a bell-shaped normal distribution curve, as long as it approximates that symmetrical form. A group of scores can be summarized by a measure of central tendency. The most familiar of these measures is the arithmetic average, more technically known as the mean (M), which is found by adding all the scores (X) and dividing the sum by the total number of items (N): M = ∑X/N.
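The grouping and the mean formula M = ∑X/N can be sketched as follows; the ten scores are invented for illustration.

```python
from collections import Counter

# Ten hypothetical test scores, grouped into a frequency distribution,
# with the mean computed as M = sum(X) / N.
scores = [85, 90, 85, 100, 95, 90, 85, 110, 100, 90]

freq = Counter(scores)            # frequency distribution, e.g. freq[85] == 3
mean = sum(scores) / len(scores)  # M = sum(X) / N
print(mean)  # 93.0
```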
To create a successful, accepted IQ or aptitude test, the following three factors must be considered. Without them, a test will not give accurate, reproducible results.
Standardization
The standardization of a test is its normalization, that is, the finding of certain normal scores when the test is given to a pilot group similar to the people who will take the test. The process of finding norms also requires that the scores fall in a normal distribution – the number of questions answered right or wrong must differ from person to person in a large sample. After the mean, the norms, and the standard deviation are found, the process of standardization should make the test results fall into a rough bell curve.
Without standardization, all of the people taking a test could end up with all wrong or all correct answers, making it impossible to distinguish between them. Also, without knowing the statistical distribution, the test grader would not be able to find the percentile, and thus the score, of a person answering a certain number of questions correctly. Since most modern tests, such as the Wechsler Adult Intelligence Scale and the SAT I, are based on a statistical method of grading, standardization is crucial for finding a tester’s score.
Direct standardization
There are two basic methods of standardization, or age-adjustment; both were introduced in the 19th century. These two methods have become known as the direct and indirect methods. (Indirect standardization is discussed in a later section.) When the direct standardization method is applied to ASDRs, the resultant summary index is called the ADR. Two assumptions are made when this index is computed for a population: The population’s observed age-specific rates are assumed to be valid, and the age distribution of the population is assumed to be that of a standard, or reference, population.
Table B illustrates the calculation of the ADR using the hypothetical data from table A. Specific computational formulae for the ADR are given in the technical appendix.
To calculate the ADR, the standard population and the age-specific death rate for each age interval are multiplied and these products are summed. In this example, the total for community A is 420. This sum is divided by the total standard population (10,000 in this case) to obtain the ADR. As with crude rates, the ADR is usually expressed in terms of a rate per 1,000 or per 100,000 population. Thus, the ADR for community A is 42 deaths per 1,000 population and the ADR for community B is 52 per 1,000. Note that, although the crude rate for community A was larger than that for community B, the ADR for community A is smaller than the ADR for community B. This is consistent with each of the age-specific rates for community A being smaller than those of community B.
Because of the method of computation, the age-adjusted rate is often interpreted as the hypothetical death rate that would have occurred if the observed age-specific rates were present in a population whose age distribution is that of the standard population. It is very important to realize that the ADR is an artificial measure whose absolute value has no intrinsic meaning. The ADR is useful for comparison purposes only, not to measure absolute magnitude. (To compare absolute magnitude, crude rates are used.) It is also important to note that in order to compare two age-adjusted rates, the same standard population must have been used.
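The ADR computation described above can be sketched in Python. Tables A and B are not reproduced here, so the age groups and age-specific rates below are hypothetical, chosen so that the totals match the community A figures quoted in the text (420 expected deaths over a standard population of 10,000, i.e., an ADR of 42 per 1,000).

```python
# Direct method: multiply the standard population in each age interval by
# that interval's age-specific death rate, sum the products, and divide by
# the total standard population. Rates here are deaths per 1,000 population.
standard_pop = {"<25": 4000, "25-64": 4000, "65+": 2000}   # sums to 10,000
asdr_a = {"<25": 10, "25-64": 40, "65+": 110}              # hypothetical ASDRs

expected_deaths = sum(standard_pop[age] * asdr_a[age] / 1000
                      for age in standard_pop)             # 40 + 160 + 220 = 420
adr_per_1000 = expected_deaths * 1000 / sum(standard_pop.values())
print(expected_deaths, adr_per_1000)  # 420.0 42.0
```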
Selection of a standard population
After the decision to use an ADR is made, the standard population must be selected. There are two basic types of standard populations, internal and external. Internal standards are created from the data to be used in the analysis; for example, the average age distribution of all populations to be compared. The use of an internal standard has certain statistical advantages for the ADR. However, if an internal standard is used, the results cannot be directly compared to other studies that use adjusted rates computed using a different standard population.
External standards are standard populations drawn from sources outside the analysis. For example, the National Center for Health Statistics (NCHS) typically uses a standard based on the 1940 United States population. This U.S. standard population is usually given in terms of a ‘‘standard million’’ in 10-year age groups. The U.S. standard million population is presented in appendix table I. A specific example of age-adjustment for one of the Healthy People 2000 mortality objectives using the standard million and 10-year ASDR is given in appendix table II. The calculation of the variances of the age-adjusted rates in table II is shown in appendix table III.
NCHS publishes a large number of ADRs based on the U.S. standard population. This standard is also used to track those mortality objectives in Healthy People 2000 and those Health Status Indicators which are monitored with age-adjusted rates. Several States use the same standard population in their publications. As long as the identical standard is used, ADRs from various national and State publications can be compared. But, if different standard populations are used to compute the ADR, then these ADRs are not comparable. Thus, there are considerable advantages to using the U.S. standard when computing State and local ADRs.
In recent years there have been discussions about whether the 1940 standard should be supplanted by a more contemporary ‘‘standard’’ that more closely reflects the U.S. population’s current (or future) age distribution. In considering this issue, it is important to remember that the actual magnitude of the ADR is beside the point. The ADR is an index number used for relative comparisons that should not be affected by the choice of a standard. (See section on ‘‘When not to adjust.’’) Although the magnitude of the ADRs may be greatly affected by the choice of a standard population, relative mortality, as measured by trends, race ratios, and sex ratios, is generally unaffected.
Despite this, controversy continues over which standard population to use when age adjusting death rates to measure temporal changes in cause-specific mortality. Examination of the issue shows that standard populations generally yield only a small effect on trend comparisons by cause of death. Thus, any standard population is adequate so long as comparison populations are not very ‘‘unusual’’ or ‘‘abnormal’’ with respect to the population under study. This means that the age distribution of the standard population should be somewhat similar to the population of interest.
Small number issues
One problem with ADR is that rates based on small numbers of deaths will exhibit a large amount of random variation. (See the technical appendix for more detail.) Therefore, if the number of deaths is small, mortality data should be aggregated over a number of years, or several small geographic areas must be combined into larger areas before computing the ADR (11). A very rough guideline is that there should be at least 25 total deaths over all age groups.
When not to adjust
The general consensus of the scientific literature is that, if it is appropriate to standardize, then the selection of the standard population should not affect relative comparisons. However, standardization is not appropriate when age-specific death rates in the populations being compared do not have a consistent relationship.
For example, evaluating trends in age-adjusted cancer death rates over time can be difficult because the ASDRs for younger ages have been decreasing while death rates at older ages are increasing. If a relatively young standard population is used, the trend in ADR may show a small increase or even a decrease; if a relatively older standard population is used, cancer mortality shows a much larger increase. Thus, using a more current (i.e., older) population than the 1940 standard, such as the 1990 U.S. population, as a standard population yields a much greater increase in the cancer mortality trend curve than does an analysis using the 1940 standard. Under these circumstances, a single summary measure is likely to be inappropriate for describing trends over time. Here, one should not use ADR, but should look at trends among ASDRs.
This does not mean it is inappropriate to publish ADR for cancer mortality. Within a defined time interval, e.g., 1990, geographic or race-sex comparisons may still be appropriate. It can be noted that when age-adjusted rates are computed using two distinct standards and the comparisons differ, then it is not appropriate to standardize in the first place. Again, only age-specific comparisons may be valid. Kitagawa illustrates this situation with an example of the mortality of white males living in metropolitan counties compared with those residing in nonmetropolitan counties in 1960. In this case, ASDRs for white males under age 40 were lower in metropolitan counties than in nonmetropolitan counties. After age 40, the reverse was true. A summary index, such as the ADR, does not adequately describe the mortality differentials in the two groups. In cases such as these, the ADR is an imprecise indicator of mortality; the age-specific comparisons would be a better choice.
Indirect standardization
Because of concerns with the use of ADR, some mortality analysts prefer indirect standardized rates. Indirect standardization is generally thought of as an approximation to direct standardization. That is, when data needed to compute a direct measure (e.g., ASDRs) are not available, there may still be enough information to compute an indirectly standardized measure. However, indirect standardization has intrinsic value and should be considered on its own merits, not solely as an approximation to direct standardization.
For indirect standardization, a standard set of age-specific death rates is assumed to apply to the observed population. For example, the age-specific U.S. death rates could be applied to the age-specific local area population. This technique yields an ‘‘expected’’ number of deaths in a population, assuming the standard set of ASDRs was operating in the population.
An indirect adjusted death rate (IADR) can be computed from the expected number of deaths, but the index most often used is the ratio of the observed to the expected number of deaths. This ratio is called the standardized mortality ratio (SMR). The mathematics of indirect standardization and an example of the calculation of an SMR are given in the appendix.
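A minimal sketch of the SMR calculation follows; all of the rates, population counts, and the observed death count are hypothetical, standing in for the appendix example.

```python
# Indirect standardization: apply a standard set of age-specific death
# rates (e.g., national rates) to the local population to get the
# "expected" number of deaths, then compare with the deaths observed.
standard_rates = {"<25": 2.0, "25-64": 8.0, "65+": 60.0}  # per 1,000 population
local_pop = {"<25": 5000, "25-64": 10000, "65+": 3000}
observed_deaths = 300

expected = sum(local_pop[age] * standard_rates[age] / 1000
               for age in local_pop)                      # 10 + 80 + 180 = 270
smr = observed_deaths / expected  # SMR > 1: more deaths than expected
print(expected, round(smr, 2))
```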
Reliability
Reliability is the consistency of a test as a measurement instrument – it ensures that a person who takes a test three times in a row will have scores close to one another, and also that two people of equal “intelligence” will have close scores on the test. Two types of reliability tests are usually performed:
Split-half Reliability: This involves dividing the test items into two halves, and then correlating scores on one half with scores on the other. The closer the correlation is to +1 – that is, the closer the two sets of scores track each other – the more reliable the test.
Equivalent-form reliability/Test-retest reliability: This method makes test-takers take equivalent versions of the same test a few times; if the score is similar every time, the test is deemed reliable over time.
A note about SATs: The SAT I is a time-hardened tradition, and although there is much controversy about the validity of its results, few argue against the reliability of the test. In test-retest reliability, students taking the test two or three times had an average range of about 50 points, relatively small compared to the total of 1600. What does this mean for takers? Usually, as some guidance counselors and enlightened parents will say, review courses and review books can help your score only to the extent that they familiarize you with the test format and timing. Otherwise, the review courses do not give statistically significant increases in SAT I scores; the tempered SAT I test has one of the most assured reliabilities of all standardized tests.
Validity
The validity of a test, usually considered the most important of the three factors, is how well the test really differentiates between students as it is supposed to – how accurate it is in its findings.
The content validity of a test reflects how well the test examines the entire range of material relevant to its supposed purpose. For example, the SAT I tests only verbal and mathematics sections – is this enough content validity to determine academic success? As for tests like the
Concurrent validity reflects how proficient a taker is in a certain field at the moment, while predictive validity determines whether the taker will be proficient in the field in the future. The SAT I proclaims to test “scholastic aptitude” – the potential for success in college of high school students – but is this really valid? Some cite statistics that high scorers on the SAT generally have high grade point averages in college, but others state that the reason for these higher accomplishments is that, with these higher scores, they went to superior colleges with better academics. Others note that high scorers on the SAT are often not superior to lower scorers on less tangible aspects of college life.
Finally, and most importantly, is construct validity, which reflects how well a certain test correlates with other valid tests. If a person does well on his or her CAT exams but does not do well on the SAT, does this mean one of the tests is invalid?
With the combination of all three factors, a perfect test can be formed. Unfortunately, there has been none so far – no IQ test or standardized examination has been agreed upon by all to be fully successful on all three. Perhaps this is impossible.
Norming Distributions and Standardization
Since most psychological tests are not mastery tests with criterion references which determine performance, a different way must be used to classify scores as low or high.
In order to assess overall performance, most psychological tests employ a standardization sample which allows the test makers to create a normal distribution which can be used for comparison of any specific future test score.
Standardization Sample: a large sample of test takers who represent the population for which the test is intended. This standardization sample is also referred to as the norm group (or norming group).
We convert the raw scores of the sample group into percentiles in order to construct a normal distribution to allow us to rank future test takers.
Norms are not standards of performance, but serve as a frame of reference for test score interpretation.
Norm groups can range in size from a few hundred to a hundred thousand people. The more people we use in our norm group, the closer the approximation to a normal population distribution we get.
Sampling methods for selecting a norming group
Sample must be representative: Test children if you are developing a test of children’s IQ; test adults if you are interested in assessing adult interests.
The closer the match between your sample and your intended population of test takers, the more accurate the distribution will be as a ranking guide.
Simple Random Sampling: every person in the target population has an equal chance of being in the standardization sample.
Stratified Sampling: The test developer takes into account all demographic variables that accurately describe the population of interest and then selects individuals at random, in proportion to the demographic portrait of the test population.
Most accurate way of developing norm group.
Common demographics to stratify: age, gender, socioeconomic status, geographic region.
Cluster Sampling: sampling begins by dividing a geographic region into blocks and then randomly sampling within those blocks.
More likely than random sampling to come up with a representative sample and less time consuming than stratified sampling.
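The sampling schemes above can be illustrated with a short stratified-sampling sketch; the strata, their sizes, and the function name are invented for illustration.

```python
import random

# Stratified sampling: draw from each demographic stratum in proportion
# to its share of the target population. Here the population of 1,000 is
# split 70/30 into two hypothetical strata.
population = {
    "urban": list(range(0, 700)),     # 70% of the population
    "rural": list(range(700, 1000)),  # 30% of the population
}

def stratified_sample(strata, n, seed=0):
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = []
    for name, members in strata.items():
        k = round(n * len(members) / total)  # proportional allocation
        sample.extend(rng.sample(members, k))
    return sample

s = stratified_sample(population, 100)
print(len(s))  # 100: 70 urban + 30 rural, mirroring the population
```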
Item Sampling
Often, test developers need to produce more than one version of a standardized test.
This is particularly important if you believe you will have an individual complete a psychological test more than once.
Item sampling refers to the procedure of giving two norm groups different items from the same exam.
This allows us to shorten the time it takes to conduct our representative sampling.
Difference between group norms and local norms:
Sometimes educators are interested in how students performed relative to other students in the same grade, or to students in adjacent districts.
For these purposes, test users will develop local norms for statistical comparison, rather than using the group norm supplied with the test. When scoring is done by computer, local norms can be developed easily.
Converting Raw scores into percentile ranks.
Remember, one major assumption in both psychology and psychological measurement is that all variables of psychological interest are normally distributed.
Since these variables fall into a normal distribution, we can specify what proportion of the population falls at or below (or at or above, or between) any score on a particular test. The average value is the midpoint of the distribution and has a percentile rank of 50 %. By knowing the mean (arithmetic average) and the standard deviation (average variation) of any psychological test, we can construct the normal distribution.
68 % of all scores fall within +/- 1 standard deviation from the mean.
About 95% of all scores fall within +/- 2 standard deviations from the mean.
IQ distribution has a mean of 100 and a standard deviation of 15.
Specific Types of Normal Distributions commonly used in psychology.
Psychologists refer to these distributions often because there is a common reference for understanding raw scores of these particular distributions: The Z distribution: The Z distribution has a mean of 0 and a standard deviation of 1.
Extremely easy to tell from a Z score: whether a score is above or below average (by the sign, positive or negative), and whether the score falls within the average or deviant ranges: -1 to +1 is an average score; -1 to -2 and +1 to +2 are below or above average; Z scores < -2 or > +2 are atypical scores (outliers).
The T distribution: Has a mean of 50 and a standard deviation of 10. Easy to tell from a T score:
Whether a score is above or below average (T<50 below average, T>50 above average)
How far above or below because standard deviation is in units of ten.
Sometimes preferred to Z because negative T values are extremely rare.
Converting Raw Scores to Z scores and reverse
Z = (Raw Score – Average) / Standard Deviation
Through simple algebra, we can isolate any term we are interested in solving for:
Raw Score = ( Z * SD) + Average
Average = Raw Score – (Z * SD)
SD = (Raw Score – Average)/ Z
Understanding this relationship, we can convert a z score into any type of distribution we like.
T = 10Z + 50
SAT score = 100Z + 500
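The conversion formulas above can be collected into a small sketch. The worked example uses the IQ distribution from the text (mean 100, SD 15); the function names are ours.

```python
# Converting raw scores among distributions, per the formulas above.
def z_score(raw, mean, sd):
    return (raw - mean) / sd      # Z = (raw - mean) / SD

def t_score(z):
    return 10 * z + 50            # T distribution: mean 50, SD 10

def sat_score(z):
    return 100 * z + 500          # SAT scale: mean 500, SD 100

# An IQ of 130 (mean 100, SD 15) is two SDs above average:
z = z_score(130, 100, 15)
print(z, t_score(z), sat_score(z))  # 2.0 70.0 700.0
```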
Parallel and Equated Tests
When more than one version of a standardized test is needed, alternate forms must be developed.
Parallel Forms: If the two tests have the same types and numbers of items of equal difficulty, the alternate versions are said to have parallel form.
Scores on parallel forms are highly correlated.
Parallel Forms are difficult to develop because the mean and standard deviation on both tests must be equivalent.
Equated Forms: When we can’t develop two alternate forms with the exact same mean and standard deviation, we can still compare tests of equivalent difficulty through the use of a common metric, for example the Z score distribution.
Item Response Theory can be used to equate the difficulty and discriminability of two tests through linking.