Science needs Statistics because things vary.
If the same conditions always produce the same result, you don’t need Statistics. But things do vary. Some variation is meaningful, some is not. Often the biggest challenge in science is to tell the difference between meaningful pattern and chance-like variability. (What’s the signal? What’s the noise?) Graphs and numerical summaries like averages and percentages can often reveal meaningful patterns that might otherwise remain hidden because of the variability.
Section 0.1 described our Seven-Step Method for statistical investigation. Step 4 is to describe your data. The goal of that step is to find summaries and plots that show patterns and help separate meaningful patterns from nuisance variation.
We start with a standard format for data: A data table (statistical spreadsheet) has one row for each observational unit, one column for each variable. Our goal as statisticians is to go from that table to numerical summaries and graphs that give us information about variability and pattern.
Example 0.2A: World Mental Health Survey
In Section 0.1 we talked about how Statistics is a discipline that guides us in weighing evidence about phenomena in the world around us. Typically, Statistics weighs evidence that comes in the form of data stored in a data file. The rows of the data file represent the observational units, which are the individuals (not necessarily people) being measured in the study. The columns represent the variables, the characteristics of the observational units. So, each entry in the data file gives the value of the variable for the observational unit of interest.
An ongoing effort of the World Mental Health Organization (WMHO) is to evaluate the frequency of mental health disorders and their impact on individuals in countries around the world.
Table 0.1 gives an example data file from a survey conducted by the WMHO of residents of the United States. The survey was conducted on a representative sample of 1,860 individuals living in the United States in 2001 to 2002. Notice the file is organized so each observational unit (in this case a person) occupies a single row of the data file. For example, the first row is an 18-year-old Hispanic male. The names or identifiers of the observational units are provided on the left-hand side of the table; in this case, they are ID #s. The number of observational units is 1,860 because that is the sample size. So, if Table 0.1 were complete, it would have 1,860 rows, one for each person.
Notice also how each column of the data file gives information on a different characteristic of each observational unit. The names of the variables for this data table are provided in blue at the top of each column of the data table. As these data are from a survey, most columns represent the answer to a single question on the survey. For example, the second column is the answer to a question asking for the respondent’s sex. It is important to note that sometimes variables don’t have information on all of the observational units. Notice the “Years Married” variable above: it only has information if a person’s marital status is “Married”; otherwise it contains nothing. Sometimes variables don’t contain values for some observational units for legitimate reasons, for example, when the variable does not apply to all observational units. In other cases, a variable might lack values for less legitimate reasons, such as a person accidentally skipping a question on a survey. Different statistical analysis programs represent “missing” data in different ways, including “NA”, “.”, or just an empty cell.
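The row/column layout and the handling of missing values can be sketched in code. This is a minimal illustration; the IDs and values below are hypothetical, not taken from the WMHO file:

```python
# A minimal sketch of a data table in the "one row per observational unit,
# one column per variable" format, with None standing in for missing data.
# The ID numbers and values are hypothetical, not from the WMHO survey.
rows = [
    {"id": 1, "sex": "Male",   "age": 18, "marital": "Never married", "years_married": None},
    {"id": 2, "sex": "Female", "age": 52, "marital": "Married",       "years_married": 30},
    {"id": 3, "sex": "Female", "age": 42, "marital": "Married",       "years_married": 15},
]

# Count how many observational units are missing a value for each variable.
variables = rows[0].keys()
missing = {v: sum(1 for r in rows if r[v] is None) for v in variables}
print(missing)  # years_married is missing for the never-married respondent
```

Note that None (Python's missing marker) appears only where the variable legitimately does not apply.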
TYPES OF VARIABLES, THEIR VALUES, AND THEIR DISTRIBUTIONS
As the name suggests, a variable varies; that is, it takes on different values for different cases. Depending on its values, a variable is either quantitative or categorical. For a quantitative variable, it makes sense to do arithmetic (add, subtract, etc.) with the values. Examples are height, weight, distance, and time. For a categorical variable the values are labels for which arithmetic does not make sense. Examples are sex, ethnicity, and eye color. The two kinds of variables lead to different kinds of summaries. For example, you can compute an average value or median for a quantitative variable like height, but not for a categorical variable like ethnicity. Much of the rest of this section illustrates some useful summaries, but first, you need the key idea of a distribution. Statistics relies on looking at a lot of cases all at once, rather than one case at a time. The key idea is the distribution of a variable: the set of values the variable takes, together with how often it takes each one.
For large datasets like the WMHO survey, it is hard to detect patterns among the thousands of cases just by looking at a list of values. By thinking instead of the distribution as a whole, we are led to various ways to describe, summarize and compare distributions, much as a naturalist would describe and compare different plants or animals.
Summaries for distributions
The most common summaries for distributions are either numerical or graphical. You don’t need a definition, because the names mean what you would expect, and you can get the idea from examples. Here are several based on the WMHO survey:
Numerical summaries, categorical variables:
The proportion of females in the survey is 0.553.
The proportion of Hispanics in the survey is 0.097.
Graphical summaries, categorical variables:
Numerical summaries, quantitative variables:
Average age for married individuals is 52.
Average age for those who have never married is 42.
Graphical summaries, quantitative variables:
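Numerical summaries like these can be computed directly from a data table. The sketch below uses a small made-up sample (the real survey has 1,860 rows); the values are hypothetical:

```python
# A sketch of the numerical summaries above, computed from a hypothetical
# six-person sample (the real WMHO file has 1,860 rows).
people = [
    {"sex": "Female", "marital": "Married",       "age": 50},
    {"sex": "Female", "marital": "Married",       "age": 54},
    {"sex": "Male",   "marital": "Never married", "age": 40},
    {"sex": "Female", "marital": "Never married", "age": 44},
    {"sex": "Male",   "marital": "Married",       "age": 52},
    {"sex": "Male",   "marital": "Never married", "age": 42},
]

# Categorical variable: summarize with a proportion.
prop_female = sum(p["sex"] == "Female" for p in people) / len(people)

# Quantitative variable: summarize with a group average.
married_ages = [p["age"] for p in people if p["marital"] == "Married"]
avg_married_age = sum(married_ages) / len(married_ages)

print(prop_female)      # 0.5
print(avg_married_age)  # 52.0
```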
Types of data
Just as a farmer gathers and processes a crop, a statistician gathers and processes data. For this reason the logo for the UK Royal Statistical Society is a sheaf of wheat. Like any farmer who knows instinctively the difference between oats, barley and wheat, a statistician becomes an expert at discerning different types of data. Some sections of this book refer to different data types and so we start by considering these distinctions. Figure 1.2 shows a basic summary of data types, although some data do not fit neatly into these categories.
Categorical or qualitative data
Nominal categorical data
Nominal or categorical data are data that one can name and put into categories. They are not measured but simply counted. They often consist of unordered ‘either–or’ type observations which have two categories and are often known as binary. For example: Dead or Alive; Male or Female; Cured or Not Cured; Pregnant or Not Pregnant. In Table 1.1 having a first-degree relative with cancer, or taking regular exercise, are binary variables. However, categorical data often can have more than two categories, for example: blood group O, A, B, AB; country of origin; ethnic group; or eye colour. In Table 1.1 marital status is of this type. The methods of presentation of nominal data are limited in scope. Thus, Table 1.1 merely gives the number and percentage of people by marital status.
Ordinal data
If there are more than two categories of classification it may be possible to order them in some way. For example, after treatment a patient may be either improved, the same or worse; a woman may never have conceived, conceived but spontaneously aborted, or given birth to a live infant. In Table 1.1 education is given in three categories: none or elementary school, middle school, college and above. Thus someone who has been to middle school has more education than someone from elementary school but less than someone from college. However, without further knowledge it would be wrong to ascribe a numerical quantity to position; one cannot say that someone who had middle school education is twice as educated as someone who had only elementary school education. This type of data is also known as ordered categorical data.
Ranks
In some studies it may be appropriate to assign ranks. For example, patients with rheumatoid arthritis may be asked to order their preference for four dressing aids. Here, although numerical values from 1 to 4 may be assigned to each aid, one cannot treat them as numerical values. They are in fact only codes for best, second best, third choice and worst.
Numerical or quantitative data
Count data
Table 1.1 gives details of the number of pregnancies each woman had had, and this is termed count data. Other examples are often counts per unit of time such as the number of deaths in a hospital per year, or the number of attacks of asthma a person has per month. In dentistry, a common measure is the number of decayed, filled or missing teeth (DFM).
Measured or numerical continuous
Such data are measurements that can, in theory at least, take any value within a given range. These data contain the most information, and are the ones most commonly used in statistics. Examples of continuous data in Table 1.1 are: age, years of menstruation and body mass index.
However, for simplicity, it is often the case in medicine that continuous data are dichotomised to make nominal data. Thus diastolic blood pressure, which is continuous, is converted into hypertension (>90 mmHg) and normotension (≤90 mmHg). This clearly leads to a loss of information. There are two main reasons for doing this. It is easier to describe a population by the proportion of people affected (for example, the proportion of people in the population with hypertension is 10%). Further, one often has to make a decision: if a person has hypertension, then they will get treatment, and this too is easier if the population is grouped.
One can also divide a continuous variable into more than two groups. In Table 1.1 per capita income is a continuous variable and it has been divided into four groups to summarise it, although a better choice may have been to split at the more convenient and memorable intervals of 4000, 6000 and 8000 yuan. The authors give no indication as to why they chose these cutoff points, and a reader has to be very wary to guard against the fact that the cuts may be chosen to make a particular point.
Interval and ratio scales
One can distinguish between interval and ratio scales. In an interval scale, such as body temperature or calendar dates, a difference between two measurements has meaning, but their ratio does not. If we measure temperature in degrees centigrade, we cannot say that a temperature of 20°C is twice as hot as a temperature of 10°C. In a ratio scale, however, such as body weight, a 10% increase implies the same weight increase whether expressed in kilograms or pounds. The crucial difference is that in a ratio scale, the value of zero has real meaning, whereas in an interval scale, the position of zero is arbitrary.
One difficulty with giving ranks to ordered categorical data is that one cannot assume that the scale is interval. Thus, as we have indicated when discussing ordinal data, one cannot assume that the risk of cancer for an individual educated to middle school level, relative to one educated only to primary school level, is the same as the risk for someone educated to college level, relative to someone educated to middle school level. Were Xu et al (2004) simply to score the three levels of education as 1, 2 and 3 in their subsequent analysis, then this would imply in some way that the intervals have equal weight.
How a statistician can help
Statistical ideas relevant to good design and analysis are not easy and we would always advise an investigator to seek the advice of a statistician at an early stage of an investigation. Here are some ways the medical statistician might help.
Sample size and power considerations
One of the commonest questions asked of a consulting statistician is: ‘How large should my study be?’ If the investigator has a reasonable amount of knowledge as to the likely outcome of a study, and potentially large resources of finance and time, then the statistician has tools available to enable a scientific answer to be made to the question. However, the usual scenario is that the investigator has either a grant of a limited size, or limited time, or a limited pool of patients. Nevertheless, given certain assumptions, the medical statistician is still able to help. For a given number of patients the probability of obtaining effects of a certain size can be calculated. If the outcome variable is simply success or failure, the statistician will need to know the anticipated percentage of successes in each group so that the difference between them can be judged to be of potential clinical relevance. If the outcome variable is a quantitative measurement, he will need to know the size of the difference between the two groups, and the expected variability of the measurement. For example, in a survey to see if patients with diabetes have raised blood pressure the medical statistician might say, ‘with 100 diabetics and 100 healthy subjects in this survey and a possible difference in blood pressure of 5 mmHg, with standard deviation of 10 mmHg, you have a 20% chance of obtaining a statistically significant result at the 5% level’. This statement means that one would anticipate that in only one study in five of the proposed size would a statistically significant result be obtained. The investigator would then have to decide whether it was sensible or ethical to conduct a trial with such a small probability of success. One option would be to increase the size of the survey until success (defined as a statistically significant result if a difference of 5 mmHg or more does truly exist) becomes more probable.
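This kind of power calculation can be sketched with a normal approximation. The function below is a rough sketch only, not the exact method a statistician would use (a t-based calculation would differ slightly, especially for small samples); it does show how power grows with sample size:

```python
from statistics import NormalDist

def two_sample_power(diff, sd, n_per_group, alpha=0.05):
    """Approximate power to detect a mean difference `diff` between two
    groups of size n_per_group with common standard deviation `sd`, for a
    two-sided test at level alpha. Normal approximation only: a sketch."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # critical value
    se = sd * (2 / n_per_group) ** 0.5                # SE of the difference
    return NormalDist().cdf(diff / se - z_crit)

# Power grows with sample size: a difference of 5 mmHg with sd 10 mmHg.
for n in (10, 25, 50, 100):
    print(n, round(two_sample_power(5, 10, n), 2))
```

Running this shows power rising from roughly 0.2 per group at n = 10 towards 0.9 and beyond as n grows, which is the trade-off the statistician is quantifying.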
Questionnaires
Rigby et al (2004), in their survey of original articles in three UK general practice journals, found that the most common design was that of a cross-sectional or questionnaire survey, with approximately one third of the articles classified as such.
For all but the smallest data sets it is desirable to use a computer for statistical analysis. The responses to a questionnaire will need to be easily coded for computer analysis and a medical statistician may be able to help with this. It is important to ask for help at an early stage so that the questionnaire can be piloted and modified before use in a study.
Choice of sample and of control subjects
The question of whether one has a representative sample is a typical problem faced by statisticians. For example, it used to be believed that migraine was associated with intelligence, perhaps on the grounds that people who used their brains were more likely to get headaches, but a subsequent population study failed to reveal any social class gradient and, by implication, any association with intelligence. The fallacy arose because intelligent people were more likely to consult their physician about migraine than the less intelligent.
In many studies an investigator will wish to compare patients suffering from a certain disease with healthy (control) subjects. The choice of the appropriate control population is crucial to a correct interpretation of the results.
Design of study
It has been emphasised that design deserves as much consideration as analysis, and a statistician can provide advice on design. In a clinical trial, for example, what is known as a double-blind randomised design is nearly always preferable, but not always achievable. If the treatment is an intervention, such as a surgical procedure, it might be impossible to prevent individuals knowing which treatment they are receiving, but it should be possible to shield their assessors from knowing.
Laboratory experiments
Medical investigators often appreciate the effect that biological variation has in patients, but overlook or underestimate its presence in the laboratory. In dose–response studies, for example, it is important to assign treatment at random, whether the experimental units are humans, animals or test tubes. A statistician can also advise on quality control of routine laboratory measurements and the measurement of within- and between-observer variation.
Displaying data
A well-chosen figure or graph can summarise the results of a study very concisely. A statistician can help by advising on the best methods of displaying data. For example, when plotting histograms, the choice of the group interval can affect the shape of the plotted distribution; with too wide an interval, important features of the data will be obscured; with too narrow an interval, random variation in the data may distract attention from the shape of the underlying distribution.
Choice of summary statistics and statistical analysis
The summary statistics used and the analysis undertaken must reflect the basic design of the study and the nature of the data. In some situations, for example, a median is a better measure of location than a mean. In a matched study, it is important to produce an estimate of the difference between matched pairs, and an estimate of the reliability of that difference. For example, in a study to examine blood pressure measured in a seated patient compared with that measured when he is lying down, it is insufficient simply to report statistics for seated and lying positions separately. The important statistic is the change in blood pressure as the patient changes position and it is the mean and variability of this difference that we are interested in. This is further discussed in Chapter 8. A statistician can advise on the choice of summary statistics, the type of analysis and the presentation of the results.
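The seated-versus-lying example can be sketched in code. The blood pressure values below are hypothetical, invented for illustration:

```python
from statistics import mean, stdev

# Hypothetical blood pressures (mmHg) for five patients, seated then lying.
seated = [128, 134, 122, 140, 131]
lying  = [124, 128, 121, 133, 129]

# Insufficient summary for a matched design: separate statistics per position.
print(mean(seated), mean(lying))  # 131 127

# The right summary: the within-patient change and its variability.
diffs = [s - l for s, l in zip(seated, lying)]
print(mean(diffs), round(stdev(diffs), 2))  # mean change 4 mmHg, sd 2.55
```

The mean difference equals the difference of means here, but its standard deviation (2.55 mmHg) is far smaller than the spread of either column, which is exactly what the matched design buys you.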
STATISTICAL DISTRIBUTIONS
Every statistics book provides a listing of statistical distributions, with their properties, but browsing through these choices can be frustrating to anyone without a statistical background, for two reasons. First, the choices seem endless, with dozens of distributions competing for your attention, with little or no intuitive basis for differentiating between them. Second, the descriptions tend to be abstract and emphasize statistical properties such as the moments, characteristic functions and cumulative distributions. In this appendix, we will focus on the aspects of distributions that are most useful when analyzing raw data and trying to fit the right distribution to that data.
Fitting the Distribution
When confronted with data that needs to be characterized by a distribution, it is best to start with the raw data and answer four basic questions about the data that can help in the characterization. The first relates to whether the data can take on only discrete values or whether the data is continuous; whether a new pharmaceutical drug gets FDA approval or not is a discrete value, but the revenues from the drug represent a continuous variable. The second looks at the symmetry of the data and, if there is asymmetry, which direction it lies in; in other words, are positive and negative outliers equally likely, or is one more likely than the other? The third question is whether there are upper or lower limits on the data; there are some data items, like revenues, that cannot be lower than zero, whereas there are others, like operating margins, that cannot exceed a value (100%). The final and related question relates to the likelihood of observing extreme values in the distribution; in some data, the extreme values occur very infrequently, whereas in others, they occur more often.
Is the data discrete or continuous?
The first and most obvious categorization of data should be on whether the data is restricted to taking on only discrete values or if it is continuous. Consider the inputs into a typical project analysis at a firm. Most estimates that go into the analysis come from distributions that are continuous; market size, market share and profit margins, for instance, are all continuous variables. There are some important risk factors, though, that can take on only discrete forms, including regulatory actions and the threat of a terrorist attack; in the first case, the regulatory authority may dispense one of two or more decisions which are specified up front and in the latter, you are subjected to a terrorist attack or you are not.
With discrete data, the entire distribution can either be developed from scratch or the data can be fitted to a pre-specified discrete distribution. With the former, there are two steps to building the distribution. The first is identifying the possible outcomes and the second is assigning probabilities to each outcome. As we noted in the text, we can draw on historical data or experience as well as specific knowledge about the investment being analyzed to arrive at the final distribution. This process is relatively simple to accomplish when there are a few outcomes with a well-established basis for estimating probabilities but becomes more tedious as the number of outcomes increases. If it is difficult or impossible to build up a customized distribution, it may still be possible to fit the data to one of the following discrete distributions:
a. Binomial distribution: The binomial distribution measures the probabilities of the number of successes over a given number of trials with a specified probability of success in each try. In the simplest scenario of a coin toss (with a fair coin), where the probability of getting a head with each toss is 0.50 and there are a hundred trials, the binomial distribution will measure the likelihood of getting anywhere from no heads in a hundred tosses (very unlikely) to 50 heads (the most likely) to 100 heads (also very unlikely). The binomial distribution in this case will be symmetric, reflecting the even odds; as the probabilities shift from even odds, the distribution will get more skewed. Figure 6A.1 presents binomial distributions for three scenarios – two with 50% probability of success and one with a 70% probability of success and different trial sizes.
Figure 6A.1: Binomial Distribution
As the probability of success is varied (from 50%) the distribution will also shift its shape, becoming positively skewed for probabilities less than 50% and negatively skewed for probabilities greater than 50%.
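The binomial probabilities for the coin-toss example can be computed directly from the standard formula; a sketch:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success prob p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A fair coin tossed 100 times: 50 heads is the most likely single count,
# while 0 (or 100) heads is astronomically unlikely.
print(binomial_pmf(50, 100, 0.5))  # about 0.08
print(binomial_pmf(0, 100, 0.5))   # about 8e-31
```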
b. Poisson distribution: The Poisson distribution measures the likelihood of a number of events occurring within a given time interval, where the key parameter that is required is the average number of events in the given interval (λ). The resulting distribution looks similar to the binomial, with the skewness being positive but decreasing with λ. Figure 6A.2 presents three Poisson distributions, with λ ranging from 1 to 10.
Figure 6A.2: Poisson Distribution
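The Poisson probabilities follow from the standard formula; a minimal sketch:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(k events in an interval whose average event count is lam)."""
    return exp(-lam) * lam**k / factorial(k)

# With lam = 1 the distribution is strongly positively skewed;
# with lam = 10 it is much more symmetric, as in Figure 6A.2.
print(poisson_pmf(0, 1))    # about 0.368
print(poisson_pmf(10, 10))  # about 0.125
```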
c. Negative Binomial distribution: Returning again to the coin toss example, assume that you hold the number of successes fixed at a given number and estimate the number of tries you will have before you reach the specified number of successes. The resulting distribution is called the negative binomial and it very closely resembles the Poisson. In fact, the negative binomial distribution converges on the Poisson distribution, but will be more skewed to the right (positive values) than the Poisson distribution with similar parameters.
d. Geometric distribution: Consider again the coin toss example used to illustrate the binomial. Rather than focus on the number of successes in n trials, assume that you were measuring the likelihood of when the first success will occur. For instance, with a fair coin toss, there is a 50% chance that the first success will occur at the first try, a 25% chance that it will occur on the second try and a 12.5% chance that it will occur on the third try. The resulting distribution is positively skewed and looks as follows for three different probability scenarios (in Figure 6A.3):
Figure 6A.3: Geometric Distribution
Note that the distribution is steepest with high probabilities of success and flattens out as the probability decreases. However, the distribution is always positively skewed.
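The probabilities quoted above (50%, 25%, 12.5%) follow directly from the geometric formula; a sketch:

```python
def geometric_pmf(k, p):
    """P(first success occurs on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

# A fair coin: 50% on the first try, 25% on the second, 12.5% on the third.
print([geometric_pmf(k, 0.5) for k in (1, 2, 3)])  # [0.5, 0.25, 0.125]
```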
e. Hypergeometric distribution: The hypergeometric distribution measures the probability of a specified number of successes in n trials, without replacement, from a finite population. Since the sampling is without replacement, the probabilities can change as a function of previous draws. Consider, for instance, the possibility of getting four face cards in hand of ten, over repeated draws from a pack. Since there are 16 face cards and the total pack contains 52 cards, the probability of getting four face cards in a hand of ten can be estimated. Figure 6A.4 provides a graph of the hypergeometric distribution:
Figure 6A.4: Hypergeometric Distribution
Note that the hypergeometric distribution converges on the binomial distribution as the population size increases.
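The face-card example can be computed directly from the hypergeometric formula; a sketch:

```python
from math import comb

def hypergeom_pmf(k, K, n, N):
    """P(k successes in a draw of n, without replacement, from a
    population of N items containing K successes)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# The example in the text: exactly 4 face cards in a hand of 10,
# drawn from a 52-card pack containing 16 face cards.
print(hypergeom_pmf(4, 16, 10, 52))  # about 0.22
```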
f. Discrete uniform distribution: This is the simplest of discrete distributions and applies when all of the outcomes have an equal probability of occurring. Figure 6A.5 presents a uniform discrete distribution with five possible outcomes, each occurring 20% of the time:
Figure 6A.5: Discrete Uniform Distribution
The discrete uniform distribution is best reserved for circumstances where there are multiple possible outcomes, but no information that would allow us to expect that one outcome is more likely than the others.
With continuous data, we cannot specify all possible outcomes, since they are too numerous to list, but we have two choices. The first is to convert the continuous data into a discrete form and then go through the same process that we went through for discrete distributions of estimating probabilities. For instance, we could take a variable such as market share and break it down into discrete blocks – market share between 3% and 3.5%, between 3.5% and 4% and so on – and consider the likelihood that we will fall into each block. The second is to find a continuous distribution that best fits the data and to specify the parameters of the distribution. The rest of the appendix will focus on how to make these choices.
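The first choice, converting continuous data into discrete blocks, might be sketched as follows. The market-share observations and the block width here are made up for illustration:

```python
from collections import Counter

# Hypothetical market-share observations (%), to be binned into
# half-point blocks: 3.0-3.5, 3.5-4.0, and so on.
shares = [3.1, 3.4, 3.6, 3.8, 3.9, 4.2, 3.2, 3.7]

def block(x, width=0.5, start=3.0):
    """Return the lower edge of the block containing x."""
    return start + width * int((x - start) // width)

counts = Counter(block(x) for x in shares)
total = len(shares)
for lo in sorted(counts):
    # Estimated probability of falling into each block.
    print(f"{lo:.1f}-{lo + 0.5:.1f}: {counts[lo] / total:.3f}")
```

The printed proportions are exactly the "likelihood that we will fall into each block" described above, estimated from the data.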
How symmetric is the data?
There are some datasets that exhibit symmetry, i.e., the upside is mirrored by the downside. The symmetric distribution that most practitioners are familiar with is the normal distribution, shown in Figure 6A.6, for a range of parameters:
Figure 6A.6: Normal Distribution
The normal distribution has several features that make it popular. First, it can be fully characterized by just two parameters – the mean and the standard deviation – and thus reduces estimation pain. Second, the probability of any value occurring can be obtained simply by knowing how many standard deviations separate the value from the mean; the probability that a value will fall within 2 standard deviations of the mean is roughly 95%. The normal distribution is best suited for data that, at the minimum, meets the following conditions:
a. There is a strong tendency for the data to take on a central value.
b. Positive and negative deviations from this central value are equally likely.
c. The frequency of the deviations falls off rapidly as we move further away from the central value.
The last two conditions show up when we compute the parameters of the normal distribution: the symmetry of deviations leads to zero skewness, and the low probabilities of large deviations from the central value reveal themselves in an excess kurtosis of zero.
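The "within 2 standard deviations" rule of thumb is easy to verify numerically; a sketch using Python's statistics module:

```python
from statistics import NormalDist

# Probability of falling within k standard deviations of the mean,
# for a standard normal (mean 0, sd 1).
z = NormalDist()
for k in (1, 2, 3):
    p = z.cdf(k) - z.cdf(-k)
    print(k, round(p, 4))  # about 0.68, 0.95, and 0.997 respectively
```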
There is a cost we pay, though, when we use a normal distribution to characterize data that is non-normal, since the probability estimates that we obtain will be misleading and can do more harm than good. One obvious problem is when the data is asymmetric, but another potential problem is when the probabilities of large deviations from the central value do not drop off as precipitously as required by the normal distribution. In statistical language, the actual distribution of the data has fatter tails than the normal. While all symmetric distributions are like the normal in that the upside mirrors the downside, they vary in shape, with some having fatter tails than the normal and others more accentuated peaks. The fat-tailed distributions are characterized as leptokurtic, and you can consider two examples. One is the logistic distribution, which has longer tails and a higher kurtosis (1.2, as compared to 0 for the normal distribution); the other is the Cauchy distribution, which also exhibits symmetry and higher kurtosis and is characterized by a scale parameter that determines how fat the tails are. Figure 6A.7 presents a series of Cauchy distributions that exhibit the bias towards fatter tails, or more outliers, than the normal distribution.
Figure 6A.7: Cauchy Distribution
Either the logistic or the Cauchy distributions can be used if the data is symmetric but with extreme values that occur more frequently than you would expect with a normal distribution.
As the probabilities of extreme values increase relative to the central value, the distribution will flatten out. At its limit, assuming that the data stays symmetric and we put limits on the extreme values on both sides, we end up with the uniform distribution, shown in Figure 6A.8:
Figure 6A.8: Uniform Distribution
When is it appropriate to assume a uniform distribution for a variable? One possible scenario is when you have a measure of the highest and lowest values that a data item can take but no real information about where within this range the value may fall. In other words, any value within that range is just as likely as any other value.
Most data does not exhibit symmetry and instead skews towards either very large positive or very large negative values. If the data is positively skewed, one common choice is the lognormal distribution, which is typically characterized by three parameters: a shape parameter (σ or sigma), a scale parameter (m or median) and a shift parameter (θ). When m = 0 and σ = 1, you have the standard lognormal distribution, and when θ = 0, the distribution requires only the scale and sigma parameters. As the sigma rises, the peak of the distribution shifts to the left and the skewness in the distribution increases. Figure 6A.9 graphs lognormal distributions for a range of parameters:
Figure 6A.9: Lognormal distribution
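The link between sigma and skewness can be seen by simulation: a lognormal draw is just the exponential of a normal draw, and as sigma rises the mean is pulled further above the median. A sketch (the parameter values are arbitrary):

```python
import random

# Simulate lognormal draws for increasing sigma and compare the
# median with the mean: the gap widens as the positive skew grows.
random.seed(1)
for sigma in (0.25, 0.5, 1.0):
    draws = sorted(random.lognormvariate(0, sigma) for _ in range(100_000))
    median = draws[len(draws) // 2]
    mean = sum(draws) / len(draws)
    print(sigma, round(median, 2), round(mean, 2))  # mean pulls above the median
```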
The Gamma and Weibull distributions are two distributions that are closely related to the lognormal distribution; like the lognormal distribution, changing the parameter levels (shape, shift and scale) can cause the distributions to change shape and become more or less skewed. In all of these functions, increasing the shape parameter will push the distribution towards the left. In fact, at high values of sigma, the left tail disappears entirely and the outliers are all positive. In this form, these distributions all resemble the exponential, characterized by a location (m) and scale parameter (b), as is clear from Figure 6A.10.
Figure 6A.10: Weibull Distribution
The question of which of these distributions will best fit the data will depend in large part on how severe the asymmetry in the data is. For moderate positive skewness, where there are both positive and negative outliers but the former are larger and more common, the standard lognormal distribution will usually suffice. As the skewness becomes more severe, you may need to shift to a three-parameter lognormal distribution or a Weibull distribution, and modify the shape parameter till it fits the data. At the extreme, if there are no negative outliers and only positive outliers in the data, you should consider the exponential function, shown in Figure 6A.11:
Figure 6A.11: Exponential Distribution
If the data exhibits negative skewness, the choices of distributions are more limited. One possibility is the Beta distribution, which has two shape parameters (p and q) and upper and lower bounds on the data (a and b). Altering these parameters can yield distributions that exhibit either positive or negative skewness, as shown in Figure 6A.12:
Figure 6A.12: Beta Distribution
Are there upper or lower limits on data values?
There are often natural limits on the values that data can take on. As we noted earlier, the revenues and the market value of a firm cannot be negative and the profit margin cannot exceed 100%. Using a distribution that does not constrain the values to these limits can create problems. For instance, using a normal distribution to describe profit margins can sometimes result in profit margins that exceed 100%, since the distribution has no limits on either the downside or the upside.
When data are constrained, the questions that need to be answered are whether the constraints apply on one side of the distribution or on both and, if so, what the limits on values are. Once these questions have been answered, there are two choices. One is to find a continuous distribution that conforms to the constraints. For instance, the lognormal distribution can be used to model data, such as revenues and stock prices, that can never be less than zero. For data with both upper and lower limits, you could use the uniform distribution (if the probabilities are even across outcomes) or a triangular distribution (if the data are clustered around a central value). Figure 6A.14 presents a triangular distribution:
Figure 6A.14: Triangular Distribution
An alternative approach is to use a continuous distribution that normally allows data to take on any value and to impose upper and lower limits on the values the data can assume. Note that the cost of imposing these constraints is small in distributions like the normal, where the probabilities of extreme values are very small, but it increases as the distribution exhibits fatter tails.
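As a sketch of this second approach, under the assumption that simple rejection sampling is acceptable, the draws below come from a normal distribution but are constrained to stated bounds (the margin figures are made up for illustration):

```python
import random

def truncated_gauss(mu, sigma, lower, upper):
    """Draw from Normal(mu, sigma) restricted to [lower, upper] by rejection."""
    while True:
        x = random.gauss(mu, sigma)
        if lower <= x <= upper:
            return x

random.seed(1)
# Profit margins: assumed Normal(10%, 5%) but constrained to [-100%, +100%].
margins = [truncated_gauss(0.10, 0.05, -1.0, 1.0) for _ in range(10_000)]
print(all(-1.0 <= m <= 1.0 for m in margins))  # True
```

With thin-tailed distributions like this one, almost no draws are rejected; with fat-tailed distributions the rejection rate, and hence the distortion introduced by the constraint, grows.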
How likely are you to see extreme values of data, relative to the middle values?
As we noted in the earlier section, a key consideration in choosing a distribution to describe the data is the likelihood of extreme values relative to the middle values. In the case of the normal distribution, this likelihood is small, and it increases as you move to the logistic and Cauchy distributions. While it may often be more realistic to use the latter to describe real-world data, the benefits of a better distributional fit have to be weighed against the ease with which parameters can be estimated from the normal distribution. Consequently, it may make sense to stay with the normal distribution for symmetric data, unless the likelihood of extreme values rises above a threshold.
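The difference in tail likelihood can be made concrete with closed-form tail probabilities: a standard normal puts roughly 0.1% of its mass above 3, while a standard Cauchy puts roughly 10% there. A small sketch using only the math module:

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def cauchy_tail(z):
    """P(X > z) for a standard Cauchy."""
    return 0.5 - math.atan(z) / math.pi

print(round(normal_tail(3), 5))  # 0.00135
print(round(cauchy_tail(3), 5))  # 0.10242
```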
The same considerations apply for skewed distributions, though the concern will generally be more acute on the skewed side of the distribution. In other words, with a positively skewed distribution, the choice of distribution will depend on how much more likely large positive values are than large negative values, with the fit ranging from the lognormal to the exponential.
Relative values
Statistical processing of data on disease, mortality rate, lethality, and similar phenomena yields absolute numbers, which specify how often each phenomenon occurs. Although absolute numbers have a certain cognitive value, their use is limited. To determine the level of a phenomenon, or to compare a parameter over time or with the parameter of another territory, it is necessary to calculate relative values (parameters, factors), which express the ratio of statistical numbers to one another. The basic arithmetic operation in the calculation of relative values is division.
In medical statistics the following kinds of relative parameters are used:
— Extensive;
— Intensive;
— Relative intensity;
— Visualization;
— Correlation.
The extensive parameter is used to determine the structure of disease (mortality rate, lethality, etc.).
The extensive parameter, or parameter of distribution, characterizes the parts of a phenomenon (its structure); that is, it shows what part of the total number of all diseases (or deaths) is accounted for by a particular disease included in the total.
Using this parameter, it is possible to determine the structure of patients according to age, social status, etc. It is customary to express this parameter as a percentage, but it can also be calculated in parts per thousand when the share of the given disease is so small that, expressed as a percentage, it would be a decimal fraction rather than an integer.
The general formula for its calculation is the following:

extensive parameter = (part of the phenomenon / the whole phenomenon) × 100%
The technique of calculating an extensive parameter is shown in the following example.
Determine the age structure of those who visited a polyclinic, given the following data.
The number of visitors, 1,500, is taken as 100%, and the number of patients in each age group as X. From this, the percentage of polyclinic visitors aged 15–19 out of the total number is (150 × 100) / 1,500 = 10.0%.
Table 2.5 Age groups of people who visited the polyclinic

Age group       Absolute number    % of the total
15–19                 150               10.0
20–29                 375               25.0
30–39                 300               20.0
40–49                 345               23.0
50–59                 150               10.0
60 and older          180               12.0
Total                1500              100.0
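The percentages in Table 2.5 can be recomputed directly from the absolute numbers; a short sketch in Python (the dictionary keys are just labels for the age groups):

```python
# Absolute numbers from Table 2.5; each percentage is an extensive parameter.
visits = {"15-19": 150, "20-29": 375, "30-39": 300,
          "40-49": 345, "50-59": 150, "60+": 180}
total = sum(visits.values())
structure = {age: 100 * n / total for age, n in visits.items()}

print(total)               # 1500
print(structure["15-19"])  # 10.0
print(structure["20-29"])  # 25.0
```

The shares necessarily sum to 100%, which is a useful check on the arithmetic.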
Conclusion: most of the people who visited the polyclinic were aged 20–29 and 40–49.
The extensive parameter must be used with care in analysis: it characterizes only the structure of a phenomenon in a given place at a given time. Comparing structures makes it possible to speak only about a change in the rank order of the given diseases within the overall structure of diseases.
If it is necessary to determine how widespread a phenomenon is, intensive parameters are used.
The intensive parameter characterizes the frequency or spread of a phenomenon.
It shows how frequently the given phenomenon occurs in the given environment.
For example, how frequently a particular disease occurs among the population, or how frequently people die from a particular disease.
To calculate an intensive parameter, it is necessary to know the size of the population or contingent in which the phenomenon occurs.
The general formula for the calculation is the following:

intensive parameter = (phenomenon / environment) × 100 (or 1,000; 10,000; 100,000)
Intensive parameters are usually calculated per 1,000 persons; parameters of birth rate, morbidity, mortality, and the like are expressed this way. For an individual disease they are calculated per 10,000 persons, and for a disease that occurs seldom, per 100,000 persons.
Let us consider the technique of its calculation in an example.
Example. The number of deaths in the area is 175; the population numbers 24,000 at the beginning of the year and 26,000 at the end of the year. Determine the mortality rate.

General mortality rate = (number of deaths during the year × 1,000) / number of the population

First we determine the average population for the year: we take the population at the beginning of the year plus the population at the end of the year and divide by 2: (24,000 + 26,000) / 2 = 25,000.

We then set up a proportion: 175 deaths correspond to 25,000 people; how many deaths (X) correspond to 1,000?

175 — 25,000
X — 1,000

X = (175 × 1,000) / 25,000 = 7.0 deaths per 1,000 population.
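The mortality-rate calculation above can be written out as a few lines of Python:

```python
deaths = 175
pop_start, pop_end = 24_000, 26_000

mean_population = (pop_start + pop_end) / 2         # average population: 25,000
mortality_per_1000 = deaths * 1000 / mean_population

print(mortality_per_1000)  # 7.0 deaths per 1,000 population
```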
Parameters of birth rate, morbidity, and the like are calculated similarly.
Table 2.6 Structure of morbidity, invalidity and causes of mortality

Disease                      Morbidity (%)  Invalidity (%)  Causes of death (%)  Index of relative intensity:
                                                                                 invalidity    causes of death
Traumas                          12.0            8.0              30.0              0.35            2.0
Heart and vessel diseases         4.0           27.0              19.0              6.76            4.75
Diseases of nervous system        6.0            8.0               —                1.33             —
Poisonings                        0.3             —                0.4               —              13.3
Tuberculosis                      0.5            5.0               5.5             10.0             11.0
Other                            74.2           52.0              41.5              0.7             0.56
Total                           100.0          100.0             100.0               —               —
Parameters of relative intensity represent a numerical ratio of two or several structures of the same elements of the set being studied.
They make it possible to determine the degree of conformity (excess or shortfall) of similar attributes and are used as an auxiliary technique in cases where direct intensive parameters cannot be obtained, or where the degree of disproportion in the structures of two or more related processes must be measured.
For example, suppose only data on the structure of general morbidity, disability, and mortality are available.
Comparing these structures and calculating parameters of relative intensity makes it possible to establish the relative importance of particular diseases in the health parameters of the population.
So, for example, comparing the shares of cardiovascular diseases in disability and mortality with their share in morbidity shows that cardiovascular diseases occupy an almost 7 times larger part in disability, and an almost 5 times larger part in mortality, than in the structure of morbidity.
The procedure for calculating these parameters is the following.
For example, the shares of cardiovascular diseases in the structures are:
— general morbidity: 4.0%;
— disability: 27.0%;
— causes of mortality: 19.0%.
The index of relative intensity of disability is the share in disability divided by the share in morbidity: 27.0 / 4.0 = 6.75.
The parameter of relative intensity of mortality is obtained in the same way: 19.0 / 4.0 = 4.75.
Thus, parameters of relative intensity express the disproportion between the shares of the same elements in the structures of the processes being studied.
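The indices of relative intensity for cardiovascular disease can be recomputed from the shares quoted above; note that exact division gives 6.75 where the source table shows 6.76, a difference of rounding:

```python
# Shares (%) of cardiovascular disease in each structure, as quoted above.
share_morbidity = 4.0
share_disability = 27.0
share_mortality = 19.0

index_disability = share_disability / share_morbidity
index_mortality = share_mortality / share_morbidity

print(index_disability)  # 6.75
print(index_mortality)   # 4.75
```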
The parameter of correlation characterizes the relation between unlike quantities.
For example, the parameter of provision of the population with hospital beds, nurses, etc.
The technique of calculating the correlation parameter is the same as for the intensive parameter; the difference is that in the intensive parameter the numerator is part of the set in the denominator, whereas in the correlation parameter the numerator and denominator are different, unrelated sets.
The parameter of visualization characterizes the relation of each of the compared values to an initial level taken as 100. It is used for convenience of comparison, and also when one wants to show the direction of a process (increase or decrease) without showing its actual level or size.
It can be used to characterize the dynamics of phenomena, for comparisons across territories or population groups, and for constructing graphs.
Table 2.7 Example: polyclinic visits expressed as visualization parameters

Polyclinic    Number of visits    Parameter of visualization (№ 1 = 100%)
№ 1                 850                       100.0
№ 2                 920                       108.1
№ 3                 990                       116.1
№ 4                1200                       141.1
№ 5                1290                       151.7
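The visualization parameters in Table 2.7 can be recomputed from the visit counts. Exact division differs from the printed table in the last decimal (e.g., 108.2 rather than 108.1), which appears to be rounding in the source:

```python
visits = {"No.1": 850, "No.2": 920, "No.3": 990, "No.4": 1200, "No.5": 1290}
base = visits["No.1"]  # polyclinic No.1 is taken as 100%

visualization = {clinic: round(100 * n / base, 1) for clinic, n in visits.items()}
print(visualization["No.1"])  # 100.0
print(visualization["No.2"])  # 108.2
```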
Visualization parameters can be calculated from absolute numbers, intensive parameters, parameters of correlation, and average values, but not from extensive parameters, given what was said above about that parameter.
For practical purposes it is enough to calculate parameters to within one tenth.
To determine the tenths digit, the calculation must be carried to the second digit after the decimal point.
If that second digit is greater than five, the first digit after the point is increased by one; if it is less than five, it remains unchanged.
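The rounding rule can be sketched with Python's decimal module; note that ROUND_HALF_UP also rounds an exact trailing 5 upward, a case the rule above leaves unspecified:

```python
from decimal import Decimal, ROUND_HALF_UP

def to_tenths(x):
    """Round to one decimal place, carrying the calculation to hundredths first."""
    return Decimal(str(x)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

print(to_tenths(7.26))  # 7.3 (second digit > 5: tenths digit increases)
print(to_tenths(7.24))  # 7.2 (second digit < 5: tenths digit unchanged)
```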
Relative value studies are lists of "relative values" of different professional services. Antitrust enforcement agencies have recently filed several complaints against the promulgation of relative value studies by professional organizations of physicians, alleging that these lists have been used to fix and increase physician fees. This article examines the status of professionally sponsored relative value studies under antitrust law and suggests several reasons why they should be held unlawful. In particular, such relative value studies threaten to eliminate desirable competition among private third-party payers in the development of effective cost-containment strategies, as well as among physicians in the setting of fees. Moreover, the alleged benefits of professionally sponsored relative value studies could be achieved by alternative means that do not similarly restrict competition in the provision of medical services.
A relative value unit (RVU) is a comparable service measure used by hospitals to permit comparison of the amounts of resources required to perform various services within a single department or between departments. It is determined by assigning weights to such factors as personnel time, level of skill, and sophistication of the equipment required to render patient services. RVUs are commonly used in physician bonus plans based partly on productivity.
Describing and displaying categorical data

Summary
This chapter illustrates methods of summarising and displaying binary and categorical data. It covers proportions, risk and rates, relative risk, and odds ratios. The importance of considering the absolute risk difference as well as the relative risk is emphasized.
Summarising categorical data
Binary data are the simplest type of data. Each individual has a label which takes one of two values. A simple summary would be to count the different types of label. However, a raw count is rarely useful. Furness et al (2003) reported more accidents to white cars than to any other colour car in Auckland, New Zealand over a 1-year period. As a consequence, a New Zealander may think twice about buying a white car! However, it turns out that there are simply more white cars on the Auckland roads than any other colour. It is only when this count is expressed as a proportion that it becomes useful. When Furness et al (2003) looked at the proportion of white cars that had accidents compared to the proportion of all cars that had accidents, they found the proportions very similar and so white cars are not more dangerous than other colours. Hence the first step to analysing categorical data is to count the number of observations in each category and express them as proportions of the total sample size. Proportions are a special example of a ratio. When time is also involved (as in counts per year) then it is known as a rate. These distinctions are given below.
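A sketch with made-up counts in the spirit of the white-car example shows why raw counts mislead until they are expressed as proportions:

```python
# Made-up counts: white cars have more accidents in total,
# but only because there are more white cars on the road.
accidents = {"white": 145, "red": 85}
cars_on_road = {"white": 2900, "red": 1700}

accident_rate = {c: accidents[c] / cars_on_road[c] for c in accidents}
print(accident_rate["white"])  # 0.05
print(accident_rate["red"])    # 0.05 - the same proportion
```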
Labelling binary outcomes
For binary data it is common to call the outcome ‘an event’ and ‘a non-event’. So having a car accident in Auckland, New Zealand may be an ‘event’. We often score an ‘event’ as 1 and a ‘non-event’ as 0. These may also be referred to as a ‘positive’ or ‘negative’ outcome, or ‘success’ and ‘failure’. It is important to realise that these terms are merely labels and the main outcome of interest might be a success in one context and a failure in another. Thus in a study of a potentially lethal disease the outcome might be death, whereas in a disease that can be cured it might be being alive.
Comparing outcomes for binary data
Many studies involve a comparison of two groups. We may wish to combine simple summary measures to give a summary measure which in some way shows how the groups differ. Given two proportions one can either subtract one from the other, or divide one by the other.
Suppose the results of a clinical trial, with a binary categorical outcome (positive or negative), to compare two treatments (a new test treatment versus a control) are summarised in a 2 × 2 contingency table as in Table 2.3. Then the results of this trial can be summarised in a number of ways.
The ways of summarising the data presented in Table 2.3 are given below.
Each of the above measures summarises the study outcomes, and the one chosen may depend on how the test treatment behaves relative to the control. Commonly, one may choose an absolute risk difference for a clinical trial and a relative risk for a prospective study. In general the relative risk is independent of how common the risk factor is: smoking increases one’s risk of lung cancer by a factor of 10, and this is true in countries with a high smoking prevalence and countries with a low smoking prevalence. However, in a clinical trial, we may be interested in what reduction in the proportion of people with poor outcome a new treatment will make.
Summarising binary data – odds and odds ratios
A further method of summarising the results is to use the odds of an event rather than the probability. The odds of an event are defined as the ratio of the probability of occurrence of the event to the probability of nonoccurrence, that is, p/(1 − p).
Using the notation of Table 2.3, the odds of an outcome for the test group are a/c and the odds of an outcome for the control group are b/d. The odds ratio (OR) is the ratio of these two odds:

OR = (a/c) / (b/d) = ad / (bc)
When the probability of an event happening is rare, the odds and the probabilities are close, because then a is much smaller than c, so a/(a + c) is approximately a/c, and b is much smaller than d, so b/(b + d) is approximately b/d. Thus the OR approximates the RR when the successes are rare (say, with a maximum incidence less than 10% of either pTest or pControl). Sometimes the odds ratio is referred to as ‘the approximate relative risk’. The approximation is demonstrated in Table 2.5.
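A hypothetical 2 × 2 table (the counts are invented for illustration, in the a, b, c, d notation of Table 2.3) shows the approximation at work:

```python
# Invented counts: a, c = events / non-events in the test group;
# b, d = events / non-events in the control group (Table 2.3 notation).
a, c = 10, 990
b, d = 5, 995

p_test = a / (a + c)       # 0.010
p_control = b / (b + d)    # 0.005

rr = p_test / p_control           # relative risk
odds_ratio = (a / c) / (b / d)    # = ad / (bc)

print(round(rr, 3))          # 2.0
print(round(odds_ratio, 3))  # 2.01 - close to the RR because events are rare
```

Reversing the roles of events and non-events gives (c/a) / (d/b) = bc/ad, the inverse of the odds ratio, which is the symmetry property discussed in the text.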
Why should one use the odds ratio?
The calculation of an odds ratio (OR) may seem rather perverse, given that we can calculate the relative risk directly from the 2 × 2 table and the odds ratio is only an approximation of it. However, the OR appears quite often in the literature, so it is important to be aware of it. It has certain mathematical properties that render it attractive as an alternative to the RR as a summary measure. Indeed, some statisticians argue that the odds ratio is the natural parameter and the relative risk merely an approximation. The OR features in logistic regression and as a natural summary measure for case–control studies.
One point about the OR that can be seen immediately from the formula is that the OR for Failure as opposed to the OR for Success in Table 2.3 is given by OR = bc/ad. Thus the OR for Failure is just the inverse of the OR for Success.
Thus in the cannabis and psychosis study, the odds ratio of not developing psychosis for the cannabis group is 1/1.79 = 0.56. In contrast, the relative risk of not developing psychosis is (1 − 0.26)/(1 − 0.16) = 0.88, which is not the same as the inverse of the relative risk of developing psychosis for the cannabis group, which is 1/1.625 = 0.62.
This symmetry of interpretation of the OR is one of the reasons for its continued use.
Relative value unit
Health insurance: a comparative financial unit that may sometimes be used instead of dollar amounts in a surgical schedule; the number of units is multiplied by a conversion factor to arrive at the surgical benefit to be paid.
Relative Values for Integrative Healthcare
Over the years you have come to rely upon Relative Value Studies, Inc.'s expertise in providing relative value unit publications. Relative Values for Integrative Healthcare now joins our relative value publications.
We are very excited about this opportunity to serve the alternative and nursing health care market. Relative Values for Integrative Healthcare will combine unit value information from RVSI with the coding nomenclature.
The best way to fully understand reimbursement and practice management is to receive step-by-step training. Whether you are a beginner or use relative values every day, Relative Values for Integrative Healthcare will help take the guesswork out of the reimbursement process and make office tasks easier.
The coding nomenclature provides the only patented coding system for the accurate coding of complementary and alternative medicine services. The integrative medicine codes are intended for use by health care providers, office managers, insurance companies, clearing houses, health care specialists, and consultants.
Relative Values for Integrative Healthcare is the most useful tool available for establishing, defending and negotiating fees for complementary and alternative medicine services, products and procedures.
Relative Values for Integrative Healthcare focuses exclusively on complementary and alternative medicine and nursing services, products and procedures.
Relative Values for Integrative Healthcare includes procedures and unit values for the following professions:
Acupuncture, Chiropractic, Holistic Medicine, Homeopathy, Massage Therapy, Midwifery, Naturopathy, Nursing, Osteopathy
Just consider this extensive list of features that can help you tackle the business concerns of any alternative medicine practice:
ABC codes and surveyed unit values directly related to alternative medicine, nursing, and other integrative healthcare;
Establishes and analyzes alternative medicine, nursing, and other integrative healthcare fees based on measurable criteria;
Step-by-step instructions for performing practice management and reimbursement tasks (such as productivity measurement and cost of practice);
Develops defensible and justifiable fee schedules;
Simplifies negotiations with third-party payers;
Based on reliable, ongoing research using qualified provider surveys;
System flexibility to keep you in control of your practice; and
Available in the format best suited to your office needs.
The code designed specifically for integrative healthcare.
Dynamic analysis
Since program comprehension is so expensive, the development of techniques and tools that support this activity can significantly increase the overall efficiency of software development. The literature offers many such techniques: examples include execution trace analysis, architecture reconstruction, and feature location (an activity that involves linking functionalities to source code). Most approaches can be broken down into static and dynamic analyses (and combinations thereof).
Static approaches typically concern (semi)automatic analyses of source code. An important advantage of static analysis is its completeness: a system’s source code essentially represents a full description of the system. One of the major drawbacks is that static analyses often do not capture the system’s behavioral aspects: in objectoriented code, for example, occurrences of late binding and polymorphism are difficult to grasp if runtime information is missing.
The focus of this thesis, on the other hand, is dynamic analysis, which concerns a system’s runtime execution. It is defined by Ball (1999) as “the analysis of the properties of a running software system”. A specification of the properties at hand has been purposely omitted to allow the definition to apply to multiple problem domains. Figure 1.1 shows an overview of the main steps in dynamic analysis: it typically comprises the analysis of a system’s execution through interpretation (e.g., using the Virtual Machine in Java) or instrumentation (e.g., using AspectJ (Kiczales et al., 2001)). The resulting data, often in the form of execution traces, can be used for such purposes as reverse engineering and debugging. Program comprehension constitutes one such purpose, and over the years numerous dynamic analysis approaches have been proposed in this context, resulting in a broad spectrum of different techniques and tools.
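As a toy illustration of instrumentation-based dynamic analysis, Python's sys.settrace hook can record an execution trace of function calls (the traced functions here are, of course, invented):

```python
import sys

calls = []

def tracer(frame, event, arg):
    """Record the name of every function entered while tracing is active."""
    if event == "call":
        calls.append(frame.f_code.co_name)
    return tracer

def parse(data):
    return data.split(",")

def run():
    return parse("a,b,c")

sys.settrace(tracer)   # instrumentation on
run()
sys.settrace(None)     # instrumentation off

print(calls)  # ['run', 'parse']
```

Even this tiny example hints at the scalability concern discussed below: every single call event is recorded, so traces of real systems grow very large very quickly.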
Since the definition of dynamic analysis is rather abstract, we shall elaborate on the benefits and limitations of dynamic analysis for program comprehension in particular. The advantages that we consider are:
· The precision with regard to the actual behavior of the software system, for example, in the context of object-oriented software with its late binding mechanism (Ball, 1999).
· The fact that a goaloriented strategy can be used, which entails the definition of an execution scenario such that only the parts of interest of the software system are analyzed (Koenemann and Robertson, 1991; Zaidman, 2006).
The drawbacks that we distinguish are:
· The inherent incompleteness of dynamic analysis, as the behavior or execution traces under analysis capture only a small fraction of the usually infinite execution domain of the program under study (Ball, 1999). Note that the same limitation applies to software testing.
· The difficulty of determining which scenarios to execute in order to trigger the program elements of interest. In practice, test suites can be used, or recorded executions involving user interaction with the system (Ball, 1999).
· The limited scalability of dynamic analysis, due to the large amounts of data that may be produced, which affects performance, storage, and the cognitive load humans can deal with (Zaidman, 2006).
· The observer effect, i.e., the phenomenon in which software acts differently when under observation, might pose a problem in multithreaded or multiprocess software because of timing issues (Andrews, 1997).
In order to deal with these limitations, many techniques propose abstractions or heuristics that allow the grouping of program points or execution points that share certain properties, resulting in higher-level representations of the software. In such cases, a trade-off must be made between recall (are we missing any relevant program points?) and precision (are the program points we direct the user to indeed relevant to his or her comprehension problem?).
References:
1. Machin D., Campbell M. J., Walters S. J. Medical Statistics: A Textbook for the Health Sciences. – John Wiley & Sons, Ltd., 2007. – 346 p.
2. Tintle N., Chance B., Cobb G., Rossman A., Roy S., Swanson T., VanderStoep J. Introduction to Statistical Investigations. – UCSD BIEB100, Winter 2013. – 540 p.
3. Armitage P., Berry G., Matthews J. Statistical Methods in Medical Research. – Blackwell Science, 2002. – 826 p.
4. Winner L. Introduction to Biostatistics. – Department of Statistics, University of Florida, July 8, 2004. – 204 p.
5. Weiss N. A. Elementary Statistics / biographies by Carol A. Weiss. – 8th ed., 2012. – 774 p.