Diagnosing

The single skill which most clearly differentiates doctors from members of other clinical professions is their expertise in diagnosis. Only when we have made a diagnosis can we apply effective treatment - if there is any.

When we first meet a patient we are rarely, if ever, certain of the diagnosis and we have to carry out tests. A 'test' in this context is any item of information that can help us. It includes items from the history and observations from the clinical examination, as well as such things as laboratory blood tests and X-ray examinations.

Unfortunately tests are rarely perfect: some people with a positive result will not have the disease, and conversely some with a negative result will have it. This is why interpretation may be difficult.

Before we can describe how good a test is, we need to decide what outcome we wish it to predict. We call this the 'reference standard'. Tests are applied to populations; for obstetricians the population is often pregnant women, so their reference standards might include perinatal death or Down's syndrome, or less important endpoints such as low birthweight or low Apgar scores. If our test is only validated against such less important endpoints, we must always remember that these are not what we are really interested in.

Consider first a test that is either positive or negative and a reference standard that is present or absent. Let us imagine 100 pregnant women undergoing a test of fetal condition. The test is designed to predict whether the baby will live or die. If we apply the test to all 100 women and wait to see what happens, we might get a result like that shown in Fig. 31.1.

This is the information that a scientist will have after completing research on a test. It is often called a 2 x 2 table. We can use such a table both to see how well the test performs, and to see what the risk of a bad outcome is for an individual with a particular result.

Test performance

By counting vertically in Fig. 31.1 we can see what proportion of the babies who did die (reference standard positive) were detected by the test (Fig. 31.2a). This is the sensitivity or true positive rate (TPR). For this hypothetical test it is 90%. We can also see how many of the babies that lived were correctly predicted by the test (Fig. 31.2b). This is the specificity or true negative rate (TNR). For our hypothetical test it is 80%.

             Reference standard +ve   Reference standard -ve
             (baby dies)              (baby lives)
Test +ve     18                       16                        34
Test -ve      2                       64                        66
             20                       80                       100

Fig. 31.1 "Two by two" table of test results.

(a)          Reference standard +ve   Reference standard -ve
             (baby dies)              (baby lives)
Test +ve     18                       16                        34
Test -ve      2                       64                        66
             20                       80                       100
             TPR 18/20 = 90%

(b)          Reference standard +ve   Reference standard -ve
             (baby dies)              (baby lives)
Test +ve     18                       16                        34
Test -ve      2                       64                        66
             20                       80                       100
                                      TNR 64/80 = 80%

Fig. 31.2 (a) Sensitivity = True Positive Rate. (b) Specificity = True Negative Rate.
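For readers who like to check such figures, here is a minimal sketch in Python (the variable names are ours, chosen for illustration) that reproduces the two rates from the counts in Fig. 31.1:

```python
# Counts from Fig. 31.1 (hypothetical test of fetal condition).
true_pos = 18   # test +ve, baby died
false_pos = 16  # test +ve, baby lived
false_neg = 2   # test -ve, baby died
true_neg = 64   # test -ve, baby lived

# Read the table vertically: rates among babies who died or lived.
sensitivity = true_pos / (true_pos + false_neg)  # TPR: 18/20 = 0.90
specificity = true_neg / (true_neg + false_pos)  # TNR: 64/80 = 0.80

print(f"Sensitivity (TPR) = {sensitivity:.0%}")  # 90%
print(f"Specificity (TNR) = {specificity:.0%}")  # 80%
```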

These test characteristics do not vary with the prevalence of the disease and are thus a stable measure of test performance. However, they do vary with the cut-off value at which we call a test 'positive' or 'negative'. We will look at both these effects below, but first let us look at the results in another way, to predict an individual patient's risk.

Predicting an individual patient's risk

Sensitivity (TPR) and specificity (TNR) describe to the scientist how well the test works, but they are not of much use to the doctor who knows the test result (positive or negative) and wants to know how likely it is that the patient has the disease - or, in our example, that the baby will die. To estimate this we read Fig. 31.1 horizontally rather than vertically.

(a)          Reference standard +ve   Reference standard -ve
Test +ve     18                       16                        34   PVP 18/34 = 53%
Test -ve      2                       64                        66
             20                       80                       100

(b)          Reference standard +ve   Reference standard -ve
Test +ve     18                       16                        34
Test -ve      2                       64                        66   PVN 64/66 = 97%
             20                       80                       100

Fig. 31.3 (a) Predictive value positive. (b) Predictive value negative.

First let us look at all the women with a positive result (Fig. 31.3a). Eighteen of the 34 babies actually died, so the predictive value of a positive result is 18 out of 34 or 53%. Note that the predictive value positive is NOT the same as the sensitivity (TPR), which we have already noted was 90%. (Although changing word order sometimes alters meaning in English, predictive value positive is exactly the same as positive predictive value.)

Of those who had a negative test (Fig. 31.3b), 64 out of 66 babies eventually survived. The predictive value of a negative result is thus 64 out of 66 or 97%. Again this is NOT the same as the specificity (TNR). We will see later that the predictive value of a test varies with the prevalence of the disease in the population.
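Read horizontally, the same counts give the predictive values; again a minimal sketch with our own variable names:

```python
# Counts from Fig. 31.1 again, now read horizontally by test result.
true_pos, false_pos = 18, 16   # the 34 test-positive women
false_neg, true_neg = 2, 64    # the 66 test-negative women

ppv = true_pos / (true_pos + false_pos)  # PVP: 18/34, about 0.53
npv = true_neg / (true_neg + false_neg)  # PVN: 64/66, about 0.97

print(f"Predictive value positive = {ppv:.0%}")  # 53%
print(f"Predictive value negative = {npv:.0%}")  # 97%
```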

The effect of varying the test cut-off point

So far we have been discussing 'dichotomous' tests, that is, tests that are simply either positive or negative. Most tests, however, give a range of results from strongly positive to strongly negative or, let us imagine for our test, a result between 0 and 100. Figure 31.4a shows the distribution of results for patients with and without disease, that is, 'reference standard' positive and negative. If we take a cut-off value of 40 and call higher values 'positive' and lower values 'negative', we get the same test characteristics as before.

Imagine what would happen if we varied the cut-off. The performance rates would change (Fig. 31.4b). We could, for example, move the cut-off value right up to 75 so that all the results were negative. The false negative and true negative rates would then both be 100%. If we moved the cut-off down to a value of 5 so that all the results were called positive, the true and false positive rates would both be 100% (Fig. 31.4c).

[Figure: distributions of test results for reference standard positive and negative patients at cut-off values of 40, 75 and 5, and the resulting plot of true positive rate against false positive rate.]

Fig. 31.4 (a) The original cut-off value of 40. (b) Varying the cut-off value: high - no false positives but sensitivity zero. (c) Varying the cut-off value: low - no false negatives but specificity zero. (d) Receiver Operator Characteristic (ROC) curve.

To see what is happening we can plot how the true and false positive rates vary with the cut-off level (Fig. 31.4d). We call the result a receiver operator characteristic or ROC curve. Figure 31.5a-d shows the 2 x 2 tables for a perfect and a useless test and their corresponding ROC curves.
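As an illustration only, the points of a ROC curve can be generated by sweeping the cut-off across the range of results; the scores below are invented, not the data behind Fig. 31.4:

```python
# Hypothetical test scores (0-100); higher = more abnormal.
diseased = [55, 62, 48, 71, 45, 66, 58, 80, 52, 60]  # reference standard +ve
healthy  = [30, 42, 25, 38, 47, 20, 35, 51, 28, 33]  # reference standard -ve

# Sweep the cut-off; at each value count true and false positives.
for cutoff in range(0, 101, 10):
    tpr = sum(score >= cutoff for score in diseased) / len(diseased)
    fpr = sum(score >= cutoff for score in healthy) / len(healthy)
    print(f"cut-off {cutoff:3d}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")

# Plotting TPR against FPR for all cut-offs traces the ROC curve:
# a very low cut-off gives the corner (1, 1); a very high one gives (0, 0).
```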

The name Receiver Operator Characteristic arose during the Battle of Britain in 1940 when the German air force regularly flew bombing raids over London and Southern England. The British had just developed an early version of RADAR, which was only a moderately good test for discriminating between reference standard positive - a flight of bombers - and reference standard negative - a flock of seagulls. The operators soon learned that by setting their dials to be very sensitive they never missed any incoming bombers but often called out their own pilots unnecessarily to defend against seagulls. If they set the dials to be more specific there were fewer false alarms but occasionally they would fail to detect the bombers in time. They called the discriminatory ability of each RADAR machine its 'receiver operator characteristic' (ROC) and the name stuck.

THE EFFECT OF DIFFERENT DISEASE PREVALENCE ON THE INTERPRETATION OF TEST RESULTS

The predictive value of a test result varies with the disease prevalence. As clinicians often say, 'Common things are common'.

Bayes' theorem and likelihood ratios

Calculating the predictive value of a test result directly from the 2 x 2 table, as we did earlier, is only possible if the patient in front of you comes from a population with the same disease prevalence as that of the patients on whom the test was originally developed. This is rarely the case. Tests are frequently developed on high-risk patients in teaching hospitals, and then applied to patients in a low-risk practice. Occasionally the reverse may happen, if we apply a test to someone we already know to be at particularly high risk.

(a)          Reference standard +ve   Reference standard -ve
Test +ve     18                       16                        34
Test -ve      2                       64                        66
             20                       80                       100
             TPR = 90%                FPR = 20%

             Prior odds of disease:                        20:80
             Likelihood ratio of a positive test result:   90:20
             Posterior odds of disease:                    18:16

(b)          Prior odds of disease:                        1:4
             Likelihood ratio of a positive test result:   9:2
             Posterior odds of disease:                    9:8
             (the predictive value positive expressed as odds)

             1:4 x 9:2 = 9:8
             Prior odds x likelihood ratio = posterior odds

Fig. 31.6 (a) Bayes' theorem. (b) Bayes' theorem (ratios simplified).

Doctors need a way of calculating the predictive value of a test from the information that they are likely to have available, namely the prevalence of the disease in their population and the test characteristics.

It can be done! We will change all our rates to odds for the following section since this makes the mathematics easier. A risk of 1 in 4 is the same as odds of 1:3. A risk of 90% is the same as odds of 90:10 or 9:1.

Let us return now to our original 2 x 2 table, redrawn to show the crucial information (Fig. 31.6a,b). These two figures are identical except that in Fig. 31.6b the ratios have been simplified (divided through by their highest common factor). The top row gives us the information we need - the predictive value positive. With a positive test the odds of having the disease are 18:16 or 9:8 (or slightly better than evens). The disease prevalence gives us the odds before the test result, that is 20:80 or 1:4. We call these the prior odds. What we need from the test is a single measure of its diagnostic value, one which changes the prior odds of 1:4 into the posterior odds of 9:8.

The answer is the ratio between the true positive rate (90%) and the false positive rate (20%). This is called the likelihood ratio of a positive result (LR +ve) because it is the ratio between the likelihood of a positive result if the disease is present and the likelihood of a positive result if the disease is absent. The LR +ve for this test is thus 90:20, or 9:2. The relationship between these three figures is:

Prior odds x likelihood ratio = posterior odds.

This is one of the most famous theorems in statistics, named Bayes' theorem after its author, the Reverend Thomas Bayes. It comes in a number of different versions but this one, the odds-likelihood ratio form, is the most useful for doctors.
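As a minimal sketch (the helper functions are ours, for illustration), the whole calculation can be written in a few lines of Python; note how the same test gives a much lower predictive value in a lower-prevalence population, as discussed above:

```python
def prob_to_odds(p):
    """Convert a risk (probability) to odds, e.g. 0.25 -> 1:3 (0.333...)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Convert odds back to a probability, e.g. 9/8 -> about 0.53."""
    return odds / (1 + odds)

# Our hypothetical test: prevalence 20%, TPR 90%, FPR 20%.
prior_odds = prob_to_odds(0.20)   # 1:4, i.e. 0.25
lr_positive = 0.90 / 0.20         # 9:2, i.e. 4.5

posterior_odds = prior_odds * lr_positive           # 9:8, i.e. 1.125
print(f"PVP = {odds_to_prob(posterior_odds):.0%}")  # 53%, as in Fig. 31.3a

# The same test applied in a lower-risk population (prevalence 2%):
low_prior = prob_to_odds(0.02)
print(f"PVP at 2% prevalence = {odds_to_prob(low_prior * lr_positive):.0%}")  # 8%
```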

A dichotomous test (one that is either positive or negative) will have two likelihood ratios, one for a positive result and one for a negative result. If the test has a range of results the LR will vary also with the cut-off point chosen. It is possible to plot ROC curves as likelihood ratios.

If the doctor knows the LR for the test result and the prior odds, calculation of the individual patient's risk is easy. Here is an example.

Bayes and pregnancy tests; the Bayesian's wife

My wife felt pregnant but had a negative pregnancy test. Her doctor told her that 10% of pregnant women had a negative result on first testing but she was still disappointed. I calculated the probabilities as follows. I asked her how sure she had been that she was pregnant before the test and she said 95% certain. This gave prior odds of pregnancy of 19:1 which I rounded to 20:1. I assumed that the false positive rate was virtually nil so I got a LR for a negative test of 1:10 (likelihood of a negative result given pregnancy = 10%; likelihood of a negative result given no pregnancy = 100%). This gave posterior odds that she was in fact pregnant of 20:1 x 1:10 = 2:1 and I converted this for her benefit into a probability of two in three. This estimate cheered her up and a couple of weeks later the pregnancy was confirmed. Adapted from [2].
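For readers who want to check the arithmetic of the story, a self-contained sketch:

```python
# Prior: 95% certain of pregnancy -> odds 19:1, rounded to 20:1.
prior_odds = 20 / 1

# LR of a negative test: P(-ve | pregnant) = 10%, and P(-ve | not pregnant)
# assumed to be 100% (no false positives), so LR -ve = 0.10 / 1.00.
lr_negative = 0.10 / 1.00

posterior_odds = prior_odds * lr_negative          # 2:1
probability = posterior_odds / (1 + posterior_odds)
print(f"Probability still pregnant = {probability:.2f}")  # 0.67, two in three
```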

Treating

Effective treatments

Some treatments, such as insulin for diabetes, salpingectomy for ruptured tubal pregnancy, or in vitro fertilization for bilateral tubal blockage, work so much better than the alternatives that no one doubted early reports of their effectiveness. Few treatments, however, are so clear cut. More commonly, disease definitions are hazy, the prognosis is variable no matter what is done, and treatment is only partially effective. How is the practising doctor to choose?

The best way to measure the effectiveness of a treatment is to compare a group of patients given the treatment (the intervention or treatment group), with another group not given it (the control group). If there is already an established treatment for the disease, the control group will usually be given this treatment and the experimental group the new treatment. If the two groups are otherwise similar, and the treated patients have better outcomes, we conclude that the treatment was effective.

Difficulties arise if the two groups were not really comparable, or the play of chance misled us.

Bias

We call a systematic difference between the groups bias. It cannot be eliminated simply by increasing the sample size. There are many potential sources of bias.

If a new treatment is compared with treatments used in the past, the comparison will be subject to bias if other aspects of patient care are improving. Comparisons between patients treated by different doctors or in different hospitals are often biased. The doctor evaluating the treatment is often an expert whose patients will do better than those treated by other doctors. Conversely an expert may be referred the more difficult cases, biasing the results against the new treatment. Yet again, an expert with special diagnostic skills may make the diagnosis in milder cases, thus biasing results in favour of the new treatment.

Even if the same doctor administers the treatment, a non-randomized comparison of treatments may mislead. For example, comparisons of women who underwent amniotomy early in labour with those who did not will be bedevilled by subtle differences between the groups resulting from the reasons they underwent the amniotomy.

In theory, if we know what factors influence the outcome of treatment, the cases and controls can be matched for these factors to make the two groups similar. For example, cases of cancer can be matched for the stage of the disease. A staging procedure, however, may not cover all the variation in, say, the spread of a cancer. Two cancers may both be at the same stage but one may be a lot faster growing than the other. If the larger ones within each stage were more likely to be in the control group, the trial would be biased in favour of the treatment. In practice we rarely know about all the possible factors that might affect outcome. An unknown factor can still bias results if it was unequally distributed between groups.

Bias must have affected many non-randomized studies of the effect of hormone replacement therapy in the 1980s and 1990s. Even after adjustment for known risk factors, such studies had suggested that hormone replacement therapy reduced heart attacks and strokes. However, at least nine randomized controlled trials have now shown that it increases both [3]. The earlier results must have been due to bias in unknown risk factors.

The only way to ensure that two groups of patients are matched for unknown risk factors is to select them at random. This means using the toss of a coin, random number tables, or other forms of computer-generated random numbers, to select who gets the new treatment.
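A minimal sketch of such computer-generated allocation (the function name is ours; a real trial would use a dedicated randomization service so that, as discussed below, the allocation stays concealed until after each patient has been entered):

```python
import random

def allocate(n_patients, seed=None):
    """Randomly allocate each patient to the new treatment or the control arm."""
    rng = random.Random(seed)
    return [rng.choice(["new treatment", "control"]) for _ in range(n_patients)]

print(allocate(8, seed=1))  # e.g. ['control', 'new treatment', ...]
```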

Allocation to treatment or placebo groups alternately, by day of the week, or last digit of hospital number is unsatisfactory since the doctors know the group allocation when they enter the trial, giving them scope for entry bias. For example, a trial comparing oxytocin infusion with prostaglandin pessary, as treatment for pregnant women whose membranes ruptured before labour began, used allocation by the final digit of the hospital number and got a biased result. The reason was that some staff believed, rightly or wrongly, that prostaglandin was the better treatment if the cervix was unripe. Women with a soft open cervix were entered in the trial without problems, but those with an unripe cervix were entered only if the digit was even (allocation to prostaglandin). If the final digit was odd the staff knew the allocation would be to oxytocin and some therefore did not enter the patient into the trial at all. Instead they gave prostaglandin anyway outside the trial. Thus, women in the trial allocated to oxytocin had riper cervices. Not surprisingly prostaglandin appeared to give worse results.

Bias can also occur if patients are more likely to be excluded from the trial in one group than the other. For example, we might wish to compare two policies for dealing with the membranes in labour; leave them intact or rupture them. Women are allocated at random in early labour - half to have the membranes ruptured, and half left intact. Some women allocated to the membrane rupture group are likely to make slow progress and eventually get the membranes ruptured anyway. If we remove them from analysis we will bias results against the membrane rupture policy, because we will be removing women with difficult labours from the 'leave intact' group. The situation will be even worse if some women in the 'membrane rupture' group labour so quickly that there is no time to artificially rupture the membranes. If we exclude them from the 'membrane rupture' group we will again be biasing results against rupture, by excluding the most favourable cases from that group.

We avoid the problem by analysing the trial by 'intention to treat'. This means comparing the two groups as chance allocated them. It may seem strange to include the results for some women who had the membranes left intact in the 'membrane rupture' group and some women who had them ruptured in the 'leave intact' group, but there is no other way to avoid bias. Of course we must ensure that compliance is high, if we really want to find out if a drug or procedure can work.

Biased assessment of outcome

Even if the two groups of patients in a trial are exactly comparable, bias may creep in if the investigators recording the outcomes know what treatment was received. For example, it is widely believed that amniotomy causes neonatal infection. If the doctors caring for the babies know that the membranes were ruptured early in labour they might do more tests for infection, and make the belief a self-fulfilling prophecy. If the doctors are not told which group the patient is in, they are said to be blinded, and their assessment of outcome cannot be influenced.

If the patient believes that the treatment she has received works, that may also result in a self-fulfilling prophecy, the placebo effect. Trials of many treatments for premenstrual tension have shown high cure rates among patients given inactive medicine. Usually the active treatment has been no better, and investigators have concluded that the treatment was ineffective. If no placebo had been used, the treatment might have been wrongly given the credit for the improvement. Patients are also said to be 'blinded' when they do not know which treatment they are receiving. A trial is said to be 'double blind' when neither the patient nor the investigator knows which treatment is being given.

Blinding is not always necessary. If the end point is unambiguous - like death, or Caesarean section - blinding at the stage of outcome assessment is unnecessary, although it may still be necessary during the treatment period. If blinding is not possible, but the outcome is susceptible to observer bias, it may be possible to eliminate it by having independent observers make the assessment.

Statistical problems - avoiding being misled by the play of chance

Random variation differs from bias in that increasing the sample size will reduce it. There are two ways that the play of chance can mislead.

1 We may think there is a difference, when all we saw was chance variation. This is called a type 1, or alpha, error.

2 We may fail to realize that there is a real difference. This is called a type 2, or beta, error.

A number of statistical techniques are used to help avoid making these errors.

TYPE 1 ERROR

The results of an experiment usually take the form of a series of observations on a treatment and control group. These may be continuous measures such as birthweights or dichotomous variables such as death rates. We will concentrate on dichotomous outcomes.

If the mortality rates are identical between treatment and control groups we won't be misled into a false belief that the treatment was effective. The difficulty arises when there is a difference and we want to know whether it occurred by chance or was caused by the treatment. Most statistical tests tell us how likely it was to occur by chance. This is the familiar P-value. By convention we call P < 0.05 'statistically significant'. This means that there was a less than 5% chance that the difference we observed would occur by chance if the treatments were equally effective.

Although hallowed by tradition, this method of presenting results simply tells us that we are unlikely to have made a type 1 error. If P is greater than 0.05 we do not know whether there really is no important difference, or whether our trial was too small. Nor does the P-value tell us the likely size of any difference. A very small and clinically unimportant difference may be statistically highly significant if a large study has been performed.
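As an illustration with invented counts (the function name is ours; this is the ordinary chi-squared test for a 2 x 2 table, without continuity correction), a P-value can be computed as follows:

```python
import math

def chi_square_p(a, b, c, d):
    """P-value from the chi-squared test (1 df, no continuity correction)
    for a 2x2 table: a, b = deaths, survivors on treatment;
    c, d = deaths, survivors on control."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-squared > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Invented example: 10/100 deaths on treatment versus 20/100 on control.
print(f"P = {chi_square_p(10, 90, 20, 80):.3f}")  # about 0.048, just under 0.05
```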

TYPE 2 ERROR

Here we are concerned with the probability that a trial has failed to show a real effect, that is, a false negative result. The size of effect that it would matter to miss is a clinically meaningful effect - one that would cause doctors to change treatment. The probability of a negative result depends on the size of this minimum treatment effect we wish to detect: a particular trial can exclude a large effect more reliably than a small one. We can measure the probability of failing to detect a particular size of effect. We call this the type 2, or beta, error. The complement of the type 2 error - the probability that the trial will indeed detect an effect of this size if it is really there - is called the power of the trial.
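To see how the required trial size follows from the minimum effect, the chosen alpha and the desired power, here is a rough sketch of the standard two-proportion sample-size formula (the function name is ours; 1.96 and 0.84 are the usual normal quantiles for a two-sided 5% type 1 error and 80% power):

```python
import math

def n_per_group(p_control, p_treated, z_alpha=1.96, z_beta=0.84):
    """Approximate patients needed per group to detect the difference
    between two proportions with the stated type 1 and type 2 errors."""
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return math.ceil((z_alpha + z_beta) ** 2 * variance
                     / (p_control - p_treated) ** 2)

# To detect a fall in mortality from 20% to 10%:
print(n_per_group(0.20, 0.10))  # 196, i.e. roughly 200 women per group
```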

ODDS RATIOS AND THEIR CONFIDENCE INTERVALS

A good way to look at treatment effects is to consider the risks, or 'odds', of the bad outcome in the treated group versus the control group. The effect of the treatment is given in the form of a relative risk (RR) or an odds ratio (OR).

An OR of 1 indicates that the treatment does not alter the adverse outcome. An OR below 1 indicates that the chances of adverse outcome are reduced by treatment. An OR above 1 indicates that treatment increases these odds. An OR of 0.5 indicates a halving of the odds and an OR of 2 a doubling.

The 95% confidence intervals (CI) can be calculated. An OR of 0.5 with a 95% CI of 0.25-1 indicates that we have found a halving of the odds of the outcome and can be 95% confident that the true effect lies between an OR of 0.25 and an OR of 1. This corresponds to a P-value of 0.05, since the confidence interval just includes 1.
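A sketch of the common Woolf (log odds ratio) method for computing an OR and its 95% CI from a 2 x 2 table (the function name is ours; Fig. 31.7 below uses the related Peto method, which gives almost identical answers for these data):

```python
import math

def odds_ratio_ci(a, b, c, d):
    """OR with 95% CI for a 2x2 table:
       a, b = bad, good outcomes on treatment;
       c, d = bad, good outcomes on control."""
    or_ = (a / b) / (c / d)
    se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se_log_or)
    hi = math.exp(math.log(or_) + 1.96 * se_log_or)
    return or_, lo, hi

# E.g. the AUCKLAND 1972 row of Fig. 31.7: 36/532 vs 60/538 deaths.
or_, lo, hi = odds_ratio_ci(36, 532 - 36, 60, 538 - 60)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")  # 0.58, 0.38 to 0.89
```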

Review: Prophylactic corticosteroids for preterm birth
Comparison: 01 Corticosteroids versus placebo or no treatment
Outcome: 02 Neonatal death

Study                 Treatment   Control   Weight   Peto odds ratio
                      n/N         n/N       (%)      [95% CI]

01 Neonatal death (all babies)
AMSTERDAM 1980        3/64        12/58     4.4      0.23 [0.08, 0.67]
AUCKLAND 1972         36/532      60/538    29.3     0.58 [0.38, 0.89]
BLOCK 1977            1/69        5/61      1.9      0.22 [0.04, 1.12]
DORAN 1980            4/81        11/63     4.5      0.26 [0.09, 0.77]
GAMSU 1980            14/131      20/137    10.0     0.70 [0.34, 1.44]
GARITE 1992           9/40        11/42     5.1      0.82 [0.30, 2.24]
KARI 1994             6/95        9/94      4.7      0.64 [0.22, 1.84]
MORALES 1986          7/121       13/124    6.2      0.54 [0.22, 1.33]
MORRISON 1978         3/67        7/59      3.1      0.37 [0.10, 1.33]
PAPAGEORGIOU 1979     1/71        7/75      2.6      0.22 [0.05, 0.91]
PARSONS 1988          0/23        1/22      0.3      0.13 [0.00, 6.52]
SCHMIDT 1984          5/49        4/31      2.6      0.77 [0.19, 3.15]
TAUESCH 1979          8/56        10/71     5.1      1.02 [0.37, 2.76]
US STEROID TRIAL      32/371      34/372    20.2     0.94 [0.57, 1.56]
Subtotal (95% CI)     1770        1747      100.0    0.60 [0.48, 0.75]
Total events: 129 (Treatment), 204 (Control)
Test for heterogeneity chi-square = 14.70, df = 13, p = 0.33, I² = 11.6%
Test for overall effect z = 4.42, p = 0.00001

02 Neonatal death in babies treated before 1980
AMSTERDAM 1980        3/64        12/58     7.3      0.23 [0.08, 0.67]
AUCKLAND 1972         36/532      60/538    48.1     0.58 [0.38, 0.89]
BLOCK 1977            1/69        5/61      3.2      0.22 [0.04, 1.12]
DORAN 1980            4/81        11/63     7.3      0.26 [0.09, 0.77]
GAMSU 1980            14/131      20/137    16.4     0.70 [0.34, 1.44]
MORRISON 1978         3/67        7/59      5.1      0.37 [0.10, 1.33]
PAPAGEORGIOU 1979     1/71        7/75      4.2      0.22 [0.05, 0.91]
TAUESCH 1979          8/56        10/71     8.4      1.02 [0.37, 2.76]
Subtotal (95% CI)     1071        1062      100.0    0.51 [0.38, 0.68]
Total events: 70 (Treatment), 132 (Control)
Test for heterogeneity chi-square = 9.20, df = 7, p = 0.24, I² = 23.9%
Test for overall effect z = 4.60, p < 0.00001

03 Neonatal death in babies treated after 1980
GARITE 1992           9/40        11/42     13.1     0.82 [0.30, 2.24]
KARI 1994             6/95        9/94      11.9     0.64 [0.22, 1.84]
MORALES 1986          7/121       13/124    15.8     0.54 [0.22, 1.33]
PARSONS 1988          0/23        1/22      0.9      0.13 [0.00, 6.52]
SCHMIDT 1984          5/49        4/31      6.6      0.77 [0.19, 3.15]
US STEROID TRIAL      32/371      34/372    51.7     0.94 [0.57, 1.56]
Subtotal (95% CI)     699         685       100.0    0.78 [0.54, 1.12]
Total events: 59 (Treatment), 72 (Control)
Test for heterogeneity chi-square = 2.12, df = 5, p = 0.83, I² = 0.0%
Test for overall effect z = 1.33, p = 0.2

Fig. 31.7 Systematic review of randomized trials of the effect of corticosteroids for preterm birth on neonatal death.

If the 95% CI does not include 1, the P-value is less than 0.05, that is, the result is statistically significant. If the 95% CI includes 1, the P-value is greater than 0.05, that is, the result is not statistically significant.

META-ANALYSIS

Type 2 errors are minimized by very large trials. Unfortunately these are difficult and expensive to perform. One solution is to combine the results of small and medium-sized trials. The results can be conveniently presented as ORs for each trial separately, and then beneath them the typical OR and CI for all the trials combined - the meta-analysis.

Figure 31.7, a version of one of the most famous meta-analysis graphs in obstetrics, is reproduced from the Cochrane Collaboration review of the effect of corticosteroids to prevent neonatal death from respiratory distress syndrome after preterm labour. This type of graphical representation of the results of all the randomized trials of a particular intervention in a disease is now one of the most popular ways to summarize and present the best evidence of effectiveness.
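As an illustration of the principle only (the figure itself uses the Peto method), here is a sketch of the simpler inverse-variance fixed-effect method, which weights each trial's log odds ratio by the inverse of its variance; the counts are taken from subgroup 03 of Fig. 31.7:

```python
import math

def log_or_and_var(a, b, c, d):
    """Log odds ratio and its variance for one trial's 2x2 table
    (a, b = deaths, survivors on treatment; c, d = on control)."""
    return math.log((a / b) / (c / d)), 1/a + 1/b + 1/c + 1/d

# Deaths/total on treatment and control, from subgroup 03 of Fig. 31.7.
# PARSONS 1988 is omitted because its zero cell breaks this simple formula.
trials = [(9, 40, 11, 42), (6, 95, 9, 94), (7, 121, 13, 124),
          (5, 49, 4, 31), (32, 371, 34, 372)]

weighted_sum = total_weight = 0.0
for deaths_t, n_t, deaths_c, n_c in trials:
    log_or, var = log_or_and_var(deaths_t, n_t - deaths_t,
                                 deaths_c, n_c - deaths_c)
    weight = 1 / var                  # inverse-variance weight
    weighted_sum += weight * log_or
    total_weight += weight

pooled_log_or = weighted_sum / total_weight
half_width = 1.96 / math.sqrt(total_weight)   # 95% CI on the log scale
print(f"Pooled OR = {math.exp(pooled_log_or):.2f}, 95% CI "
      f"{math.exp(pooled_log_or - half_width):.2f} to "
      f"{math.exp(pooled_log_or + half_width):.2f}")
# Prints roughly 0.80 (0.55 to 1.15), close to the Peto subtotal
# of 0.78 [0.54, 1.12] shown in the figure.
```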
