Comparing the Accuracy of Two Binary Diagnostic Tests in a Paired Study Design

There are many medical tests for detecting the presence of a disease or condition. Some examples include tests for lesions, cancer, pregnancy, or COVID-19. While these tests are usually accurate, they’re not perfect. In addition, some tests are designed to detect the same condition, but use a different method. A recent example are PCR and antigen tests for COVID-19. In these cases we might want to compare the two tests on the same subjects. This is known as a paired study design. This can help us assess if one test is better than the other, or maybe just as good as the other. The latter may be of interest if an alternative test is cheaper or less painful or invasive.

In this article we discuss some approaches for carrying out such analyses in R.

We’ll look at data presented in the paper Comparing the Predictive Values of Diagnostic Tests: Sample Size and Analysis for Paired Study Designs by Moskowitz and Pepe (2006). In the paper the authors analyze data from the National Cystic Fibrosis Patient Registry. The goal is to compare two prognostic factors as tests for identifying subjects that will develop severe respiratory infection. The sample size is 11,960, of which 5,054 experienced severe respiratory infection.

The first prognostic factor, or test, is a positive culture for the bacterium Pseudomonas Aeruginosa, which we will call “PA” for the remainder of the article. In the two-way table below we present the results of the test (positive or negative) versus whether or not an infection occurred (presence or absence). We enter the data as a matrix and add names to the rows and columns using the dimnames() function. Finally we print the matrix with margin totals courtesy of the addmargins() function.


PA <- matrix(c(185, 4869, 123, 6783), ncol = 2)
dimnames(PA) <- list("Test" = c("Positive", "Negative"),
                      "Disease" = c("Present", "Absent"))
addmargins(PA)


            Disease
  Test       Present Absent   Sum
    Positive     185    123   308
    Negative    4869   6783 11652
    Sum         5054   6906 11960

Two common statistics calculated from such a table are sensitivity and specificity. Each estimate the probability of a correct test result given disease or condition status. Sensitivity is the proportion of positive tests out of everyone who has the disease or condition. Specificity is the proportion of negative tests out of everyone who does not have the disease or condition.

In our data:

sensitivity = 185/5054 = 0.037
specificity = 6783/6906 = 0.982

We can calculate these using matrix indexing as follows:


# sensitivity: 185/5054
PA[1,1]/sum(PA[1,1], PA[2,1])


  [1] 0.03660467


# specificity: 6783/6906
PA[2,2]/sum(PA[1,2], PA[2,2])


  [1] 0.9821894

It appears the PA test has about a 3% chance of being “positive” given someone with cystic fibrosis (CF) will develop a severe respiratory infection. That’s not good at all. On the other hand, it has about a 98% chance of being “negative” given someone with CF will not develop a severe respiratory infection.

We can flip sensitivity and specificity around and calculate two other statistics: Positive Predictive Value (PPV) and Negative Predictive Value (NPV). The first estimates the probability of having the disease or condition given a positive test result. The latter estimates the probability of not having the disease or condition given a negative test result.

In our data:

PPV = 185/308 = 0.601
NPV = 6783/11652 = 0.582

We can also calculate these using matrix indexing:


# PPV: 185/308
PA[1,1]/sum(PA[1,1], PA[1,2])


  [1] 0.6006494


# NPV: 6783/11562
PA[2,2]/sum(PA[2,1], PA[2,2])


  [1] 0.5821318

It appears there is a 60% chance someone with CF will develop a severe respiratory infection given a positive PA test. Likewise there is about a 58% chance someone with CF will not develop a respiratory infection given a negative PA test.

According to Moskowitz and Pepe (2006) sensitivity and specificity statistics tend to be used in the early stages of test development, while PPV and NPV are used in the later stages. At first we’re interested in the performance of the test given disease (sensitivity and specificity). But in the end – in a clinical setting – we’re more interested in the disease status given the test result (PPV and NPV).

We can quickly calculate these statistics in R using the acc.1test() function from the DTComPair package. To do this we need to supply the data as a tab.1test object, which we can do with the read.tab.1test() function. We enter the cell totals from our table above into four arguments:

a: The number of diseased subjects with a positive test.
b: The number of non-diseased subjects with a positive test.
c: The number of diseased subjects with a negative test.
d: The number of non-diseased subjects with a negative test.

We also supply the name of our test to the testname argument. Then we save the result to “PA_tab” and use the acc.1test() function to calculate our statistics.


# install.packages("DTComPair")
library(DTComPair)
PA_tab <- read.tab.1test(a = 185, b = 123, c = 4869, d = 6783, 
                          testname = "PA")
acc.1test(PA_tab)


  Diagnostic accuracy of test 'PA'
  
  (Estimates, standard errors and 95%-confidence intervals)
  
                    Est.          SE  Lower CL   Upper CL
  Sensitivity 0.03660467 0.002641514 0.0314274 0.04178194
  Specificity 0.98218940 0.001591562 0.9790700 0.98530881
  PPV         0.60064935 0.027906948 0.5459527 0.65534596
  NPV         0.58213182 0.004569094 0.5731766 0.59108708
  
             Est.    SE (log)  Lower CL  Upper CL
  PDLR  2.0552183 0.114860010 1.6409258 2.5741092
  NDLR  0.9808651 0.003184913 0.9747613 0.9870071

The output presents estimates for all four statistics along with standard errors and 95% confidence intervals. Let’s recap these interpretations one more time:

sensitivity: about 3.6% of subjects will get a positive bacterium test if they develop a respiratory infection. 95% CI: [3.1, 4.2]
specificity: about 98.2% of subjects will get a negative bacterium test if they do not develop a respiratory infection. 95% CI: [97.9, 98.5]
PPV: about 60% of subjects develop a respiratory infection given a positive bacterium test. 95% CI: [54.6, 65.5]
NPP: about 58% of subjects do not develop a respiratory infection given a negative bacterium test. 95% CI: [57.3, 59.1]

The other two estimates, PDLR and NDLR, are for the positive and negative diagnostic likelihood ratios. More information on these statistics, and the ones we just discussed, can be found on this helpful web page on the Columbia Public Health web site.

Getting back to our Cystic Fibrosis data, the second prognostic factor of interest is the occurrence of a prior respiratory infection. The paper refers to these as pulmonary exacerbations, or PExs. As we did before, we present the results of the test (positive or negative) versus whether or not an infection occurred (presence or absence) in a two-way table. In this case “positive” means the subject had a prior PEx, and “negative” means the subject did not have a prior PEx.


PEx <- matrix(c(3630, 1424, 1342, 5564), ncol = 2)
dimnames(PEx) <- list("Test" = c("Positive", "Negative"),
                      "Disease" = c("Present", "Absent"))
addmargins(PEx)


            Disease
  Test       Present Absent   Sum
    Positive    3630   1342  4972
    Negative    1424   5564  6988
    Sum         5054   6906 11960

And once again we can calculate statistics of interest using functions from the DTComPair package:


PEx_tab <- read.tab.1test(a = 3630, b = 1342, c = 1424, d = 5564, 
                          testname = "PEx")
acc.1test(PEx_tab)


  Diagnostic accuracy of test 'PEx'
  
  (Estimates, standard errors and 95%-confidence intervals)
  
                   Est.          SE  Lower CL  Upper CL
  Sensitivity 0.7182430 0.006327839 0.7058406 0.7306453
  Specificity 0.8056762 0.004761348 0.7963442 0.8150083
  PPV         0.7300885 0.006295539 0.7177495 0.7424275
  NPV         0.7962221 0.004818582 0.7867778 0.8056663
  
            Est.   SE (log) Lower CL  Upper CL
  PDLR  3.696115 0.02603793 3.512222 3.8896361
  NDLR  0.349715 0.02322303 0.334154 0.3660005

The PEx test appears to be superior to the PA test in terms of sensitivity, PPV and NPV.

Let’s say we wanted to formally compare the PPV statistics for each test. The PPV for the PA test is 0.60. The PPV for the PEx test is 0.73. The latter test seems better. It correctly identified 73% of the subjects with a respiratory infection based on them having a prior infection. The bacterium test only identified 60% of the subjects. Is this difference meaningful? Would we expect to see a difference like this again given a different sample?

We can statistically compare PPV and NPV values using the pv.rpv() function, also from the DTComPair package. Before we show how to do this it’s important to note that the following procedure is for paired data. That is, it assumes both tests were conducted on the same patients.

Unfortunately we cannot use our earlier data objects to carry out this analysis. Instead we need to set up our data as a paired data object. We can do that using the tab.paired() function. It requires three arguments:

d: A numeric vector specifying the gold-standard results (1 = presence of disease, 0 = absence of disease).
y1: A numeric vector specifying the results of diagnostic test 1 (1 = positive, 0 = negative).
y2: A numeric vector specifying the results of diagnostic test 2 (1 = positive, 0 = negative).

We first create vectors for each of these arguments. Since we have 5,054 subjects with infection, we have 11,096 - 5,054 = 6,906 without infection. We use the rep() function to create and repeat the necessary ones and zeroes.


d <- c(rep(1, 5054), rep(0, 6906))

Creating the y1 and y2 vectors take a little more work. We basically create a vector of ones and zeroes according to the original accuracy tables we created above. We start in the upper-left cell and work our way down the columns.


# PEx
y1 <- c(rep(1,3630), rep(0,1424), rep(1, 1342), rep(0,5564))
# PA
y2 <- c(rep(1,185), rep(0,4869), rep(1, 123), rep(0,6783))

We then use the tab.paired() function to create our object, which we save as “paired.d”. Notice we also use the testnames argument to label our tests. Printing the data object shows the paired data stratified by “Diseased” and “Non-diseased”. In our case “Diseased” corresponds to those with respiratory infections. Test 1 (PEx) is in the columns, Test 2 (PA) is in the rows. We see that, for example, 3,445 of the subjects with infection had a “positive” PEx test but a “negative” PA test.


paired.d <- tab.paired(d, y1, y2, testnames = c("PEx", "PA"))
paired.d


  Two binary diagnostic tests (paired design)
  
  Test1: 'PEx'
  Test2: 'PA'
  
  Diseased:
             Test1 pos. Test1 neg. Total
  Test2 pos.        185          0   185
  Test2 neg.       3445       1424  4869
  Total            3630       1424  5054
  
  Non-diseased:
             Test1 pos. Test1 neg. Total
  Test2 pos.        123          0   123
  Test2 neg.       1219       5564  6783
  Total            1342       5564  6906

With our paired data object created, we can carry out a formal statistical comparison between the PPV statistics for each test using the pv.rpv() function. The function name is an abbreviation of “Comparison of Predictive Values using Relative Predictive Values.” In other words, instead of comparing the absolute difference in PPV or NPV, it compares the relative difference. We save the result to an object named “rpv.test” and then we look at the ppv results by extracting the ppv component.


rpv.test <- pv.rpv(paired.d)
rpv.test$ppv


  $test1
  [1] 0.7300885
  
  $test2
  [1] 0.6006494
  
  $rppv
  [1] 1.215499
  
  $se.log.rppv
  [1] 0.04486625
  
  $lcl.rppv
  [1] 1.113177
  
  $ucl.rppv
  [1] 1.327225
  
  $test.statistic
  [1] 4.349694
  
  $p.value
  [1] 1.363278e-05

The output is a list of values. The first two elements, test1 and test2, list the PPV statistics for each test, which matches our calculations above. The third element is the relative PPV, which is 0.73/0.60 = 1.22. It says the Test 1 PPV (PEx) is about 22% higher than the Test 1 (PA) PPV. The next element is the estimated standard error of the relative PPV, and the following two elements provide a 95% confidence interval on the relative PPV: [1.11, 1.33] This suggests the PPV of the PEx prognostic factor is at least 11% higher than PA. Finally the last two elements return a test statistic and p-value for the null hypothesis test that the relative PPV is equal to 1 (Recall: a ratio of 1 implies no difference.) The p-value is small and provides strong evidence against the null. We conclude that if someone with cystic fibrosis receives a “positive” PEx test, they have a higher probability of developing a respiratory infection than if they receive a “positive” PA test.

We can also see results for the comparison of the two NPV statistics by extracting the npv element. We present the code but leave the exploration of the output to the reader.


rpv.test$npv

The details of these procedures are detailed in Moskowitz and Pepe (2006).

The DTComPair package also provides a function for testing absolute differences in NPV and PPV called pv.gs(). However it provides no confidence interval on the absolute difference between PPV or NPV.

To compare sensitivities and specificities, we use the McNemar Test. The name of the function for this procedure is sesp.mcnemar(). Once again it requires a paired data object. Below we run the function on our data and save to “mtest”. To see the results for “sensitivity”, call mtest$sensitivity. To see the results for specificity, call mtest$specificity. Each test is for the null that the absolute difference is 0. Small p-values provide evidence against the null. Below we report the results for the comparison of specificity. The difference is large and the p-value is virtually 0.


mtest <- sesp.mcnemar(paired.d)
mtest$specificity


  $test1
  [1] 0.8056762
  
  $test2
  [1] 0.9821894
  
  $diff
  [1] 0.1765132
  
  $test.statistic
  [1] 1219
  
  $p.value
  [1] 0

To get confidence intervals for differences in sensitivity and specificity, we can use the sesp.diff.ci() function. As before you need a paired data object. Below we demonstrate the function and extract the 95% confidence interval for specificity. The lower and upper limits are presented under diff.lcl and diff.ucl. The Wald method is used to calculate the confidence interval. (Other methods are available. Enter ?sesp.diff.ci in the R console to view the documentation.) It appears the difference in specificity between the PEx (test1) and PA (test2) prognostic factors is plausibly between 0.17 and 0.19.


ci <- sesp.diff.ci(paired.d)
ci$specificity


        test1       test2        diff     diff.se    diff.lcl    diff.ucl 
  0.805676224 0.982189401 0.176513177 0.004587791 0.167521272 0.185505082

Hopefully you now have a better understanding of how to compare the accuracy of two binary diagnostic tests in a paired study design.

R session details

The analysis was done using the R Statistical language (v4.5.1; R Core Team, 2025) on Windows 11 x64, using the packages PropCIs (v0.3.0) and DTComPair (v1.2.6).

References

Stock C, Hielscher T (2014). DTComPair: comparison of binary diagnostic tests in a paired study design. R package version 1.0.3. URL: http://CRAN.R-project.org/package=DTComPair
McNemar Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153-7.
Moskowitz CS, and Pepe MS (2006). Comparing the predictive values of diagnostic tests: sample size and analysis for paired study designs. Clin Trials, 3(3):272-9.
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
“Evaluating Risk Prediction with ROC Curves.” Columbia Public Health, accessed 30 March 2022, https://www.publichealth.columbia.edu/research/population-health-methods/evaluating-risk-prediction-roc-curves.

Clay Ford
Statistical Research Consultant
University of Virginia Library
March 30, 2022

For questions or clarifications regarding this article, contact statlab@virginia.edu.

View the entire collection of UVA Library StatLab articles, or learn how to cite.