Understanding Polychoric Correlation

Polychoric correlation is a measure of association between two ordered categorical variables, each assumed to represent a latent continuous variable, where the two latent variables have a bivariate standard normal distribution. When we say two variables have a bivariate standard normal distribution, we mean they’re both normally distributed with mean 0 and standard deviation 1, and that any association between them is linear. Polychoric correlation might be suitable for quantifying association between, say, questionnaire items that ask respondents to rate their satisfaction with various services on a scale of 1 to 5. In this scenario it may be plausible to imagine each ordered categorical “satisfaction” response as arising from a continuous latent variable that has been sliced into categories at a set of cutpoints, or thresholds.
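To make the latent-variable idea concrete, here is one common way to write the model (a sketch; the notation is ours, not drawn from a particular source). An observed ordinal response X with categories 1 through K is assumed to be generated by a latent standard normal variable X* and a set of increasing thresholds:

\[X = k \quad \text{if} \quad \tau_{k-1} < X^* \leq \tau_k, \qquad k = 1, \dots, K, \quad \tau_0 = -\infty, \ \tau_K = \infty\]

The polychoric correlation is then the correlation between X* and the analogous latent variable Y* underlying the second ordinal response.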

We can generate this type of data by first drawing a random sample from a bivariate standard normal distribution. We’ll use R to demonstrate (R Core Team 2024). Below we use the rmvnorm() function from the {mvtnorm} package (Genz and Bretz 2009) to randomly sample 1000 observations from a bivariate standard normal distribution with correlation 0.5. This package does not come with R, so you’ll need to install it if you don’t already have it. Since this is a bivariate distribution, we need to specify two means, hence the vector c(0,0). Notice also that the correlation needs to be supplied as a 2 x 2 matrix. The set.seed(99) function allows you to simulate the same data if you wish to follow along.


# install.packages("mvtnorm")
library(mvtnorm)
set.seed(99)
d <- rmvnorm(1000, mean = c(0, 0), 
             sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2))

The result is a two-column matrix:


head(d)


           [,1]        [,2]
[1,]  0.3308166  0.51869175
[2,]  0.1997150  0.45146614
[3,] -0.3187241  0.02458465
[4,] -0.7076863  0.24936114
[5,] -0.6866844 -1.34438217
[6,] -0.4818428  0.69713006

We can calculate the Pearson correlation using the cor() function. Notice it’s close to the “true” value of 0.5 we used to simulate the data.


cor(d)


          [,1]      [,2]
[1,] 1.0000000 0.5089789
[2,] 0.5089789 1.0000000

Now let’s chop this data into five bins using the cut() function and create a cross tabulation. We arbitrarily set the interior breaks to -1, -0.5, 0.5, and 1. We also set the lower and upper bounds to infinity to ensure we capture all values. Since there are six total break values, we’ll have five bins, which we label 1 through 5.


x <- cut(d[,1], breaks = c(-Inf, -1, -0.5, 0.5, 1, Inf), labels = 1:5)
y <- cut(d[,2], breaks = c(-Inf, -1, -0.5, 0.5, 1, Inf), labels = 1:5)
table(x, y)


   y
x     1   2   3   4   5
  1  72  34  40   7   5
  2  39  33  51  17  15
  3  40  50 166  59  68
  4   7  14  68  25  37
  5   3  10  47  38  55

We might pretend this is a cross-tabulation of two survey questions that ask customers to rate satisfaction on two different aspects of a company’s service. Perhaps the scale is 1 to 5, where 1 is poor and 5 is excellent.

To calculate the polychoric correlation between these two variables, we can use the polychor() function from the {polycor} package (Fox 2022). Again you will need to install this package if you don’t already have it.


# install.packages("polycor")
library(polycor)
polychor(x, y)


[1] 0.4967263

Notice the result is quite close to the “true” correlation of 0.5 we used to generate the original continuous data.
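For comparison, we could ignore the latent-variable assumption and compute an ordinary Pearson correlation on the integer codes 1 through 5. This is just an illustrative aside, not something the {polycor} package requires: correlations computed on coarsely categorized data are typically attenuated relative to the latent correlation, which is the problem polychoric correlation is designed to address.


# Pearson correlation treating the ordered categories as numeric scores 1-5;
# expect this to be somewhat smaller than the latent correlation of 0.5
cor(as.numeric(x), as.numeric(y))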

Since we’re assuming this ordered categorical data is based on a latent continuous variable that we’ve grouped into bins, we can request the estimated cutpoints by setting thresholds = TRUE. We see the estimated thresholds are very close to the ones we specified when we chopped our data into bins.


polychor(x, y, thresholds = TRUE)



Polychoric Correlation,  2-step est.  =  0.4967 


  Row Thresholds         
1 -1.0030
2 -0.4874
3  0.5129
4  1.0240


  Column Thresholds         
1 -0.9904
2 -0.5187
3  0.4510
4  0.9154

We can also request a standard error estimate for the polychoric correlation by setting std.err = TRUE. Below the standard error is estimated to be about 0.026. This also returns a test of bivariate normality. The null hypothesis is that the latent data underlying the ordinal categories are bivariate normal. With a p-value of 0.1997, we find no reason to reject that hypothesis.


polychor(x, y, thresholds = TRUE, std.err = TRUE)



Polychoric Correlation, 2-step est. = 0.4967 (0.02567)
Test of bivariate normality: Chisquare = 19.32, df = 15, p = 0.1997

  Row Thresholds         
1 -1.0030
2 -0.4874
3  0.5129
4  1.0240


  Column Thresholds         
1 -0.9904
2 -0.5187
3  0.4510
4  0.9154

The example we just presented is a slight modification of the one provided on the help page for the polychor() function (see help(polychor)). The help page demonstrates not only how to use the polychor() function but also what it assumes when you use it.

If we had many ordered categorical responses and wanted to calculate a correlation matrix, we would need to use the hetcor() function, also in the {polycor} package. Let’s simulate some data and demonstrate. This time we create a 4 x 4 correlation matrix (for four variables), save the simulated data as a data frame, and “apply” the cut() function to all four columns.


set.seed(1)
# create symmetric 4 x 4 correlation matrix
s <- matrix(c(1  , 0.4, 0.6, 0.7,
              0.4, 1  , 0.5, 0.2,
              0.6, 0.5, 1  , 0.3,
              0.7, 0.2, 0.3, 1  ), byrow = TRUE, nrow = 4)
d <- as.data.frame(rmvnorm(1000, mean = c(0, 0, 0, 0), sigma = s))
names(d) <- paste0("x", 1:4)
d[] <- lapply(d, function(x) cut(x, breaks = c(-Inf, -1, -0.5, 0.5, 1, Inf), 
                                 labels = 1:5))

A quick glance shows we have four columns of ordered categorical data.


head(d)


  x1 x2 x3 x4
1  3  3  2  5
2  4  2  3  4
3  5  3  5  4
4  2  1  3  3
5  4  5  5  4
6  3  4  3  1

To calculate a polychoric correlation matrix for all four columns, use the hetcor() function, save the result, and extract the “correlations” element from the object.


cor.out <- hetcor(d)
cor.out$correlations


          x1        x2        x3        x4
x1 1.0000000 0.4247052 0.6000880 0.7130510
x2 0.4247052 1.0000000 0.4845549 0.2325747
x3 0.6000880 0.4845549 1.0000000 0.3222296
x4 0.7130510 0.2325747 0.3222296 1.0000000

With a correlation matrix, we can visualize the correlations using a package such as {corrplot} (Wei and Simko 2024). For example:


# install.packages("corrplot")
library(corrplot)
corrplot(cor.out$correlations, diag = FALSE, addCoef.col = 'black')

Correlation plot of the polychoric correlation matrix.

Please note that calculating many polychoric correlations can be computationally demanding. For that reason, the hetcor() function offers the ability to perform parallel computations on a computer with multiple cores. See help(hetcor) for more information.
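For example, a call along the following lines would spread the work over two cores. This is only a sketch based on the parallel and ncores arguments described on the hetcor() help page in recent versions of {polycor}; check help(hetcor) to confirm the argument names in your installed version.


# a sketch, assuming the parallel/ncores arguments documented in help(hetcor)
cor.par <- hetcor(d, parallel = TRUE, ncores = 2)
cor.par$correlations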

So how is polychoric correlation calculated exactly? It turns out there are two ways to do it: (1) the “two-step” method and (2) maximum likelihood. Let’s first look at the two-step method since that’s the default of the polychor() function.

To begin, we simulate new data with fewer thresholds to make this demonstration easier to follow.


set.seed(12345)
d <- rmvnorm(1000, c(0, 0), matrix(c(1, .5, .5, 1), 2, 2))
x <- cut(d[,1], c(-Inf, -1, 1, Inf), labels = 1:3)
y <- cut(d[,2], c(-Inf, -1, 1, Inf), labels = 1:3)

The first thing we need to do is estimate the thresholds, or where we cut our continuous data to create ordered categories. Above we set the thresholds to -1 and 1, but in real life we won’t know the thresholds. To estimate these, we find the standard normal quantile of the cumulative row and column proportions. To do this we need a cross tabulation of the values:


tab <- table(x, y)
tab


   y
x     1   2   3
  1  59  90   5
  2  98 499  81
  3   5  95  68

Next we need cumulative row and column proportions. We can calculate those using the rowSums(), colSums(), and cumsum() functions.


n <- sum(tab)
# cumulative row proportions
r_prop <- cumsum(rowSums(tab))/n 
r_prop


    1     2     3 
0.154 0.832 1.000 


# cumulative column proportions
c_prop <- cumsum(colSums(tab))/n
c_prop


    1     2     3 
0.162 0.846 1.000 

Now we find the normal quantiles of these proportions using the qnorm() function. If you imagine a normal distribution, the normal quantile is the point on the x-axis to the left of which lies a specified area under the curve. For example, below we see that 0.154 of the area under a standard normal curve lies to the left of -1.0194276. Likewise we see that all of the area (1.000) lies to the left of Inf.


row_th <- qnorm(r_prop)
row_th


         1          2          3 
-1.0194276  0.9620988        Inf 


col_th <- qnorm(c_prop)
col_th


         1          2          3 
-0.9862713  1.0194276        Inf 

The first two values in each vector above are the thresholds, and that completes step one of the “two-step” method. (Note that the print method for polychor() output rounds these values. See help(print.polycor) for more information.)
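As a quick check, we can have polychor() print its estimated thresholds for this 3-category data; the row and column thresholds it reports should agree with row_th and col_th after rounding.


# thresholds reported by polychor() should match row_th and col_th
# (up to the rounding applied by its print method)
polychor(x, y, thresholds = TRUE)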

Now we’re ready to start step two, which is to find the maximum likelihood of our data. Specifically we want to find the most likely correlation of a bivariate standard normal distribution that generated our data. That will be our polychoric correlation.

Recall the cross-tabulation of our ordered categorical data:


tab


   y
x     1   2   3
  1  59  90   5
  2  98 499  81
  3   5  95  68

Using our estimated thresholds and a specified correlation, we can calculate the probability of falling in each of the nine cells of the 3 x 3 table. For example, we see 59 observations where x = 1 and y = 1. What is the joint probability of sampling an observation where x = 1 and y = 1, assuming a correlation of, say, 0.46? Remember, these discrete values came from chopping our continuous data into bins. So when we estimate the joint probability of x = 1 and y = 1, what we really want is the following:

\[P(-\infty < x < -1.0194276, -\infty < y < -0.9862713 )\]

That is, we want the joint probability that x and y both fall within the ranges defined by our thresholds. To get this, we can use the pmvnorm() function in the {mvtnorm} package. The lower and upper arguments specify the boundaries between which we want to find the area under a bivariate normal distribution. Since we’re looking at cell 1 in the upper left corner, the lower limits are both -Inf. The upper limits are the first row and column thresholds, respectively.


pmvnorm(lower = c(-Inf, -Inf), upper = c(row_th[1], col_th[1]),
        corr = matrix(c(1, 0.46, 0.46, 1), ncol = 2))


[1] 0.05842947
attr(,"error")
[1] 1e-15
attr(,"msg")
[1] "Normal Completion"

So assuming a correlation of 0.46, there’s a joint probability of about 0.058 that we’ll see an observation in the ranges that define x = 1 and y = 1.

To do this for all cells, we can construct a for loop:


# create vectors of thresholds with -Inf boundary
row.cuts <- c(-Inf, row_th)
col.cuts <- c(-Inf, col_th)
# create empty matrix to store probabilities
P <- matrix(0, nrow = 3, ncol = 3)
# loop through the cut points to get all 9 probabilities
for (i in 1:3){
  for (j in 1:3){
    P[i, j] <- pmvnorm(lower = c(row.cuts[i], col.cuts[j]), 
                       upper = c(row.cuts[i + 1], col.cuts[j + 1]),
                       corr = matrix(c(1, 0.46, 0.46, 1), ncol = 2))
  }
}
P


            [,1]       [,2]       [,3]
[1,] 0.058429465 0.09117881 0.00439172
[2,] 0.098066361 0.49031464 0.08961899
[3,] 0.005504174 0.10250654 0.05998929
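Since the nine cells partition the entire plane, these probabilities should sum to 1 (up to numerical error) no matter what correlation we plug in. A quick sanity check:


# cell probabilities for a full partition of the plane sum to 1
sum(P)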

Now the log likelihood of this result can be calculated by taking the log of each cell probability, multiplying it by the observed count in that cell, and summing over all nine cells.
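In symbols, if n_ij denotes the observed count in cell (i, j) of the table and π_ij(ρ) is the corresponding cell probability computed under correlation ρ, then the quantity computed below is the multinomial log likelihood (dropping an additive constant that does not involve ρ):

\[\ell(\rho) = \sum_{i=1}^{3} \sum_{j=1}^{3} n_{ij} \log \pi_{ij}(\rho)\]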


sum(tab * log(P))


[1] -1622.565

What we want to do is find the correlation that maximizes the log likelihood. To do that, we first convert our for loop into a function with correlation as an argument and verify that it works.


loglik <- function(rho){
  P <- matrix(0, nrow = 3, ncol = 3)
  R <- matrix(c(1, rho, rho, 1), 2, 2)
  for (i in 1:3) {
    for (j in 1:3) {
      P[i, j] <- pmvnorm(lower = c(row.cuts[i], col.cuts[j]),
                         upper = c(row.cuts[i + 1], col.cuts[j + 1]), 
                         corr = R)
    }
  }
  sum(tab * log(P))
}
loglik(0.46)


[1] -1622.565

When the correlation is 0.46, the log likelihood is -1622.565. What is the log likelihood when the correlation is 0.47?


loglik(0.47)


[1] -1622.343

Since -1622.343 > -1622.565, a correlation of 0.47 is more consistent with the observed data than 0.46.
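Rather than checking candidate values one at a time, we could evaluate the log likelihood over a grid of correlations and look for the peak. This brute-force sketch is only for intuition; the optimize() call described next does the same job more precisely.


# evaluate the log likelihood over a grid of candidate correlations
rho_grid <- seq(0.40, 0.60, by = 0.005)
ll <- sapply(rho_grid, loglik)
rho_grid[which.max(ll)]  # grid value with the highest log likelihood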

We can programmatically find the correlation that produces the maximum log likelihood using the optimize() function. We set the interval argument to consider correlations ranging from -0.99 to 0.99, and set the maximum argument to TRUE since we want to find the maximum log likelihood as opposed to the minimum (the default).


optimize(loglik, interval = c(-0.99, 0.99), maximum = TRUE)


$maximum
[1] 0.4920584

$objective
[1] -1622.139

The correlation that produces the maximum log likelihood is 0.4920584. This is the polychoric correlation. We can check that this is what polychor() returns.


polychor(x, y)


[1] 0.4920583

The alternative method for calculating polychoric correlation is called “maximum likelihood” in the polychor() documentation. This differs from the “two-step” method in that it uses maximum likelihood to estimate the thresholds as well as the polychoric correlation. To use this method, set ML = TRUE. You’ll notice the result is not very different from the “two-step” estimate.


polychor(x, y, ML = TRUE, thresholds = TRUE)



Polychoric Correlation,  ML est.  =  0.4921 


  Row Thresholds         
1 -1.0180
2  0.9622


  Column Thresholds         
1 -0.9854
2  1.0200

We can modify our log likelihood function above to carry out this calculation. The first thing we need to do is move the assignment of thresholds into the function. Next, we need to modify the function to take a single vector of parameters. The argument, parms, takes five values: the correlation, followed by the two row thresholds and the two column thresholds. You’ll notice we have to refer to these values using indexing brackets inside the function, and when we test the function we need to provide all five values. Finally, we change the last line to return the negative log likelihood, since the optim() function we use below minimizes by default.


loglik <- function(parms){
  # parms: the correlation, then the two row thresholds, then the two column thresholds
  row.cuts <- c(-Inf, parms[2:3], Inf)
  col.cuts <- c(-Inf, parms[4:5], Inf)
  P <- matrix(0, nrow = 3, ncol = 3)
  R <- matrix(c(1, parms[1], parms[1], 1), 2, 2)
  for (i in 1:3) {
    for (j in 1:3) {
      P[i, j] <- pmvnorm(lower = c(row.cuts[i], col.cuts[j]),
                         upper = c(row.cuts[i + 1], col.cuts[j + 1]), 
                         corr = R)
    }
  }
  # return the negative log likelihood, since optim() minimizes by default
  -sum(tab * log(P))
}
loglik(parms = c(0.45, c(-0.6, 0.7), c(-0.8, 0.7)))


[1] 1742.688

Now we want to find the five values that minimize the negative log likelihood. For this we turn to the optim() function. Unlike optimize(), which searches over a single parameter, optim() can minimize over several parameters at once. To use the optim() function we need to supply initial values to get it started in its search. Below we supply five values that seem reasonable. We save the function result and inspect the par element.


result <- optim(par = c(0.48, c(-0.7, 0.7), c(-0.7, 0.7)), 
      fn = loglik)
result$par


[1]  0.4921782 -1.0177157  0.9622193 -0.9854590  1.0196164

The first element is the polychoric correlation. The next two are the estimated thresholds for the row variable. The last two are the estimated thresholds for the column variable. These match the polychor() output above.
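As a final check, we can place our optim() estimate of the correlation next to the value polychor() returns with ML = TRUE; they should agree to several decimal places.


# compare our hand-rolled ML estimate to polychor()'s ML estimate
c(optim = result$par[1], polychor = polychor(x, y, ML = TRUE))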


References

  • Drasgow F (1986). Polychoric and polyserial correlations. Pp. 68–74 in S. Kotz and N. Johnson, eds., The Encyclopedia of Statistics, Volume 7. Wiley.
  • Fox J (2022). polycor: Polychoric and Polyserial Correlations. R package version 0.8-1, https://CRAN.R-project.org/package=polycor.
  • Genz A, Bretz F (2009). Computation of Multivariate Normal and t Probabilities, series Lecture Notes in Statistics. Springer-Verlag, Heidelberg. ISBN 978-3-642-01688-2. R package version 1.3-1, https://CRAN.R-project.org/package=mvtnorm.
  • R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
  • Wei T and Simko V (2024). R package ‘corrplot’: Visualization of a Correlation Matrix (Version 0.94). Available from https://github.com/taiyun/corrplot

Clay Ford
Research Data Scientist
University of Virginia Library
October 4, 2024


For questions or clarifications regarding this article, contact statlab@virginia.edu.

View the entire collection of UVA Library StatLab articles, or learn how to cite.