Polychoric correlation is a measure of association between two ordered categorical variables, each assumed to represent latent continuous variables that have a bivariate standard normal distribution. When we say two variables have a bivariate standard normal distribution, we mean they’re both normally distributed with mean 0 and standard deviation 1, and that they have linear correlation. Polychoric correlation might be suitable for quantifying association between, say, questionnaire items that ask respondents to rate their satisfaction with various services on a scale of 1 to 5. In this scenario it may be plausible to imagine the ordered categorical “satisfaction” response as representing a cutpoint, or threshold, of a continuous latent variable.

We can generate this type of data by first drawing a random sample from a bivariate standard normal distribution. We’ll use R to demonstrate (R Core Team 2024). Below we us the `rmvnorm()`

function from the {mvtnorm} package (Genz and Bretz 2009) to randomly sample 1000 observations from a bivariate standard normal distribution with correlation 0.5. This package does not come with R, so you’ll need to install it if you don’t already have it. Since this is a bivariate distribution, we need to specify two means, hence the vector `c(0,0)`

. Notice also that the correlation needs to be supplied as a 2 x 2 matrix. The `set.seed(99)`

function allows you to simulate the same data if you wish to follow along.

```
# install.packages("mvtnorm")
library(mvtnorm)
set.seed(99)
d <- rmvnorm(1000, mean = c(0, 0),
sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2))
```

The result is a two-column matrix:

```
head(d)
```

```
[,1] [,2]
[1,] 0.3308166 0.51869175
[2,] 0.1997150 0.45146614
[3,] -0.3187241 0.02458465
[4,] -0.7076863 0.24936114
[5,] -0.6866844 -1.34438217
[6,] -0.4818428 0.69713006
```

We can calculate the Pearson correlation using the `cor()`

function. Notice it’s close to the “true” value of 0.5 we used to simulate the data.

```
cor(d)
```

```
[,1] [,2]
[1,] 1.0000000 0.5089789
[2,] 0.5089789 1.0000000
```

Now let’s chop this data into five bins using the `cut()`

function and create a cross tabulation. We arbitrarily set the interior breaks to -1, 0.5, 0.5, and 1. We also set the lower and upper bounds to infinity to ensure we capture all values. Since there are six total break values, we’ll have five bins which we label 1 through 5.

```
x <- cut(d[,1], breaks = c(-Inf, -1, -0.5, 0.5, 1, Inf), labels = 1:5)
y <- cut(d[,2], breaks = c(-Inf, -1, -0.5, 0.5, 1, Inf), labels = 1:5)
table(x, y)
```

```
y
x 1 2 3 4 5
1 72 34 40 7 5
2 39 33 51 17 15
3 40 50 166 59 68
4 7 14 68 25 37
5 3 10 47 38 55
```

We might pretend this is a cross-tabulation of two survey questions that ask customers to rate satisfaction on two different aspects of a company’s service. Perhaps the scale is 1 to 5, where 1 is poor and 5 is excellent.

To calculate the polychoric correlation between these two variables, we can use the `polychor()`

function from the {polycor} package (Fox 2022). Again you will need to install this package if you don’t already have it.

```
# install.packages("polycor")
library(polycor)
polychor(x, y)
```

```
[1] 0.4967263
```

Notice the result is quite close to the “true” correlation of 0.5 we used to generate the original continuous data.

Since we’re assuming this ordered categorical data is based on a latent continuous variable that we’ve grouped into bins, we can request the estimated cutpoints by setting `thresholds = TRUE`

. We see the estimated thresholds are very close to the ones we specified when we chopped our data into bins.

```
polychor(x, y, thresholds = TRUE)
```

```
Polychoric Correlation, 2-step est. = 0.4967
Row Thresholds
1 -1.0030
2 -0.4874
3 0.5129
4 1.0240
Column Thresholds
1 -0.9904
2 -0.5187
3 0.4510
4 0.9154
```

We can also request a standard error estimate for the polychoric correlation by setting `std.err = TRUE`

. Below the standard error is estimated to be about 0.03. This also returns a test of bivariate normality. The null hypothesis is the latent data underlying the ordinal categories is bivariate normal. With a p-value of 0.1997, we find no reason to reject that hypothesis.

```
polychor(x, y, thresholds = TRUE, std.err = TRUE)
```

```
Polychoric Correlation, 2-step est. = 0.4967 (0.02567)
Test of bivariate normality: Chisquare = 19.32, df = 15, p = 0.1997
Row Thresholds
1 -1.0030
2 -0.4874
3 0.5129
4 1.0240
Column Thresholds
1 -0.9904
2 -0.5187
3 0.4510
4 0.9154
```

This example we just presented is a slight modification of the example provided on the help page for the `polychor()`

function (see `help(polychor)`

). The help page demonstrates not only how to use the `polychor()`

function but what it assumes when you use it.

If we had many ordered categorical responses and wanted to calculate a correlation matrix, we would need to use the `hetcor()`

function, also in the {polycor} package. Let’s simulate some data and demonstrate. This time we create a 4 x 4 correlation matrix (for four variables), save the simulated data as a data frame, and “apply” the `cut()`

function to all four columns.

```
set.seed(1)
# create symmetric 4 x 4 correlation matrix
s <- matrix(c(1 , 0.4, 0.6, 0.7,
0.4, 1 , 0.5, 0.2,
0.6, 0.5, 1, 0.3,
0.7, 0.2, 0.3, 1), byrow = T, nrow = 4)
d <- as.data.frame(rmvnorm(1000, mean = c(0, 0, 0, 0), sigma = s))
names(d) <- paste0("x", 1:4)
d[] <- lapply(d, function(x)cut(x, breaks = c(-Inf, -1, -0.5, 0.5, 1, Inf), labels = 1:5))
```

A quick glance shows we have four columns of ordered categorical data.

```
head(d)
```

```
x1 x2 x3 x4
1 3 3 2 5
2 4 2 3 4
3 5 3 5 4
4 2 1 3 3
5 4 5 5 4
6 3 4 3 1
```

To calculate a polychoric correlation matrix for all four columns, use the `hetcor()`

function, save the result, and extract the “correlations” element from the object.

```
cor.out <- hetcor(d)
cor.out$correlations
```

```
x1 x2 x3 x4
x1 1.0000000 0.4247052 0.6000880 0.7130510
x2 0.4247052 1.0000000 0.4845549 0.2325747
x3 0.6000880 0.4845549 1.0000000 0.3222296
x4 0.7130510 0.2325747 0.3222296 1.0000000
```

With a correlation matrix, we can visualize the correlations using a package such as {corrplot} (Wei and Simko 2024). For example:

```
# install.packages("corrplot")
library(corrplot)
corrplot(cor.out$correlations, diag = FALSE, addCoef.col = 'black')
```

Please note that calculating many polychoric correlations can be computationally demanding. For that reason, the `hetcor()`

function offers the ability to perform parallel computations on a computer with multiple cores. See `help(hetcor)`

for more information.

So how is polychoric correlation calculated exactly? It turns out there are two ways to do it: (1) the “two-step” method and (2) maximum likelihood. Let’s first look at the two-step method since that’s the default of the `polychor()`

function.

To begin, we simulate new data with fewer thresholds to make this demonstration easier to follow.

```
set.seed(12345)
d <- rmvnorm(1000, c(0, 0), matrix(c(1, .5, .5, 1), 2, 2))
x <- cut(d[,1], c(-Inf, -1, 1, Inf), labels = 1:3)
y <- cut(d[,2], c(-Inf, -1, 1, Inf), labels = 1:3)
```

The first thing we need to do is estimate the thresholds, or where we cut our continuous data to create ordered categories. Above we set the thresholds to -1 and 1, but in real life we won’t know the thresholds. To estimate these, we find the standard normal quantile of the cumulative row and column proportions. To do this we need a cross tabulation of the values:

```
tab <- table(x, y)
tab
```

```
y
x 1 2 3
1 59 90 5
2 98 499 81
3 5 95 68
```

Next we need cumulative row and column proportions. We can calculate those using the `rowSums()`

, `colSums()`

, and `cumsum()`

functions.

```
n <- sum(tab)
# cumulative row proportions
r_prop <- cumsum(rowSums(tab))/n
r_prop
```

```
1 2 3
0.154 0.832 1.000
```

```
# cumulative column proportions
c_prop <- cumsum(colSums(tab))/n
c_prop
```

```
1 2 3
0.162 0.846 1.000
```

Now we find the normal quantiles of these proportions using the `qnorm()`

function. If you imagine a normal distribution, the normal quantile is the point on the x-axis to the left of which lies a specified area under the curve. For example, below we see that 0.154 of the area under a standard normal curve lies to the left of -1.0194276. Likewise we see that all of the area (1.000) lies to the left of `Inf`

.

```
row_th <- qnorm(r_prop)
row_th
```

```
1 2 3
-1.0194276 0.9620988 Inf
```

```
col_th <- qnorm(c_prop)
col_th
```

```
1 2 3
-0.9862713 1.0194276 Inf
```

The first two values in each vector above are the thresholds and that completes step one of the “two-step” method. (Note that the print method for the `polychor()`

function rounds these values. See `help(print.polycor )`

for more information.)

Now we’re ready to start step two, which is to find the *maximum likelihood* of our data. Specifically we want to find the most likely *correlation* of a bivariate standard normal distribution that generated our data. That will be our polychoric correlation.

Recall the cross-tabulation of our ordered categorical data:

```
tab
```

```
y
x 1 2 3
1 59 90 5
2 98 499 81
3 5 95 68
```

Using our estimated thresholds and a specified correlation, we can calculate the probability of being in a cell for each of the nine cells in the 3 x 3 table. For example, we see 59 observations where x = 1 and y = 1. What is the joint probability of sampling an observation where x = 1 and y = 1 assuming a correlation of, say, 0.46? Remember, these discrete values came from chopping our continuous data into bins. So when we estimate the joint probability of x = 1 and y = 1, what we really want is the following:

\[P(-\infty < x < -1.0194276, -\infty < y < -0.9862713 )\]

That is, we want the joint probability that x and y both fall *within the ranges* defined by our thresholds. To get this, we can use the `pmvnorm()`

function in the {mvtnorm} package. The `lower`

and `upper`

arguments specify the boundaries between which we want to find the area under a bivariate normal distribution. Since we’re looking at cell 1 in the upper left corner, the lower limits are both `-Inf`

. The upper limits are the first row and column thresholds, respectively.

```
pmvnorm(lower = c(-Inf, -Inf), upper = c(row_th[1], col_th[1]),
corr = matrix(c(1, 0.46, 0.46, 1), ncol = 2))
```

```
[1] 0.05842947
attr(,"error")
[1] 1e-15
attr(,"msg")
[1] "Normal Completion"
```

So assuming a correlation of 0.46, there’s a joint probability of about 0.058 that we’ll see an observation in the ranges that define x = 1 and y = 1.

To do this for all cells, we can construct a for loop:

```
# create vectors of thresholds with -Inf boundary
row.cuts <- c(-Inf, row_th)
col.cuts <- c(-Inf, col_th)
# create empty matrix to store probabilities
P <- matrix(0, nrow = 3, ncol = 3)
# loop through the cut points to get all 9 probabilities
for (i in 1:3){
for (j in 1:3){
P[i, j] <- pmvnorm(lower = c(row.cuts[i], col.cuts[j]),
upper = c(row.cuts[i + 1], col.cuts[j + 1]),
corr = matrix(c(1, 0.46, 0.46, 1), ncol = 2))
}
}
P
```

```
[,1] [,2] [,3]
[1,] 0.058429465 0.09117881 0.00439172
[2,] 0.098066361 0.49031464 0.08961899
[3,] 0.005504174 0.10250654 0.05998929
```

Now the likelihood of this result can be calculated by taking the log of the probabilities, multiplying by the observed counts in each cell, and taking the sum. This is the *log likelihood*.

```
sum(tab * log(P))
```

```
[1] -1622.565
```

What we want to do is find the *maximum* log likelihood for a given correlation. To do that, we first convert our for loop into a function with correlation as an argument and verify that it works.

```
loglik <- function(rho){
P <- matrix(0, nrow = 3, ncol = 3)
R <- matrix(c(1, rho, rho, 1), 2, 2)
for (i in 1:3) {
for (j in 1:3) {
P[i, j] <- pmvnorm(lower = c(row.cuts[i], col.cuts[j]),
upper = c(row.cuts[i + 1], col.cuts[j + 1]),
corr = R)
}
}
sum(tab * log(P))
}
loglik(0.46)
```

```
[1] -1622.565
```

When correlation is 0.46, the log likelihood is -1622.565. What is the log likelihood when correlation is 0.47?

```
loglik(0.47)
```

```
[1] -1622.343
```

-1622.343 > -1622.565, so 0.47 seems more likely than 0.46.

We can programmatically find the correlation that produces the maximum log likelihood using the `optimize()`

function. We set the `interval`

argument to consider correlations ranging from -0.99 to 0.99, and set the `maximum`

argument to TRUE since we want to find the maximum log likelihood as opposed to the minimum (the default).

```
optimize(loglik, interval = c(-0.99, 0.99), maximum = TRUE)
```

```
$maximum
[1] 0.4920584
$objective
[1] -1622.139
```

The correlation that produces the maximum log likelihood is 0.4920584. This is the polychoric correlation. We can check that this is what `polychor()`

returns.

```
polychor(x, y)
```

```
[1] 0.4920583
```

The alternative method to calculating polychoric correlation is called “maximum likelihood” in the `polychor()`

documentation. This differs from the “two-step” method in that it uses maximum likelihood to *estimate the thresholds* as well as the polychoric correlation. To use this method, set `ML = TRUE`

. You’ll notice the result is not very different from the “two-step” method.

```
polychor(x, y, ML = TRUE, thresholds = TRUE)
```

```
Polychoric Correlation, ML est. = 0.4921
Row Thresholds
1 -1.0180
2 0.9622
Column Thresholds
1 -0.9854
2 1.0200
```

We can modify our log likelihood function above to carry out this calculation. The first thing we need to do is move the assignment of thresholds into the function. Next, we need to modify the function argument to take a vector of values. The function argument, `parms`

, takes 5 values: the correlation and the two row and column thresholds, respectively. You’ll notice we have to refer to these values using indexing brackets in the function. When we test the function, we need to provide five values. Finally, we change the last line to calculate the *negative* log likelihood.

```
loglik <- function(parms){
row.cuts <- c(-Inf, parms[2:3], Inf)
col.cuts <- c(-Inf, parms[4:5], Inf)
P <- matrix(0, nrow = 3, ncol = 3)
R <- matrix(c(1, parms[1], parms[1], 1), 2, 2)
for (i in 1:3) {
for (j in 1:3) {
P[i, j] <- pmvnorm(lower = c(row.cuts[i], col.cuts[j]),
upper = c(row.cuts[i + 1], col.cuts[j + 1]),
corr = R)
}
}
-sum(tab * log(P))
}
loglik(parms = c(0.45, c(-0.6, 0.7), c(-0.8, 0.7)))
```

```
[1] 1742.688
```

Now we want to find the five values that *minimize* the negative log likelihood. For this we turn to the `optim()`

function. Unlike `optimize()`

, which only works for one value, the `optim()`

function can find a minimum based on multiple values. To use the `optim()`

function we need to supply initial values to get it started in its search. Below we supply five values that seem reasonable. We save the function result and inspect the `par`

element.

```
result <- optim(par = c(0.48, c(-0.7, 0.7), c(-0.7, 0.7)),
fn = loglik)
result$par
```

```
[1] 0.4921782 -1.0177157 0.9622193 -0.9854590 1.0196164
```

The first element is the polychoric correlation. The next two are the estimated thresholds for the row variable. The last two are the estimated thresholds for the column variable. These match the `polychor()`

output above.

## References

- Drasgow F (1986) Polychoric and polyserial correlations. Pp. 68–74 in S. Kotz and N. Johnson, eds., The Encyclopedia of Statistics, Volume 7. Wiley.
- Fox J (2022).
*polycor: Polychoric and Polyserial Correlations*. R package version 0.8-1, https://CRAN.R-project.org/package=polycor. - Genz A, Bretz F (2009).
*Computation of Multivariate Normal and t Probabilities*, series Lecture Notes in Statistics. Springer-Verlag, Heidelberg. ISBN 978-3-642-01688-2. R package version 1.3-1, https://CRAN.R-project.org/package=mvtnorm. - R Core Team (2024).
*R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. - Wei T and Simko V (2024). R package ‘corrplot’: Visualization of a Correlation Matrix (Version 0.94). Available from https://github.com/taiyun/corrplot

*Clay Ford*

*Research Data Scientist*

*University of Virginia Library*

*October 4, 2024*

**For questions or clarifications regarding this article, contact statlab@virginia.edu.**

View the entire collection of UVA Library StatLab articles, or learn how to cite.