Distribution-Free Confidence Intervals for Percentiles

Percentiles are order statistics. This means they’re determined by ordering observations from smallest to largest and then finding the value below which some percentage of the data lie. The most common percentile is the median. It’s simply the middle value (or the average of the two middle values if there is an even number of observations). Fifty percent of the data lie below the median. Other percentiles frequently of interest are the 25th and 75th percentiles. These are the data values below which lie 25 and 75 percent of the data, respectively. In R, calling summary() or quantile() on a vector of numeric data returns all three of these percentiles. Let’s look at an example.

McAfee (2025) describes the location and sizes of spider webs in Dean Creek Marsh, located on the southern end of Sapelo Island, Georgia, USA. There are 40 measurements in all. One of the measurements made was the height of the web above ground (in cm) measured from the bottom edge of the web. Below we read in the data and summarize the HeightCm variable.


d <- read.csv("https://static.lib.virginia.edu/statlab/materials/data/Webs.csv")
nrow(d)


[1] 40


summary(d$HeightCm)


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   9.00   20.50   32.00   34.36   47.00   87.00       1

Using quantile() we need to set na.rm = TRUE since there is a missing value.


quantile(d$HeightCm, na.rm = TRUE)


  0%  25%  50%  75% 100% 
 9.0 20.5 32.0 47.0 87.0

A histogram shows the heights are slightly skewed right. The median seems to be a better measure of center than the mean.


hist(d$HeightCm)

Histogram of spider web heights that shows a right skew.

Understanding Ranks

The median height of spider webs in the marsh is estimated to be 32 cm. Since there is a missing value, we have 39 observations and thus the median is the middle of the data when sorted from smallest to largest value. We can find the rank of the value using the formula \(r = (n + 1)p\), where n is the number of observations and p is the percentile expressed as a proportion (0.5 for the median). The rank of a value is the position an observation occupies when all the values are sorted from smallest to largest. For 39 values, we can find the rank of the median as follows:


(39 + 1)*0.5


[1] 20

The 20th value of the sorted height data is the median.


# sort() drops NA values by default
sort(d$HeightCm)[20]


[1] 32

The ranks of the 25th and 75th percentiles for 39 observations are as follows:


(39 + 1) * c(0.25, 0.75)


[1] 10 30

Extracting the 10th and 30th values from the sorted data returns the following:


sort(d$HeightCm)[c(10, 30)]


[1] 20 47

Notice the 25th percentile is slightly different than what the summary() and quantile() functions reported. This is because they use a different method to calculate percentiles. There are nine different methods available in R for calculating percentiles. To replicate the result we obtained, we need to set the quantile type to number 6.


# default is quantile.type = 7
summary(d$HeightCm, quantile.type = 6)


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   9.00   20.00   32.00   34.36   47.00   87.00       1


# default is type = 7
quantile(d$HeightCm, type = 6, na.rm = TRUE)


  0%  25%  50%  75% 100% 
   9   20   32   47   87

The terms quantile and percentile are mostly interchangeable. Quantiles can take any value between 0 and 1. Percentiles are quantiles converted to percentages. See this StackExchange discussion for a more nuanced discussion of the differences. Deciding which type of quantile/percentile to calculate is usually not something you need to worry about. It’s fine to accept the default. The method we presented above is the “classic” textbook definition of order statistics.
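
As a small illustration of the difference between the two methods, the sketch below uses a made-up vector of 39 evenly spaced values (not the spider web data). With type = 6 the 25th percentile is the value at rank (n + 1)p = 10, while the default type = 7 interpolates at position 1 + (n - 1)p = 10.5, halfway between the 10th and 11th sorted values.

```r
# Hypothetical data: 39 sorted values 2, 4, ..., 78 (for illustration only)
x <- seq(2, 78, by = 2)
n <- length(x)

# Textbook rank of the 25th percentile: (n + 1) * p
r <- (n + 1) * 0.25

# type = 6 returns the value at rank r; type = 7 averages the 10th and 11th values
q6 <- quantile(x, probs = 0.25, type = 6, names = FALSE)
q7 <- quantile(x, probs = 0.25, type = 7, names = FALSE)

c(rank = r, type6 = q6, type7 = q7)
```

Here the 10th and 11th values are 20 and 22, so the two methods return 20 and 21. This is the same pattern as the 20 versus 20.5 seen with the web heights.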

Confidence Intervals for Medians

When we have a sample of data, a percentile is just an estimate. We would no doubt get a different estimate with a new sample of data. Therefore it’s desirable to summarize the uncertainty of the estimate. Confidence intervals allow us to do this.

Traditional confidence intervals assume the distribution of the statistic is approximately normal. This allows us to easily calculate a standard error with which we can determine a margin of error. (See this StatLab article for a deeper dive into traditional confidence intervals.) However, calculating standard errors for order statistics such as percentiles is very tricky. An alternative approach is to make no assumption about the distribution of the percentile and instead determine the probability a percentile lies in a given interval.

Going back to our spider web data, we estimated the median height to be 32 cm. This was the 20th value when we ordered the data from smallest to largest. What is the probability the true median lies between, say, the 15th and 25th ranked values? If we assume the data is independent and that the probability of a given value being less than the true median is 0.5, we can estimate this probability using a binomial distribution.

To make this clear, recall the median is the middle of the data. Since it’s in the middle, we have probability of 0.5 of sampling a value below the median. For the interval of values ranked 15 to 25 to capture the true median, the 15th ranked value must be less than the median and the 25th ranked value must be greater. In other words, at least 15 but no more than 24 of the sampled values must fall below the median. Assuming a sample size of 39, this suggests we can determine the probability that values ranked 15 to 25 capture the median as follows:

\[P(Y_{15} < \text{median} < Y_{25}) = \sum_{k = 15}^{24} {39 \choose k} 0.5^k 0.5^{39-k}\]

When k = 15, this means the first 15 values are less than the median and all others are greater. This implies the interval between the 15th and 25th ranked values captures the true median. We can calculate the probability of this happening using the dbinom() function.


dbinom(x = 15, size = 39, prob = 0.5)


[1] 0.04573092

So there’s a probability of about 0.046 that exactly 15 of the values are less than the median, in which case the interval created by the 15th and 25th ranked values contains the median.

When k = 16, this means the first 16 values are less than the median and all others are greater. This also implies the interval between the 15th and 25th ranked values captures the true median. This probability is


dbinom(x = 16, size = 39, prob = 0.5)


[1] 0.06859638

We need to sum all probabilities for k ranging from 15 to 24 to get the total probability that the median lies between the 15th and 25th ranked values. Why only to 24? Because if the 25th ranked value is less than the true median, then the interval formed by the 15th and 25th ranked values does not contain the true median.


sum(dbinom(15:24, size = 39, prob = 0.5))


[1] 0.891871

So extracting the 15th and 25th values and forming an interval provides us an approximate 89% confidence interval. The interval created by these values is


c(sort(d$HeightCm)[15], sort(d$HeightCm)[25])


[1] 28 38

Since the calculation of this interval makes no assumptions about the distribution of the median, this type of confidence interval is called a distribution-free confidence interval or a nonparametric confidence interval.
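
The calculation above can be wrapped in a small helper for any pair of ranks, sample size, and percentile. This is only a sketch: rank_coverage() is a hypothetical name, not a function from any package. It also checks the sum against the equivalent difference of binomial cumulative probabilities from pbinom().

```r
# Coverage of the interval between the values ranked `lower` and `upper`
# for the p-th quantile, given n independent observations.
# (rank_coverage is a hypothetical helper name, not a package function.)
rank_coverage <- function(lower, upper, n, p = 0.5) {
  stopifnot(lower >= 1, upper <= n, lower < upper)
  # The interval captures the quantile when the number of observations
  # below it is at least `lower` and at most `upper - 1`
  sum(dbinom(lower:(upper - 1), size = n, prob = p))
}

cov_sum <- rank_coverage(15, 25, n = 39)

# Same quantity from the binomial CDF: P(K <= 24) - P(K <= 14)
cov_cdf <- pbinom(24, size = 39, prob = 0.5) - pbinom(14, size = 39, prob = 0.5)

c(cov_sum, cov_cdf)
```

Both values should match the 0.891871 computed above. Note that the sum runs to upper - 1, which is why the interval for ranks 15 and 25 sums over 15:24.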

Fortunately, the DescTools package provides the QuantileCI() function to carry out these calculations for us. If you don’t have this package, you can install it by running install.packages("DescTools"). To specify the median, we set the probs argument to 0.5.


library(DescTools)
QuantileCI(d$HeightCm, probs = 0.5, conf.level = 0.89, na.rm = TRUE)


   est lwr.ci upr.ci 
    32     28     38 
attr(,"conf.level")
[1] 0.891871

If we wanted a traditional 95% confidence interval, we would need to do some searching to find the upper and lower ranked values using the approach above. However, the QuantileCI() function does this for us when we set conf.level = 0.95.


QuantileCI(d$HeightCm, probs = 0.5, conf.level = 0.95, na.rm = TRUE)


   est lwr.ci upr.ci 
    32     24     38 
attr(,"conf.level")
[1] 0.9615227

We estimate the median height of spider webs to be 32 with a 95% confidence interval of (24, 38). Notice we’re using parentheses instead of square brackets to report the interval. That’s because the interval is formed based on the probability of the median being between these two values. Notice also in the output the confidence level is actually 96%, not 95%. This is because we’re limited to calculating discrete probability intervals based on only 39 observations. Let’s explain what we mean by this. First get the rank order for the values 24 and 38 using the which() function.


c(which(sort(d$HeightCm) == 24), 
  which(sort(d$HeightCm) == 38))


[1] 13 25 26

The height of 24 cm is the 13th ranked value (counting from the smallest). The height of 38 cm is both the 25th and 26th ranked values. In other words, there are two webs that are 38 cm off the ground. The ranking of 38 is therefore the average of 25 and 26: 25.5. This implies the upper bound is the 26th value. The probability the median lies between the 13th and 26th values is


sum(dbinom(13:25, size = 39, prob = 0.5))


[1] 0.9615227

This is the closest we can get to 0.95 without going less than 0.95.

We can verify this by trying all possible intervals around the 20th ranked value (the median). Below we form every combination of intervals between 1-19 and 21-39 using the expand.grid() function.


ints <- expand.grid(lower = 1:19, upper = 21:39)
head(ints) # view first six intervals


  lower upper
1     1    21
2     2    21
3     3    21
4     4    21
5     5    21
6     6    21

Next we write a function to calculate the probability interval and apply it to the combinations of lower and upper intervals using the mapply() function.


f <- function(i, j){
  sum(dbinom(i:j, size = 39, prob = 0.5))
}
ints$prob_int <- mapply(f, i = ints$lower, j = ints$upper)

Then we use the dplyr package to calculate the difference between the probability interval and 0.95, filter the data for those differences greater than 0, and return the rows of the data with minimum differences.


library(dplyr)
ints |> 
  mutate(diff = prob_int - 0.95) |> 
  filter(diff > 0) |> 
  slice_min(diff)


  lower upper  prob_int       diff
1    13    25 0.9615227 0.01152269
2    14    26 0.9615227 0.01152269

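We see there are two possible ranges that get closest to 95% without going less than 95%. Note that f() sums dbinom() over lower:upper inclusive, so the upper column is the largest count k in the sum rather than a rank. As with the sum over 15:24 for the 15th and 25th ranked values, the upper endpoint of each interval is strictly the value ranked upper + 1 (the 26th and 27th values here). Indexing the sorted data by the upper column directly, as below, still gives the correct first interval because the 25th and 26th values are both 38: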


rbind(c(sort(d$HeightCm)[13], sort(d$HeightCm)[25]),
      c(sort(d$HeightCm)[14], sort(d$HeightCm)[26]))


     [,1] [,2]
[1,]   24   38
[2,]   25   38

In cases of ties like this, reporting the wider interval of (24, 38) formed by the 13th and 26th values (the 25th and 26th values are both 38) would be the recommended choice since it’s more conservative. That’s what the QuantileCI() function does.
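
The search can also be written in base R with both columns holding ranks, so the coverage sum runs from lower to upper - 1. This is a sketch under that assumption, and closest_rank_pairs() is a hypothetical helper name rather than part of any package.

```r
# Hypothetical helper: all rank pairs (lower, upper) around the estimate's rank
# whose coverage is the smallest value that is still >= conf.
closest_rank_pairs <- function(n, p = 0.5, conf = 0.95) {
  mid <- round((n + 1) * p)  # rank of the point estimate
  pairs <- expand.grid(lower = 1:(mid - 1), upper = (mid + 1):n)
  # Coverage of the interval between ranks lower and upper:
  # the count of values below the quantile runs from lower to upper - 1
  pairs$coverage <- mapply(
    function(l, u) sum(dbinom(l:(u - 1), size = n, prob = p)),
    pairs$lower, pairs$upper
  )
  ok <- pairs[pairs$coverage >= conf, ]
  # Keep ties, allowing for floating-point noise
  best <- ok[ok$coverage - min(ok$coverage) < 1e-12, ]
  best[order(best$lower), ]
}

res <- closest_rank_pairs(n = 39)
res
```

For 39 observations and the median, this should return the rank pairs (13, 26) and (14, 27), each with coverage of about 0.9615. These are the same two probabilities found above, labeled by the ranks of the interval endpoints.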

Confidence Intervals for Other Percentiles

This same procedure can be applied to other percentiles. For example, recall the 25th percentile was the 10th ranked value.


sort(d$HeightCm)[10]


[1] 20

What confidence level do we get from the interval formed by the 5th and 15th ranked values? Notice this time we change the probability of the binomial distribution to 0.25 since we’re considering the probability of a value falling below the 25th percentile.


sum(dbinom(5:14, size = 39, prob = 0.25))


[1] 0.9366887

An interval formed with the 5th and 15th ranked values is about a 93.7% confidence interval.


sort(d$HeightCm)[c(5,15)]


[1] 15 28

Using the QuantileCI() function gives the same result.


QuantileCI(x = d$HeightCm, probs = 0.25, conf.level = 0.93, na.rm = TRUE)


   est lwr.ci upr.ci 
  20.5   15.0   28.0 
attr(,"conf.level")
[1] 0.9366887

Notice the estimated percentile is calculated using the default method of the summary() and quantile() functions.

Of course we would most likely calculate a 95% confidence interval. This is the default of the QuantileCI() function. If we don’t specify percentiles, the function will return 95% confidence intervals for the minimum, the 25th percentile, the median, the 75th percentile, and the maximum.


QuantileCI(d$HeightCm, na.rm = TRUE)


      est lwr.ci upr.ci
0%    9.0     NA      9
25%  20.5     12     28
50%  32.0     24     38
75%  47.0     38     57
100% 87.0     60     NA
attr(,"conf.level")
[1] 1.0000000 0.9503046 0.9615227 0.9503046 1.0000000

Notice the confidence levels for the 25th, 50th, and 75th percentiles are all slightly above 95% because of the discrete ranges of probability based on the ranked positions.
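
To see what “distribution-free” means in practice, we can check the coverage by simulation. The sketch below is an illustration under assumed conditions, not part of the original analysis: it repeatedly draws 39 values from a right-skewed Exponential(1) distribution (an arbitrary choice) and records how often the interval between the 5th and 15th ranked values contains the true 25th percentile.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n <- 39
true_q25 <- qexp(0.25, rate = 1)  # true 25th percentile of Exponential(1)

# For each simulated sample, check whether the interval between the
# 5th and 15th ranked values contains the true 25th percentile
covered <- replicate(5000, {
  y <- sort(rexp(n, rate = 1))
  y[5] < true_q25 && true_q25 < y[15]
})

sim_coverage <- mean(covered)
exact_coverage <- sum(dbinom(5:14, size = n, prob = 0.25))
c(simulated = sim_coverage, exact = exact_coverage)
```

The simulated proportion should land close to the binomial value of about 0.937, even though nothing in the calculation used the shape of the exponential distribution.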

Bootstrap Confidence Intervals

Another distribution-free approach to calculating confidence intervals for percentiles is the bootstrap. This involves resampling the data with replacement, calculating the percentile of interest, repeating many times, and then summarizing the distribution of bootstrapped percentiles to get a confidence interval. This is easy to implement using the boot package that comes installed with R. First we define a function to calculate the percentile of interest, in this case the median. Notice the function has two arguments: one for the data (x) and one for resampling (i). The i argument is needed to resample values with replacement using indexing brackets. We then use the boot() function to perform the bootstrap resampling. We set R = 1000 to perform 1000 bootstrap replicates. When finished, we call the boot.ci() function to produce several types of bootstrap confidence intervals.


library(boot)
f <- function(x, i) quantile(x[i], probs = 0.5, na.rm = TRUE)
bout <- boot(data = d$HeightCm, 
             statistic = f, 
             R = 1000)
boot.ci(bout)


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bout)

Intervals : 
Level      Normal              Basic         
95%   (25.15, 39.07 )   (26.00, 39.00 )  

Level     Percentile            BCa          
95%      (25, 38 )           (24, 38 )  
Calculations and Intervals on Original Scale

All the confidence intervals are similar. Choosing which interval to report in this case is probably not that important. Carpenter and Bithell (2000) describe the various types of bootstrap confidence intervals and provide guidance on which one to choose. Based on our data and bootstrapping method, they would recommend reporting the BCa (Bias corrected and accelerated) method.
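
To make the resampling idea concrete, here is a minimal hand-rolled version of a percentile bootstrap interval in base R. It uses simulated right-skewed data as a stand-in for the heights (the gamma parameters are arbitrary assumptions), and it is a sketch of the general idea rather than a reproduction of what boot.ci() does internally, so results from the two approaches may differ slightly.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical right-skewed data standing in for the heights
x <- rgamma(39, shape = 3, scale = 12)

# Resample with replacement and recompute the median each time
boot_medians <- replicate(1000, median(sample(x, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrapped medians
pct_ci <- quantile(boot_medians, probs = c(0.025, 0.975), names = FALSE)
pct_ci
```

Because each bootstrapped median is one of the observed values, the interval endpoints tend to fall on or between observed data points, much like the rank-based intervals above.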

We can also get bootstrap confidence intervals using the QuantileCI() function by simply setting method = "boot".


QuantileCI(d$HeightCm, probs = 0.50, na.rm = TRUE, 
           method = "boot", R = 1000)


     est   lwr.ci   upr.ci 
32.00000 26.00000 39.97459

This returns the “Basic” type of bootstrap confidence interval. There is no option to pick a different type.

Hopefully you now have a better understanding of how to summarize uncertainty of estimated percentiles using a distribution-free approach.

R session details

The analysis was done using the R Statistical language (v4.5.2; R Core Team, 2025) on Windows 11 x64, using the packages boot (v1.3.32), DescTools (v0.99.60) and dplyr (v1.1.4).

References

Canty A, Ripley B (2025). boot: Bootstrap Functions. doi:10.32614/CRAN.package.boot https://doi.org/10.32614/CRAN.package.boot, R package version 1.3-32, https://CRAN.R-project.org/package=boot.
Carpenter J & Bithell J (2000), Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Statist. Med., 19: 1141-1164 https://doi.org/10.1002/(SICI)1097-0258(20000515)19:9<1141::AID-SIM479>3.0.CO;2-F.
Hogg RV & Tanis EA (2006), Probability and Statistical Inference. Prentice Hall. (Section 6.10)
luciano (https://stats.stackexchange.com/users/12492/luciano), Percentile vs quantile vs quartile, URL (version: 2015-06-13): https://stats.stackexchange.com/q/156778.
McAfee, B.J. 2025. Spider web distribution and characteristics in Dean Creek Marsh, Sapelo Island, Georgia, USA, October 2024 ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/66971c5358dfe7b15bf460653e077562 (Accessed 2025-11-27).
R Core Team (2025). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Signorell A (2025). DescTools: Tools for Descriptive Statistics. doi:10.32614/CRAN.package.DescTools https://doi.org/10.32614/CRAN.package.DescTools, R package version 0.99.60, https://CRAN.R-project.org/package=DescTools.
Wickham H, Francois R, Henry L, Muller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. doi:10.32614/CRAN.package.dplyr https://doi.org/10.32614/CRAN.package.dplyr, R package version 1.1.4, https://CRAN.R-project.org/package=dplyr.