Bootstrapped Association Rule Mining in R

Association rule mining is a machine learning technique designed to uncover associations between categorical variables in data. It produces association rules that reflect how frequently categories are found together. A common application of association rule mining is the “Frequently Bought Together” section of an online store’s checkout page: by mining across that store’s transactions, items that are frequently bought together (e.g., shaving cream and razors) can be identified.

Association rules are structured as “if-then” statements. For instance: if a customer bought shaving cream, then they also bought razors. The “if” and “then” portions of association rules are referred to as the antecedent (“if”) and the consequent (“then”). It is worth noting that association rules merely denote co-occurrence, not causal patterns. Based on the results of association rule mining, we cannot determine what caused customers to buy razors, merely that razor purchases are associated with shaving cream purchases.

There are several algorithms available for association rule mining. We will focus on the Apriori algorithm for this article, as it is one of the most common. The Apriori algorithm searches the data for the most frequently occurring sets of items. Typically, the antecedent can contain any number of items, while the consequent contains only one item. To evaluate the importance of each rule, several common metrics can be computed. For this article we will focus on only three: support, confidence, and lift. Support is the proportion of transactions in the dataset that contain the itemset. Confidence is the proportion of transactions containing the antecedent that also contain the consequent. Lift represents how much more likely the consequent is to occur when the antecedent is present as compared to when it is absent; it is the ratio of the confidence of the rule to the frequency of the consequent in the whole dataset.
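Concretely, for a rule {A} => {B} mined from a set of transactions, these three metrics can be written as:


support({A} => {B}) = (number of transactions containing both A and B) / (total number of transactions)
confidence({A} => {B}) = support({A} => {B}) / support({A})
lift({A} => {B}) = confidence({A} => {B}) / support({B})


A lift greater than 1 indicates that the antecedent and consequent appear together more often than we would expect if they were independent.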

Association rule mining in R

To conduct association rule mining in R (v4.4.0; R Core Team, 2023), we will use the apriori() function from the arules package (v1.7-7; Hahsler et al., 2023). The arules package includes a dataset called Groceries, which contains 1 month (30 days) of transactions from a real-world grocery store: 9,835 transactions across 169 item categories. First, we will load the arules package; then, we will load in the Groceries dataset. If you don’t already have the arules package, you can install it by running install.packages("arules") in the console.


library(arules) # Load the arules package
data(Groceries) # Read Groceries into Global Environment

Let’s take a look at the Groceries dataset. To inspect the dataset, let’s first call the class() function to show us the class of the object.


class(Groceries) # Inspect class of Groceries dataset


[1] "transactions"
attr(,"package")
[1] "arules"

The Groceries dataset is of class “transactions”. This class of object is specific to the arules package, and the functions in the arules package used to conduct association rule mining specifically take objects of class “transactions”. To convert an object to this class, the transactions() function can be used. Objects of this class have a unique structure. They are S4 objects, meaning that to access anything contained within Groceries we need to use the “@” symbol rather than “$”. To look at the items contained in Groceries, we can access the itemInfo slot by running Groceries@itemInfo, which returns a data frame of all the items (and their accompanying information) contained in the Groceries object. Since it’s a data frame, we can wrap it in head() to look at the first six rows.


head(Groceries@itemInfo)


             labels  level2           level1
1       frankfurter sausage meat and sausage
2           sausage sausage meat and sausage
3        liver loaf sausage meat and sausage
4               ham sausage meat and sausage
5              meat sausage meat and sausage
6 finished products sausage meat and sausage

This shows us that there are three variables called labels, level2, and level1. The labels variable contains all of the items; level2 shows what category those items are in, and level1 shows the category that level2 is in.
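As an aside, if your own data are not already of class “transactions”, the transactions() function mentioned above can convert them. Below is a minimal sketch using a few made-up purchases; the item names and the my_baskets object are purely illustrative.


# Each list element is one hypothetical transaction
my_baskets <- list(c("shaving cream", "razors"),
                   c("razors", "aftershave"),
                   c("shaving cream", "razors", "aftershave"))
my_trans <- transactions(my_baskets) # Convert the list to class "transactions"
inspect(my_trans) # Print the three transactions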

The apriori() function in the arules package will allow us to apply the Apriori algorithm to these data. This function defaults to mining only rules with a minimum support value of 0.10, a minimum confidence value of 0.80, and a maximum rule length of 10 items. To prevent the function from running for too long, it will also time out while checking for subsets after 5 seconds. Rule length refers to the entire itemset, including both the antecedent and the consequent. The function defaults to a minimum rule length of 1, meaning that a rule could be returned containing only the consequent (i.e., with an empty antecedent). Let’s run the first model using these preset values and save the model as an object called rules_1.


rules_1 <- apriori(Groceries)


Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.8    0.1    1 none FALSE            TRUE       5     0.1      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 983 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [8 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


rules_1


set of 0 rules 

Using the default settings, this returns 0 rules. Let’s adjust some of the defaults and see how this affects the model. Within the apriori() function we will pass a named list to the parameter argument. In this list, we will adjust the minimum support value with supp, the minimum confidence value with conf, and the maximum rule length with maxlen. We will save this model to an object called rules_2.


rules_2 <- apriori(Groceries, parameter = list(supp = .01, # Minimum Support value
                                               conf = .5, # Minimum Confidence value
                                               maxlen = 20)) # Maximum rule length


Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5    0.01      1
 maxlen target  ext
     20  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 98 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [88 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [15 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


length(rules_2)


[1] 15

This returns a set of 15 rules. We can use the inspect() function to print the rules contained in this object; wrapping the result in head() keeps the first six.


rules_2_summary <- head(inspect(rules_2))


rules_2_summary


                                       lhs             rhs    support
[1]                         {curd, yogurt} => {whole milk} 0.01006609
[2]             {other vegetables, butter} => {whole milk} 0.01148958
[3]      {other vegetables, domestic eggs} => {whole milk} 0.01230300
[4]           {yogurt, whipped/sour cream} => {whole milk} 0.01087951
[5] {other vegetables, whipped/sour cream} => {whole milk} 0.01464159
[6]          {pip fruit, other vegetables} => {whole milk} 0.01352313
    confidence   coverage     lift count
[1]  0.5823529 0.01728521 2.279125    99
[2]  0.5736041 0.02003050 2.244885   113
[3]  0.5525114 0.02226741 2.162336   121
[4]  0.5245098 0.02074225 2.052747   107
[5]  0.5070423 0.02887646 1.984385   144
[6]  0.5175097 0.02613116 2.025351   133

In the output of rules_2 there are columns labeled rhs (right hand side) and lhs (left hand side). These refer to the antecedent (lhs) and consequent (rhs). The first rule can be read as “if curd and yogurt were purchased, then so was whole milk.”
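Rule sets can also be reordered by any of the quality measures before inspection. For example, here is a small sketch that sorts rules_2 by confidence and inspects the three rules with the highest confidence values:


# Sort the rules by confidence (highest first) and inspect the top three
inspect(head(sort(rules_2, by = "confidence"), 3))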

The way we specified the apriori() function for rules_2 allowed it to search through all possible antecedent and consequent combinations. If we want, we can also define which items may appear in the antecedent and which may appear in the consequent. A list of candidate items can be supplied for the antecedent; the algorithm will search through possible combinations of them, ranging from no items to all items. A list of candidate items can likewise be supplied for the consequent, though only one item can appear in the consequent of a single rule. The algorithm will then produce a set of rules in which each consequent contains one of the items from its list and each antecedent contains some combination of items from its list. In the rules_3 object, we will use the appearance argument of the apriori() function to set these values. The appearance argument takes a list with rhs and lhs elements.


rules_3 <- apriori(Groceries, 
                   # Consequent
                   appearance = list(rhs = c("whole milk", "other vegetables"),
                                     # Antecedent
                                     lhs = c("yogurt", "whipped/sour cream", "tropical fruit", "root vegetables",
                                             "citrus fruit", "domestic eggs", "pip fruit", "butter", "curd", "rolls/buns")),
                   parameter = list(supp = .01, # Minimum Support value
                                    conf = .2, # Minimum Confidence value
                                    maxlen = 20)) # Maximum rule length


Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.2    0.1    1 none FALSE            TRUE       5    0.01      1
 maxlen target  ext
     20  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 98 

set item appearances ...[12 item(s)] done [0.00s].
set transactions ...[12 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [12 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [37 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


rules_3_summary <- head(inspect(rules_3))


rules_3_summary


                lhs                   rhs    support confidence   coverage
[1]              {} =>       {whole milk} 0.25551601  0.2555160 1.00000000
[2]          {curd} => {other vegetables} 0.01718353  0.3225191 0.05327911
[3]          {curd} =>       {whole milk} 0.02613116  0.4904580 0.05327911
[4]        {butter} => {other vegetables} 0.02003050  0.3614679 0.05541434
[5]        {butter} =>       {whole milk} 0.02755465  0.4972477 0.05541434
[6] {domestic eggs} => {other vegetables} 0.02226741  0.3509615 0.06344687
        lift count
[1] 1.000000  2513
[2] 1.666829   169
[3] 1.919481   257
[4] 1.868122   197
[5] 1.946053   271
[6] 1.813824   219

It is possible for redundant rules to be produced; here, we will flag a rule as redundant if a higher-lift rule is contained within it (i.e., if the items of a higher-lift rule form a subset of its items). To find all such rules, we can use the is.subset() function from the arules package.


rules_3.sorted <- sort(rules_3, by = "lift") # Sort by lift, highest first
# Identify redundant rules: flag a rule if a higher-lift rule is a subset of it
subset.matrix <- is.subset(rules_3.sorted, rules_3.sorted)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- FALSE
redundant <- colSums(subset.matrix) >= 1
# Which rules are redundant?
which(redundant)


named integer(0)


# Remove redundant rules
rules_3_pruned <- rules_3.sorted[!redundant]
rules_3_pruned


set of 37 rules 

No rules were returned as redundant.
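As an alternative, the arules package also provides a convenience function, is.redundant(), that flags redundant rules directly (note that it uses a confidence-based definition of redundancy; see ?is.redundant). A sketch:


# Alternative pruning using is.redundant()
rules_3_pruned2 <- rules_3[!is.redundant(rules_3)]
rules_3_pruned2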

Bootstrapping

Applying association rule mining to a sample provides insight into the association rules and pairings present in that particular sample. In many analyses, however, we are more interested in generalizing to other samples. We can apply a bootstrapping approach to assess the reproducibility of the rules we uncovered. For this example, we will generate 1,000 bootstrapped samples by randomly sampling (with replacement) from the original dataset and apply the Apriori algorithm to each one. We will then investigate the association rules found in each sample and retain only those rules that appeared in 90% or more of the bootstrapped samples. Association rule strength metrics (confidence, lift, and support) can then be computed on each bootstrapped sample, and from these we can calculate the mean and a 95% confidence interval for each metric.
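The core resampling step is straightforward. As an illustration, a single bootstrapped sample of the Groceries transactions can be drawn like this (the boot_idx and boot_sample names are just for illustration):


set.seed(1) # For a reproducible illustration
boot_idx <- sample(length(Groceries), replace = TRUE) # Resample transaction indices with replacement
boot_sample <- Groceries[boot_idx] # A new transactions object with only the resampled cases
boot_sample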

Bootstrapping the apriori() function

To bootstrap the apriori() function, we can set up a function called boot_apriori(). This function will accept seven arguments: data (dataset), rhs (consequent), lhs (antecedent), confidence (minimum confidence), minlen (minimum rule length), maxlen (maximum rule length), and supp (minimum support value). First, the function produces a bootstrapped sample using an embedded helper function called bootstrap_sample(), which samples transactions with replacement and returns a new sample containing only those cases. The apriori() function is then applied to the resampled data. Finally, the rules produced by the Apriori algorithm are assessed for redundancy, and the pruned set of rules is returned.


boot_apriori <- function(data, rhs, lhs, confidence, minlen, maxlen, supp){
  # Create bootstrap samples using an embedded function
  bootstrap_sample <- function(data) { # This function only uses the given data
    sampled_ids <- sample(seq_along(data), replace = TRUE) # Sample with replacement
    sampled_transactions <- data[sampled_ids] # Create new data with only those cases
    return(sampled_transactions)
  }
  
  # Generate a bootstrap sample of the data
  boot_data <- bootstrap_sample(data = data)

  # Apply the Apriori algorithm with the supplied parameters
  boot_rules <- apriori(boot_data,
                        parameter = list(confidence = confidence, minlen = minlen,
                                         maxlen = maxlen, supp = supp),
                        appearance = list(lhs = lhs, # antecedent (if)
                                          rhs = rhs)) # consequent (then)
  
  # Find and remove redundant rules (same approach as above)
  boot_rules.sorted <- sort(boot_rules, by = "lift")
  boot_subset.matrix <- is.subset(boot_rules.sorted, boot_rules.sorted)
  boot_subset.matrix[lower.tri(boot_subset.matrix, diag = TRUE)] <- FALSE
  if(length(boot_subset.matrix) > 0){
    boot_redundant <- colSums(boot_subset.matrix) >= 1
    boot_rules.pruned <- boot_rules.sorted[!boot_redundant]
  } else {
    boot_rules.pruned <- boot_rules
  }
  return(boot_rules.pruned)
}

We could apply this function 1,000 times using a process such as a for() loop or the replicate() function; a sketch of the serial replicate() approach is shown below. However, this could take a very long time to run.
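For reference, a serial version using replicate() might look like the following sketch (shown with only 10 replications, since 1,000 serial runs would be slow; the models_serial name is illustrative):


# Run boot_apriori() serially; simplify = FALSE returns a list of rule sets
models_serial <- replicate(
  10,
  boot_apriori(Groceries,
               rhs = c("whole milk", "other vegetables"),
               lhs = c("yogurt", "whipped/sour cream", "tropical fruit",
                       "root vegetables", "citrus fruit", "domestic eggs",
                       "pip fruit", "butter", "curd", "rolls/buns"),
               confidence = .2, minlen = 1, maxlen = 20, supp = .01),
  simplify = FALSE
)


To get a sense of the cost, let’s see how long one iteration takes to run using the tic() and toc() functions from the tictoc package (v1.2.1; Izrailev, 2023).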


# Load the tictoc package
library(tictoc)


tic() # Start the timer
set.seed(15) # Set seed
test <- boot_apriori(Groceries, 
                     # Consequent
                     rhs = c("whole milk", "other vegetables"),
                     # Antecedent
                     lhs = c("yogurt", "whipped/sour cream", "tropical fruit", "root vegetables",
                             "citrus fruit", "domestic eggs", "pip fruit", "butter", "curd", "rolls/buns"),
                     supp = .01, # Minimum support value
                     confidence = .2, # Minimum confidence value
                     minlen = 1, # Minimum rule length
                     maxlen = 20) # Maximum rule length


Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.2    0.1    1 none FALSE            TRUE       5    0.01      1
 maxlen target  ext
     20  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 98 

set item appearances ...[12 item(s)] done [0.00s].
set transactions ...[12 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [12 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [37 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


toc() # Stop current timer


0.03 sec elapsed


test_summary <- head(inspect(test))


test_summary


                                  lhs                   rhs    support
[1] {tropical fruit, root vegetables} => {other vegetables} 0.01148958
[2]      {yogurt, whipped/sour cream} => {other vegetables} 0.01057448
[3]         {root vegetables, yogurt} => {other vegetables} 0.01220132
[4]                  {butter, yogurt} =>       {whole milk} 0.01016777
[5]     {root vegetables, rolls/buns} => {other vegetables} 0.01016777
[6]          {tropical fruit, yogurt} => {other vegetables} 0.01443823
    confidence   coverage     lift count
[1]  0.5765306 0.01992883 3.000094   113
[2]  0.5200000 0.02033554 2.705926   104
[3]  0.5150215 0.02369090 2.680019   120
[4]  0.6451613 0.01576004 2.490252   100
[5]  0.4784689 0.02125064 2.489810   100
[6]  0.4565916 0.03162176 2.375968   142

In this test, the algorithm found 37 rules. A single run did not take very long; however, doing this 1,000 times would take considerably longer. To speed up the process, let’s run the function using parallel processing.

Parallel processing

Parallel processing allows a large task to be split into smaller tasks that are distributed efficiently across the cores of a machine. The parallel package is included with R and contains a set of functions that can parallelize most processes. First, we will create a cluster of worker processes with the makeCluster() and detectCores() functions. Because each worker runs in a separate R session, we then need to send the workers everything the boot_apriori() function requires: we load arules on each worker and export all objects being passed to boot_apriori() from the Global Environment with the clusterExport() function. Finally, we can use the parLapply() function to run the boot_apriori() function 1,000 times and produce a list called models_1.


# Load the parallel package
library(parallel)
tic()
nreps <- 1000 # Repeat the process 1000 times
# Save all arguments as objects
rhs <- c("whole milk", "other vegetables")
lhs <- c("yogurt", "whipped/sour cream", "tropical fruit", "root vegetables",
         "citrus fruit", "domestic eggs", "pip fruit", "butter", "curd", "rolls/buns") 
confidence <- .05
minlen <- 1
maxlen <- 20
supp <- .01
# Define the number of clusters
cl <- makeCluster(detectCores() - 1)
# Set a reproducible random-number stream on the workers
# (set.seed() in the main session does not control worker RNG)
clusterSetRNGStream(cl, 55)
# Load arules on each worker so apriori() and friends are available there
invisible(clusterEvalQ(cl, library(arules)))
# Export objects
clusterExport(cl, varlist = c("Groceries", "rhs", "lhs", "supp", "boot_apriori", "nreps",
                              "confidence", "minlen", "maxlen"))
models_1 <- parLapply(cl, 1:nreps, function(x) 
  boot_apriori(Groceries, rhs, lhs, confidence, minlen, maxlen, supp))

# Stop the cluster
stopCluster(cl)
toc()


2.783 sec elapsed

By bootstrapping association rule mining, we can gain insight into which rules would replicate if we were to keep resampling from the population. This gives a better sense of whether a rule is an association that happens to appear in the given sample or one that actually exists in the population. To investigate this in models_1, we can look at which rules exist in 90% or more of the bootstrapped samples.


library(dplyr)

# Organize the list of models produced by boot_apriori() into data frames
rules <- lapply(models_1, inspect) # inspect() each rule set; returns data frames


# Organize into data.frame
results <- do.call(rbind, rules)

# Count how many samples a rule appeared in
results_count <- results %>%
  select(lhs,rhs) %>%
  group_by(lhs, rhs) %>%
  summarise(boot = n()) %>%
  ungroup()

results <- merge(results, results_count)

# Keep only rules that appeared in 90% or more of the bootstrapped samples
results_final <- results[results$boot >= .9*nreps,]

table(results_final$lhs, results_final$rhs)


                                   
                                    {other vegetables} {whole milk}
  {}                                               921          921
  {butter}                                         921          921
  {citrus fruit}                                   921          921
  {curd}                                           921          921
  {domestic eggs}                                  921          921
  {pip fruit}                                      921          921
  {rolls/buns}                                     921          921
  {root vegetables, rolls/buns}                      0          913
  {root vegetables, yogurt}                        911          921
  {root vegetables}                                921          921
  {tropical fruit, root vegetables}                903            0
  {tropical fruit, yogurt}                         902          921
  {tropical fruit}                                 921          921
  {whipped/sour cream}                             921          921
  {yogurt, rolls/buns}                               0          910
  {yogurt}                                         921          921

These are the rules that appeared in more than 90% of the bootstrapped samples. The first row indicates two rules. Since the antecedent is an empty set, these rules simply reflect how often the consequents were bought overall: “regardless of what else was bought, other vegetables were bought” and “regardless of what else was bought, whole milk was bought.” To avoid rules with an empty antecedent set, simply set the minimum rule length to 2. For the next row, two more rules are given: “if butter was bought, then other vegetables were bought” and “if butter was bought, then whole milk was bought.”
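For instance, a single (non-bootstrapped) run with the empty-antecedent rules excluded might look like the sketch below, which reuses the rhs and lhs objects defined earlier:


# Setting minlen = 2 requires at least one item in the antecedent
rules_minlen2 <- apriori(Groceries,
                         parameter = list(supp = .01, conf = .2,
                                          minlen = 2, maxlen = 20),
                         appearance = list(rhs = rhs, lhs = lhs))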

Since we have bootstrapped samples, we can also calculate the mean and a 95% confidence interval for any metric of interest. In the example below, we do this for confidence, but the variable can be swapped to compute the same summaries for lift or support.


library(tidyr) # for unnest_wider()
library(Hmisc) # for smean.cl.normal()

results_final %>%
  group_by(rhs, lhs) %>% 
  summarise(
    N = n(), 
    ci = list(smean.cl.normal(confidence))
  ) %>% 
  unnest_wider(ci) 


# A tibble: 29 × 6
# Groups:   rhs [2]
   rhs                lhs                               N  Mean Lower Upper
   <chr>              <chr>                         <int> <dbl> <dbl> <dbl>
 1 {other vegetables} {butter}                        921 0.361 0.360 0.362
 2 {other vegetables} {citrus fruit}                  921 0.349 0.348 0.350
 3 {other vegetables} {curd}                          921 0.322 0.321 0.324
 4 {other vegetables} {domestic eggs}                 921 0.351 0.349 0.352
 5 {other vegetables} {pip fruit}                     921 0.345 0.344 0.346
 6 {other vegetables} {rolls/buns}                    921 0.232 0.231 0.232
 7 {other vegetables} {root vegetables, rolls/buns}   904 0.502 0.500 0.505
 8 {other vegetables} {root vegetables, yogurt}       914 0.500 0.498 0.502
 9 {other vegetables} {root vegetables}               921 0.434 0.433 0.435
10 {other vegetables} {tropical fruit, yogurt}        904 0.420 0.418 0.421
# ℹ 19 more rows

Here only the first 10 rows are displayed. To see all rows, simply add %>% print(n = Inf). For the first rule, “if butter was bought, then so were other vegetables,” the mean confidence value is the average, across bootstrapped samples, of the proportion of butter-containing transactions that also contained other vegetables. The confidence interval can be interpreted as follows: we are 95% confident that the true confidence of this rule falls between 36.0% and 36.2%.
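As noted above, the same summary can be computed for any of the other metrics simply by swapping the variable; for example, for lift:


# Mean and 95% CI for lift across bootstrapped samples
results_final %>%
  group_by(rhs, lhs) %>% 
  summarise(
    N = n(), 
    ci = list(smean.cl.normal(lift))
  ) %>% 
  unnest_wider(ci)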


References

  • Hahsler, M., Buchta, C., Gruen, B., & Hornik, K. (2023). arules: Mining Association Rules and Frequent Itemsets. R package version 1.7-7. https://CRAN.R-project.org/package=arules
  • Izrailev, S. (2023). tictoc: Functions for Timing R Scripts, as Well as Implementations of “Stack” and “StackList” Structures. R package version 1.2. https://CRAN.R-project.org/package=tictoc
  • R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Laura Jamison
StatLab Associate
University of Virginia Library
May 20, 2024


For questions or clarifications regarding this article, contact statlab@virginia.edu.

View the entire collection of UVA Library StatLab articles, or learn how to cite.