Association rule mining is a machine learning technique designed to uncover associations between categorical variables in data. It produces association rules that reflect how frequently categories are found together. For instance, a common application of association rule mining is the “Frequently Bought Together” section of the checkout page of an online store. By mining that store’s transaction data, items that are frequently bought together (e.g., shaving cream and razors) can be identified.

Association rules are structured as “if-then” statements. For instance, if a customer bought shaving cream, then they also bought razors. The “if-then” portions of association rules are often referred to as the antecedent (“if”) and the consequent (“then”). It is worth noting that association rules merely denote co-occurrence, not causal patterns. Based on the results of association rule mining, we cannot determine the cause of buying razors, merely that it is associated with purchasing shaving cream.

There are several algorithms available for association rule mining. We will focus on the Apriori algorithm for this article as it is one of the most common. The Apriori algorithm will search the data for the most frequently occurring sets of items. Typically, the antecedent can contain any number of items, while the consequent will contain only one item. To evaluate the importance of each rule, several common metrics can be evaluated. For this article we will focus on only three: support, confidence, and lift. *Support* is the proportion of transactions in the dataset that contain the itemset. *Confidence* is the proportion of transactions containing the antecedent that also contain the consequent. *Lift* represents how much more likely the consequent is to occur when the antecedent is present as compared to when it is absent; it is the ratio of the *confidence* of a rule to the frequency of the consequent in the whole dataset.
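To make these three metrics concrete, here is a small base-R sketch (the five toy transactions below are hypothetical, not from the data used later in this article) that computes support, confidence, and lift for the rule “if shaving cream, then razors” by direct counting:

```r
# Five hypothetical transactions, each a vector of purchased items
transactions <- list(
  c("shaving cream", "razors", "soap"),
  c("shaving cream", "razors"),
  c("soap", "shampoo"),
  c("shaving cream", "razors", "shampoo"),
  c("razors")
)
n <- length(transactions)

has_cream  <- sapply(transactions, function(t) "shaving cream" %in% t)
has_razors <- sapply(transactions, function(t) "razors" %in% t)

# Support: proportion of transactions containing both items
support <- sum(has_cream & has_razors) / n
# Confidence: proportion of antecedent transactions that also contain the consequent
confidence <- sum(has_cream & has_razors) / sum(has_cream)
# Lift: confidence divided by the overall frequency of the consequent
lift <- confidence / (sum(has_razors) / n)

c(support = support, confidence = confidence, lift = lift)
```

On these toy transactions the rule has a support of 0.6, a confidence of 1, and a lift of 1.25: razors appear in 80% of transactions overall, but in 100% of the transactions that contain shaving cream.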

## Association rule mining in R

To conduct association rule mining in R (v4.4.0; R Core Team 2023), we will use the `apriori()` function from the arules package (v1.7-7; Hahsler et al., 2023). Within the arules package there is a dataset called `Groceries`, which contains one month (30 days) of transactions from a real-world grocery store. There are 9,835 transactions across 169 categories. First, we will load the arules package. Then, we will load in the `Groceries` dataset. If you don’t already have the arules package, you can install it by running `install.packages("arules")` in the console.

```
library(arules) # Load the arules package
data(Groceries) # Read Groceries into Global Environment
```

Let’s take a look at the `Groceries` dataset. To inspect the dataset, let’s first call the `class()` function to show us the class of the object.

```
class(Groceries) # Inspect class of Groceries dataset
```

```
[1] "transactions"
attr(,"package")
[1] "arules"
```

The `Groceries` dataset is of class “transactions”. This class of object is specific to the arules package, and the functions in arules used to conduct association rule mining specifically take objects of class “transactions”. To convert an object to the class “transactions”, the `transactions()` function can be used. Objects of this class have a unique structure. First, they are S4 objects, meaning that to access anything contained within `Groceries` we need to use the “@” symbol. To look at the items contained in `Groceries`, we can access the `itemInfo` slot by running `Groceries@itemInfo`. This prints a data frame of all the items and their information contained in the `Groceries` object. Since it’s a data frame, we can wrap it in `head()` to look at the first six rows.

```
head(Groceries@itemInfo)
```

```
             labels  level2           level1
1       frankfurter sausage meat and sausage
2           sausage sausage meat and sausage
3        liver loaf sausage meat and sausage
4               ham sausage meat and sausage
5              meat sausage meat and sausage
6 finished products sausage meat and sausage
```

This shows us that there are three variables called `labels`, `level2`, and `level1`. The `labels` variable contains all of the items; `level2` shows what category those items are in, and `level1` shows the category that `level2` is in.
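As an aside, slot access with “@” is a general feature of S4 objects rather than anything specific to arules. A minimal base-R illustration (the `basket` class below is made up for demonstration):

```r
library(methods) # S4 classes live in the methods package (attached by default)

# A made-up S4 class with a single data-frame slot, analogous to itemInfo
setClass("basket", slots = c(itemInfo = "data.frame"))

b <- new("basket", itemInfo = data.frame(labels = c("ham", "milk")))

b@itemInfo # Slots are accessed with "@", just like Groceries@itemInfo
```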

The `apriori()` function in the arules package will allow us to apply the Apriori algorithm to these data. This function defaults to mining only rules with a minimum support value of 0.10, a minimum confidence value of 0.80, and a maximum of 10 items per rule. In order to prevent the function from running for too long, it will also time out checking for subsets after 5 seconds. Rule length refers to the entire itemset, including both the antecedent and the consequent. This function defaults to a minimum rule length of 1, meaning that a rule could be returned containing only the consequent (i.e., with an empty antecedent). Let’s run the first model using these preset values and save the model as an object called `rules_1`.

```
rules_1 <- apriori(Groceries)
```

```
Apriori
Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.8    0.1    1 none FALSE            TRUE       5     0.1      1
 maxlen target ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
Absolute minimum support count: 983
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [8 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
```

```
rules_1
```

```
set of 0 rules
```

Using the default settings, this returns 0 relevant rules. Let’s adjust some of the default settings and see how this affects the model. Within the `apriori()` function we will pass a named list to the argument called `parameter`. In this list, we will adjust the minimum support value with an element called `supp`, the minimum confidence value with an element called `conf`, and the maximum rule length with an element called `maxlen`. We will save this model to an object called `rules_2`.

```
rules_2 <- apriori(Groceries, parameter = list(supp = .01,   # Minimum Support value
                                               conf = .5,    # Minimum Confidence value
                                               maxlen = 20)) # Maximum rule length
```

```
Apriori
Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5    0.01      1
 maxlen target ext
     20  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
Absolute minimum support count: 98
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [88 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [15 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
```

```
length(rules_2)
```

```
[1] 15
```

This returns a set of 15 rules. We can use the `inspect()` function to return the rules listed in this object.

```
rules_2_summary <- head(inspect(rules_2))
rules_2_summary
```

```
    lhs                                       rhs          support   
[1] {curd, yogurt}                         => {whole milk} 0.01006609
[2] {other vegetables, butter}             => {whole milk} 0.01148958
[3] {other vegetables, domestic eggs}      => {whole milk} 0.01230300
[4] {yogurt, whipped/sour cream}           => {whole milk} 0.01087951
[5] {other vegetables, whipped/sour cream} => {whole milk} 0.01464159
[6] {pip fruit, other vegetables}          => {whole milk} 0.01352313
    confidence coverage   lift     count
[1] 0.5823529  0.01728521 2.279125    99
[2] 0.5736041  0.02003050 2.244885   113
[3] 0.5525114  0.02226741 2.162336   121
[4] 0.5245098  0.02074225 2.052747   107
[5] 0.5070423  0.02887646 1.984385   144
[6] 0.5175097  0.02613116 2.025351   133
```

In the output of `rules_2` there are columns labeled `rhs` (right hand side) and `lhs` (left hand side). These refer to the antecedent (`lhs`) and consequent (`rhs`). The first rule can be read as “if curd and yogurt were purchased, then so was whole milk.”
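As a quick sanity check on the definition of lift given earlier, we can reproduce the lift of the first rule from its printed confidence and the overall frequency of whole milk in `Groceries` (whole milk appears in 2,513 of the 9,835 transactions):

```r
# Confidence of rule [1], as printed by inspect()
conf_rule1 <- 0.5823529
# Whole milk appears in 2,513 of the 9,835 transactions
freq_milk <- 2513 / 9835
# Lift is the ratio of the rule's confidence to the consequent's frequency
lift_rule1 <- conf_rule1 / freq_milk
round(lift_rule1, 6)
```

The result, 2.279125, matches the lift printed for that rule.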

The way we specified the `apriori()` function for `rules_2` allowed it to search through all possible antecedent and consequent combinations. If we want to, we can also define which items we would like to have in the antecedent and which items should be in the consequent. A list of possible values can be given for consideration in the antecedent; the algorithm will search through possible combinations, ranging from no items to all items. A list of possible values can also be given for consideration in the consequent; however, only one item can be present in the consequent within a single rule. The algorithm will produce a set of rules in which each consequent contains one of the items from the provided list and each antecedent contains some combination of items from its list. In the `rules_3` object, we will use the `appearance` argument of the `apriori()` function to set these values. The `appearance` argument takes a list of `rhs` and `lhs` values.

```
rules_3 <- apriori(Groceries,
                   # Consequent
                   appearance = list(rhs = c("whole milk", "other vegetables"),
                                     # Antecedent
                                     lhs = c("yogurt", "whipped/sour cream", "tropical fruit",
                                             "root vegetables", "citrus fruit", "domestic eggs",
                                             "pip fruit", "butter", "curd", "rolls/buns")),
                   parameter = list(supp = .01,   # Minimum Support value
                                    conf = .2,    # Minimum Confidence value
                                    maxlen = 20)) # Maximum rule length
```

```
Apriori
Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.2    0.1    1 none FALSE            TRUE       5    0.01      1
 maxlen target ext
     20  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
Absolute minimum support count: 98
set item appearances ...[12 item(s)] done [0.00s].
set transactions ...[12 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [12 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [37 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
```

```
rules_3_summary <- head(inspect(rules_3))
rules_3_summary
```

```
    lhs                 rhs                support    confidence coverage  
[1] {}               => {whole milk}       0.25551601 0.2555160  1.00000000
[2] {curd}           => {other vegetables} 0.01718353 0.3225191  0.05327911
[3] {curd}           => {whole milk}       0.02613116 0.4904580  0.05327911
[4] {butter}         => {other vegetables} 0.02003050 0.3614679  0.05541434
[5] {butter}         => {whole milk}       0.02755465 0.4972477  0.05541434
[6] {domestic eggs}  => {other vegetables} 0.02226741 0.3509615  0.06344687
    lift     count
[1] 1.000000  2513
[2] 1.666829   169
[3] 1.919481   257
[4] 1.868122   197
[5] 1.946053   271
[6] 1.813824   219
```

It is possible for redundant rules to be produced. To find all such rules, we can use the `is.subset()` function from the arules package.

```
rules_3.sorted <- sort(rules_3, by = "lift") # Sort by lift
# Identify redundant rules
subset.matrix <- is.subset(rules_3.sorted, rules_3.sorted)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- FALSE
redundant <- colSums(subset.matrix) >= 1
# Which rules are redundant?
which(redundant)
```

```
named integer(0)
```

```
# Remove redundant rules
rules_3_pruned <- rules_3.sorted[!redundant] # Remove redundant rules
rules_3_pruned
```

```
set of 37 rules
```

No rules were returned as redundant.
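The masking logic above is worth unpacking: `is.subset()` returns a rule-by-rule logical matrix with a `TRUE` at position (i, j) when rule i’s itemset is contained in rule j’s. After blanking the diagonal and lower triangle, any column that still contains a `TRUE` belongs to a rule whose itemset contains that of a higher-ranked (higher-lift) rule. The same idea can be sketched on a hypothetical 3 × 3 matrix in base R:

```r
# Hypothetical subset matrix for three rules already sorted by lift
m <- matrix(FALSE, nrow = 3, ncol = 3,
            dimnames = list(paste0("rule", 1:3), paste0("rule", 1:3)))
diag(m) <- TRUE             # every rule is a subset of itself
m["rule1", "rule3"] <- TRUE # suppose rule 1's itemset is contained in rule 3's

m[lower.tri(m, diag = TRUE)] <- FALSE # ignore self-matches and lower-ranked pairs
redundant <- colSums(m) >= 1          # any remaining TRUE flags a redundant rule
names(which(redundant))
```

Only `rule3` is flagged, because a higher-ranked rule’s itemset is contained in it; subsetting with `[!redundant]` would then drop it.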

## Bootstrapping

Applying association rule mining to a single sample provides insight into the association rules and pairings present in that particular sample. However, in many analyses, we are more interested in generalizing beyond the sample at hand. We can apply a bootstrapping approach to assess the reproducibility of the rules we uncovered. For this example, we will generate 1,000 bootstrapped samples by randomly sampling (with replacement) from the original dataset. Then, on each bootstrapped sample we will apply the Apriori algorithm. We will investigate the association rules found in each sample and retain only those rules that appeared in 90% or more of the bootstrapped samples. Association rule strength metrics (confidence, lift, and support) can then be computed on each bootstrapped sample. From this, we can calculate the mean and 95% confidence interval for these metrics.
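The resampling step at the heart of this plan is just `sample()` with `replace = TRUE`. As a minimal base-R sketch of the idea, here we bootstrap the mean of a toy numeric vector (standing in for a rule metric) rather than a set of rules:

```r
set.seed(1)
x <- c(2, 4, 4, 5, 7, 9) # toy data standing in for a metric of interest
# Resample indices with replacement, recompute the statistic each time
boot_means <- replicate(1000, mean(x[sample(seq_along(x), replace = TRUE)]))
mean(boot_means)                      # bootstrap estimate of the mean
quantile(boot_means, c(0.025, 0.975)) # percentile-style 95% interval
```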

### Bootstrapping the `apriori()` function

To bootstrap the `apriori()` function, we can set up a function called `boot_apriori()`. This function will accept seven arguments: `data` (dataset), `rhs` (consequent), `lhs` (antecedent), `confidence` (minimum confidence), `minlen` (minimum rule length), `maxlen` (maximum rule length), and `supp` (minimum support value). First, the function will produce a bootstrapped sample using an embedded function called `bootstrap_sample()`, which samples transactions with replacement to create a new sample containing only those cases. The `apriori()` function is then applied to the resampled data. Finally, the rules produced by the Apriori algorithm are assessed for redundancy, and the pruned set of rules is returned.

```
boot_apriori <- function(data, rhs, lhs, confidence, minlen, maxlen, supp) {
  # Create bootstrap samples using an embedded function
  bootstrap_sample <- function(data) { # This function only uses the given data
    sampled_ids <- sample(seq_along(data), replace = TRUE) # Sample with replacement
    sampled_transactions <- data[sampled_ids] # Create new data with only those cases
    return(sampled_transactions)
  }
  # Generate a bootstrap sample of the data
  boot_data <- bootstrap_sample(data = data)
  # Apply the Apriori algorithm with the given parameters
  boot_rules_eh <- apriori(boot_data,
                           parameter = list(confidence = confidence, minlen = minlen,
                                            maxlen = maxlen, supp = supp),
                           appearance = list(lhs = lhs,  # antecedent (if)
                                             rhs = rhs)) # consequent (then)
  # Find and remove redundant rules
  boot_rules_eh.sorted <- sort(boot_rules_eh, by = "lift")
  boot_subset.matrix <- is.subset(boot_rules_eh.sorted, boot_rules_eh.sorted)
  boot_subset.matrix[lower.tri(boot_subset.matrix, diag = TRUE)] <- FALSE
  if (length(boot_subset.matrix) > 0) {
    boot_redundant <- colSums(boot_subset.matrix) >= 1
    boot_rules_eh.pruned <- boot_rules_eh.sorted[!boot_redundant]
  } else {
    boot_rules_eh.pruned <- boot_rules_eh
  }
  return(boot_rules_eh.pruned)
}
```

We could apply this function 1,000 times using a process such as a `for()` loop or the `replicate()` function. However, this could take a very long time to run. Let’s see how long one iteration takes to run using the `tic()` and `toc()` functions from the tictoc package (v1.2.1; Izrailev, 2023).

```
# Load the tictoc package
library(tictoc)
```

```
tic()        # Start the timer
set.seed(15) # Set seed
test <- boot_apriori(Groceries,
                     # Consequent
                     rhs = c("whole milk", "other vegetables"),
                     # Antecedent
                     lhs = c("yogurt", "whipped/sour cream", "tropical fruit",
                             "root vegetables", "citrus fruit", "domestic eggs",
                             "pip fruit", "butter", "curd", "rolls/buns"),
                     supp = .01,      # Minimum Support value
                     confidence = .2, # Minimum Confidence value
                     minlen = 1,      # Minimum rule length
                     maxlen = 20)     # Maximum rule length
```

```
Apriori
Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.2    0.1    1 none FALSE            TRUE       5    0.01      1
 maxlen target ext
     20  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
Absolute minimum support count: 98
set item appearances ...[12 item(s)] done [0.00s].
set transactions ...[12 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [12 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [37 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
```

```
toc() # Stop current timer
```

```
0.03 sec elapsed
```

```
test_summary <- head(inspect(test))
test_summary
```

```
    lhs                                  rhs                support   
[1] {tropical fruit, root vegetables} => {other vegetables} 0.01148958
[2] {yogurt, whipped/sour cream}      => {other vegetables} 0.01057448
[3] {root vegetables, yogurt}         => {other vegetables} 0.01220132
[4] {butter, yogurt}                  => {whole milk}       0.01016777
[5] {root vegetables, rolls/buns}     => {other vegetables} 0.01016777
[6] {tropical fruit, yogurt}          => {other vegetables} 0.01443823
    confidence coverage   lift     count
[1] 0.5765306  0.01992883 3.000094   113
[2] 0.5200000  0.02033554 2.705926   104
[3] 0.5150215  0.02369090 2.680019   120
[4] 0.6451613  0.01576004 2.490252   100
[5] 0.4784689  0.02125064 2.489810   100
[6] 0.4565916  0.03162176 2.375968   142
```

In this test we found 37 rules. This process did not take very long; however, to do this 1,000 times would take much longer. To speed this process up, let’s run this function using parallel processing.

### Parallel processing

Parallel processing allows a large task to be split into smaller tasks that are distributed across a machine’s CPU cores. The parallel package is included with R and contains a set of functions that can easily parallelize most processes. Before parallelizing this process, we need to save all information necessary for the `boot_apriori()` function into the Global Environment, including all arguments being passed to `boot_apriori()`, so that it can be exported to the workers with the `clusterExport()` function. Then, we will define the number of clusters to use with the `makeCluster()` and `detectCores()` functions. Finally, we can use the `parLapply()` function to run the `boot_apriori()` function 1,000 times and produce a list called `models_1`.

```
# Load the parallel package
library(parallel)
tic()
nreps <- 1000 # Repeat the process 1000 times
# Save all arguments as objects
rhs <- c("whole milk", "other vegetables")
lhs <- c("yogurt", "whipped/sour cream", "tropical fruit", "root vegetables",
         "citrus fruit", "domestic eggs", "pip fruit", "butter", "curd", "rolls/buns")
confidence <- .05
minlen <- 1
maxlen <- 20
supp <- .01
set.seed(55)
# Define the number of clusters
cl <- makeCluster(detectCores() - 1)
# Export objects to the workers
clusterExport(cl, varlist = c("Groceries", "rhs", "lhs", "supp", "boot_apriori", "nreps",
                              "confidence", "minlen", "maxlen"))
models_1 <- parLapply(cl, 1:nreps, function(x)
  boot_apriori(Groceries, rhs, lhs, confidence, minlen, maxlen, supp))
# Stop the cluster
stopCluster(cl)
toc()
```

```
2.783 sec elapsed
```

By bootstrapping association rule mining, we can gain insight into which rules would replicate over and over again if we were to keep resampling from the population. This gives a better idea of whether a rule is an association that happens to appear in the given data, or is one that actually exists in the population. To investigate that in `models_1`, we can look to see which rules exist in 90% or more of the bootstrapped samples.

```
library(dplyr)
# Organize the list of models produced by boot_apriori into a data.frame
rules <- lapply(models_1, inspect)
```

```
# Organize into data.frame
results <- do.call(rbind, rules)
# Count how many samples each rule appeared in
results_count <- results %>%
  select(lhs, rhs) %>%
  group_by(lhs, rhs) %>%
  summarise(boot = n()) %>%
  ungroup()
results <- merge(results, results_count)
# Keep only rules reproduced in more than 90% of the bootstrapped samples
results_final <- results[results$boot > .9 * nreps, ]
table(results_final$lhs, results_final$rhs)
```

```
                                    {other vegetables} {whole milk}
  {}                                               921          921
  {butter}                                         921          921
  {citrus fruit}                                   921          921
  {curd}                                           921          921
  {domestic eggs}                                  921          921
  {pip fruit}                                      921          921
  {rolls/buns}                                     921          921
  {root vegetables, rolls/buns}                      0          913
  {root vegetables, yogurt}                        911          921
  {root vegetables}                                921          921
  {tropical fruit, root vegetables}                903            0
  {tropical fruit, yogurt}                         902          921
  {tropical fruit}                                 921          921
  {whipped/sour cream}                             921          921
  {yogurt, rolls/buns}                               0          910
  {yogurt}                                         921          921
```

These are the rules that appeared in more than 90% of the bootstrapped samples. The first row indicates two rules. Since the antecedent is an empty set, we can interpret this as “no matter what other items were bought, other vegetables were bought” and “no matter what other items were bought, whole milk was bought.” To avoid getting rules with an empty antecedent set, simply set the minimum rule length to 2. For the next row, two more rules are given: “if butter was bought, then other vegetables were bought” and “if butter was bought, then whole milk was bought.”
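A useful check on those first two rules: for a rule with an empty antecedent, confidence reduces to the overall frequency of the consequent. Using the whole milk count of 2,513 (out of 9,835 transactions) from the earlier `rules_3` output:

```r
# Confidence of {} => {whole milk} is just whole milk's overall frequency
freq_whole_milk <- 2513 / 9835 # whole milk count from the rules_3 output
round(freq_whole_milk, 7)
```

This matches the confidence (and support) of 0.2555160 printed for {} => {whole milk}.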

Since we have bootstrapped samples, we can also calculate the mean and 95% confidence intervals for any metric of interest. In the below example, we will do this for confidence, but the variable can be changed to look at the same parameters for lift or support.

```
library(tidyr) # for unnest_wider()
library(Hmisc) # for smean.cl.normal()
results_final %>%
  group_by(rhs, lhs) %>%
  summarise(
    N = n(),
    ci = list(smean.cl.normal(confidence))
  ) %>%
  unnest_wider(ci)
```

```
# A tibble: 29 × 6
# Groups:   rhs [2]
   rhs                lhs                               N  Mean Lower Upper
   <chr>              <chr>                         <int> <dbl> <dbl> <dbl>
 1 {other vegetables} {butter}                        921 0.361 0.360 0.362
 2 {other vegetables} {citrus fruit}                  921 0.349 0.348 0.350
 3 {other vegetables} {curd}                          921 0.322 0.321 0.324
 4 {other vegetables} {domestic eggs}                 921 0.351 0.349 0.352
 5 {other vegetables} {pip fruit}                     921 0.345 0.344 0.346
 6 {other vegetables} {rolls/buns}                    921 0.232 0.231 0.232
 7 {other vegetables} {root vegetables, rolls/buns}   904 0.502 0.500 0.505
 8 {other vegetables} {root vegetables, yogurt}       914 0.500 0.498 0.502
 9 {other vegetables} {root vegetables}               921 0.434 0.433 0.435
10 {other vegetables} {tropical fruit, yogurt}        904 0.420 0.418 0.421
# ℹ 19 more rows
```

Here only the first 10 rows are displayed. To see all rows, simply add `%>% print(n = Inf)`. For the first rule, “if butter was bought, then so were other vegetables,” the mean confidence value is the average proportion, across bootstrapped samples, of butter-containing transactions that also contained other vegetables. The confidence interval can be interpreted as follows: we are 95% confident that the true proportion of butter-containing transactions that also contain other vegetables falls between 36.0% and 36.2%.
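For intuition, `smean.cl.normal()` computes a normal-theory interval: the mean plus or minus a t quantile times the standard error. A base-R sketch of that calculation on a hypothetical vector of bootstrapped confidence values (the numbers below are made up for illustration):

```r
# Hypothetical vector standing in for one rule's bootstrapped confidence values
conf_vals <- c(0.360, 0.361, 0.362, 0.361, 0.360)
n  <- length(conf_vals)
m  <- mean(conf_vals)
se <- sd(conf_vals) / sqrt(n)
half <- qt(0.975, df = n - 1) * se # t-based half-width of the 95% interval
c(Mean = m, Lower = m - half, Upper = m + half)
```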

## References

- Hahsler, M., Buchta, C., Gruen, B., & Hornik, K. (2023). *arules: Mining Association Rules and Frequent Itemsets*. R package version 1.7-6. https://CRAN.R-project.org/package=arules
- Izrailev, S. (2023). *tictoc: Functions for Timing R Scripts, as Well as Implementations of “Stack” and “StackList” Structures*. R package version 1.2. https://CRAN.R-project.org/package=tictoc
- R Core Team (2023). *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

*Laura Jamison*

*StatLab Associate*

*University of Virginia Library*

*May 20, 2024*

**For questions or clarifications regarding this article, contact statlab@virginia.edu.**

View the entire collection of UVA Library StatLab articles, or learn how to cite.