/*
* Copyright (c) 2010-2021 Haifeng Li. All rights reserved.
*
* Smile is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Smile is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
 * along with Smile. If not, see <https://www.gnu.org/licenses/>.
*/
/**
* Frequent item set mining and association rule mining.
* Association rule learning is a popular and well researched method for
* discovering interesting relations between variables in large databases.
 * Let I = {i₁, i₂,..., iₙ} be a set of n
 * binary attributes called items. Let D = {t₁, t₂,..., tₘ}
* be a set of transactions called the database. Each transaction in D has a
* unique transaction ID and contains a subset of the items in I.
* An association rule is defined as an implication of the form X ⇒ Y
* where X, Y ⊆ I and X ∩ Y = Ø. The item sets X and Y are called
* antecedent (left-hand-side or LHS) and consequent (right-hand-side or RHS)
* of the rule, respectively. The support supp(X) of an item set X is defined as
* the proportion of transactions in the database which contain the item set.
* Note that the support of an association rule X ⇒ Y is supp(X ∪ Y).
 * The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
* Confidence can be interpreted as an estimate of the probability P(Y | X),
* the probability of finding the RHS of the rule in transactions under the
* condition that these transactions also contain the LHS.
*
* For example, the rule {onions, potatoes} ⇒ {burger} found in the sales
* data of a supermarket would indicate that if a customer buys onions and
 * potatoes together, he or she is likely to also buy burgers. Such information
* can be used as the basis for decisions about marketing activities such as
* promotional pricing or product placements.
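 *
 * Both measures follow directly from the definitions above. The sketch below
 * is purely illustrative (the transaction data and the helper method are
 * hypothetical, not part of this package); it computes supp and conf for this
 * very rule on a toy database:
 * <pre>{@code
 * import java.util.List;
 * import java.util.Set;
 *
 * public class SupportConfidenceDemo {
 *     // supp(X): proportion of transactions containing every item of X.
 *     static double support(List<Set<String>> db, Set<String> x) {
 *         return (double) db.stream().filter(t -> t.containsAll(x)).count() / db.size();
 *     }
 *
 *     public static void main(String[] args) {
 *         List<Set<String>> db = List.of(
 *             Set.of("onions", "potatoes", "burger"),
 *             Set.of("onions", "potatoes"),
 *             Set.of("potatoes", "milk"),
 *             Set.of("onions", "potatoes", "burger", "milk"));
 *
 *         Set<String> x  = Set.of("onions", "potatoes");           // LHS
 *         Set<String> xy = Set.of("onions", "potatoes", "burger"); // X and Y together
 *
 *         double suppXY = support(db, xy);        // 2/4 = 0.50
 *         double conf = suppXY / support(db, x);  // 0.50 / 0.75 = 0.67
 *         System.out.printf("supp = %.2f, conf = %.2f%n", suppXY, conf);
 *     }
 * }
 * }</pre>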
*
* Association rules are usually required to satisfy a user-specified minimum
* support and a user-specified minimum confidence at the same time. Association
* rule generation is usually split up into two separate steps:
*
* - First, minimum support is applied to find all frequent item sets
* in a database (i.e. frequent item set mining).
*
 * - Second, these frequent item sets and the minimum confidence constraint
* are used to form rules.
*
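 * The two steps can be sketched by brute force (the class below is
 * hypothetical and enumerates all 2ⁿ - 1 item sets, so it is illustrative
 * only): step one keeps every item set whose support count meets the
 * threshold; step two emits every rule X ⇒ Y whose confidence meets the
 * threshold.
 * <pre>{@code
 * import java.util.ArrayList;
 * import java.util.HashMap;
 * import java.util.List;
 * import java.util.Map;
 *
 * public class NaiveARM {
 *     public static void main(String[] args) {
 *         int[][] db = {{1, 2, 3}, {1, 2}, {2, 4}, {1, 2, 3, 4}}; // item IDs per transaction
 *         int minSupport = 2;        // absolute count
 *         double minConfidence = 0.6;
 *         int n = 4;                 // items are 1..n
 *
 *         // Step 1: frequent item set mining by exhaustive enumeration.
 *         // Each bitmask encodes an item set; bit i-1 stands for item i.
 *         Map<Integer, Integer> freq = new HashMap<>();
 *         for (int mask = 1; mask < (1 << n); mask++) {
 *             int count = 0;
 *             for (int[] t : db) {
 *                 int tmask = 0;
 *                 for (int item : t) tmask |= 1 << (item - 1);
 *                 if ((tmask & mask) == mask) count++;
 *             }
 *             if (count >= minSupport) freq.put(mask, count);
 *         }
 *
 *         // Step 2: rule generation. For each frequent item set Z and each
 *         // nonempty proper subset X of Z, emit X => Z \ X if confident enough.
 *         for (Map.Entry<Integer, Integer> e : freq.entrySet()) {
 *             int z = e.getKey();
 *             for (int x = (z - 1) & z; x > 0; x = (x - 1) & z) {
 *                 double conf = (double) e.getValue() / freq.get(x);
 *                 if (conf >= minConfidence)
 *                     System.out.printf("%s => %s (conf %.2f)%n",
 *                         items(x), items(z & ~x), conf);
 *             }
 *         }
 *     }
 *
 *     static List<Integer> items(int mask) {
 *         List<Integer> s = new ArrayList<>();
 *         for (int i = 1; mask != 0; i++, mask >>= 1)
 *             if ((mask & 1) != 0) s.add(i);
 *         return s;
 *     }
 * }
 * }</pre>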
*
* Finding all frequent item sets in a database is difficult since it involves
* searching all possible item sets (item combinations). The set of possible
 * item sets is the power set over I (the set of items) and has size 2ⁿ - 1
* (excluding the empty set which is not a valid item set). Although the size
* of the power set grows exponentially in the number of items n in I, efficient
 * search is possible using the downward-closure property of support
 * (also called anti-monotonicity), which guarantees that all subsets of a
 * frequent item set are also frequent and, conversely, that all supersets
 * of an infrequent item set must be infrequent.
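 *
 * This property enables candidate pruning: a candidate of size k can be
 * frequent only if every one of its (k-1)-subsets has already been found
 * frequent. A minimal check (hypothetical helper, for illustration only):
 * <pre>{@code
 * import java.util.HashSet;
 * import java.util.Set;
 *
 * public class DownwardClosure {
 *     // If any (k-1)-subset of the candidate is not among the known
 *     // frequent item sets, the candidate cannot be frequent.
 *     static boolean mayBeFrequent(Set<Integer> candidate, Set<Set<Integer>> frequent) {
 *         for (int item : candidate) {
 *             Set<Integer> subset = new HashSet<>(candidate);
 *             subset.remove(item);
 *             if (!subset.isEmpty() && !frequent.contains(subset)) return false;
 *         }
 *         return true;
 *     }
 *
 *     public static void main(String[] args) {
 *         Set<Set<Integer>> frequent = Set.of(
 *             Set.of(1), Set.of(2), Set.of(3), Set.of(1, 2), Set.of(2, 3));
 *         // false: the subset {1, 3} was never found frequent.
 *         System.out.println(mayBeFrequent(Set.of(1, 2, 3), frequent));
 *     }
 * }
 * }</pre>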
*
 * In practice, we may consider only the maximal frequent item sets, bypassing
 * all of their sub item sets. An item set is maximal frequent if none of its
 * immediate supersets is frequent.
*
 * For a maximal frequent item set, even though we know that all of its sub
 * item sets are frequent, we don't know their actual support, which is
 * essential for finding the association rules within the item sets.
* If the final goal is association rule mining, we would like to discover
* closed frequent item sets. An item set is closed if none of its immediate
* supersets has the same support as the item set.
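 *
 * Given the support of every frequent item set, both notions reduce to simple
 * predicates over immediate supersets. A hypothetical sketch (the map is
 * assumed to hold all frequent item sets with absolute support counts):
 * <pre>{@code
 * import java.util.HashSet;
 * import java.util.Map;
 * import java.util.Set;
 *
 * public class ClosedMaximal {
 *     // Maximal frequent: no immediate superset is frequent at all.
 *     static boolean isMaximal(Set<Integer> x, Map<Set<Integer>, Integer> support, Set<Integer> universe) {
 *         for (int item : universe) {
 *             if (x.contains(item)) continue;
 *             Set<Integer> superset = new HashSet<>(x);
 *             superset.add(item);
 *             if (support.containsKey(superset)) return false;
 *         }
 *         return true;
 *     }
 *
 *     // Closed: no immediate superset has the same support.
 *     static boolean isClosed(Set<Integer> x, Map<Set<Integer>, Integer> support, Set<Integer> universe) {
 *         for (int item : universe) {
 *             if (x.contains(item)) continue;
 *             Set<Integer> superset = new HashSet<>(x);
 *             superset.add(item);
 *             Integer s = support.get(superset);
 *             if (s != null && s.equals(support.get(x))) return false;
 *         }
 *         return true;
 *     }
 *
 *     public static void main(String[] args) {
 *         Set<Integer> universe = Set.of(1, 2, 3);
 *         Map<Set<Integer>, Integer> support = Map.of(
 *             Set.of(1), 3, Set.of(2), 4, Set.of(1, 2), 3, Set.of(2, 3), 2);
 *         // {1} is frequent but not closed: superset {1, 2} has the same support.
 *         System.out.println(isClosed(Set.of(1), support, universe));     // false
 *         System.out.println(isMaximal(Set.of(1, 2), support, universe)); // true
 *     }
 * }
 * }</pre>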
*
* Some well known algorithms of frequent item set mining are Apriori,
* Eclat and FP-Growth. Apriori is the best-known algorithm to mine association
 * rules. It uses a breadth-first search strategy to count the support of
* item sets and uses a candidate generation function which exploits the downward
* closure property of support. Eclat is a depth-first search algorithm using
* set intersection.
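 *
 * In the vertical database layout used by Eclat, each item maps to its
 * "tidset", the set of IDs of the transactions containing it, and the support
 * count of an item set is the size of the intersection of the tidsets of its
 * items. A hypothetical sketch of this step:
 * <pre>{@code
 * import java.util.HashSet;
 * import java.util.Map;
 * import java.util.Set;
 *
 * public class EclatStep {
 *     public static void main(String[] args) {
 *         // Vertical database: item -> IDs of the transactions containing it.
 *         Map<String, Set<Integer>> tidsets = Map.of(
 *             "onions",   Set.of(1, 2, 4),
 *             "potatoes", Set.of(1, 2, 3, 4),
 *             "burger",   Set.of(1, 4));
 *
 *         // supp({onions, potatoes}) is |tids(onions) ∩ tids(potatoes)|.
 *         Set<Integer> tids = new HashSet<>(tidsets.get("onions"));
 *         tids.retainAll(tidsets.get("potatoes"));
 *         System.out.println("support count: " + tids.size()); // 3
 *     }
 * }
 * }</pre>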
*
* FP-growth (frequent pattern growth) uses an extended prefix-tree (FP-tree)
* structure to store the database in a compressed form. FP-growth adopts a
* divide-and-conquer approach to decompose both the mining tasks and the
* databases. It uses a pattern fragment growth method to avoid the costly
* process of candidate generation and testing used by Apriori.
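 *
 * In this package, mining is typically driven through an FP-tree built at a
 * given minimum support, from which frequent item sets and association rules
 * are then streamed. The sketch below is only an assumption about the API
 * shape; the FPTree.of, FPGrowth.apply, and ARM.apply signatures should be
 * verified against the javadoc of the corresponding classes:
 * <pre>{@code
 * import java.util.stream.Stream;
 * import smile.association.ARM;
 * import smile.association.AssociationRule;
 * import smile.association.FPGrowth;
 * import smile.association.FPTree;
 * import smile.association.ItemSet;
 *
 * public class FPGrowthDemo {
 *     public static void main(String[] args) {
 *         // Transactions as arrays of integer item IDs.
 *         int[][] db = {{1, 2, 3}, {1, 2}, {2, 4}, {1, 2, 3, 4}};
 *
 *         // Assumed API: build an FP-tree at absolute minimum support 2,
 *         // then stream frequent item sets and rules at 60% confidence.
 *         FPTree tree = FPTree.of(2, db);
 *         Stream<ItemSet> patterns = FPGrowth.apply(tree);
 *         Stream<AssociationRule> rules = ARM.apply(0.6, tree);
 *
 *         patterns.forEach(System.out::println);
 *         rules.forEach(System.out::println);
 *     }
 * }
 * }</pre>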
*
*
 * References:
*
* - R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules Between Sets of Items in Large Databases, SIGMOD, 207-216, 1993.
* - Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. VLDB, 487-499, 1994.
* - Mohammed J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390, 2000.
* - Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. Mining frequent patterns without candidate generation. Data Mining and Knowledge Discovery 8:53-87, 2004.
*
*
* @author Haifeng Li
*/
package smile.association;