Mining top-k high utility itemsets

CW Wu, BE Shie, VS Tseng, PS Yu - Proceedings of the 18th ACM SIGKDD international conference on Knowledge …, 2012 - dl.acm.org
Mining high utility itemsets from databases is an emerging topic in data mining. It refers to the discovery of itemsets whose utilities exceed a user-specified minimum utility threshold, min_util. Although several studies have been carried out on this topic, setting an appropriate minimum utility threshold is a difficult problem for users. If min_util is set too low, too many high utility itemsets will be generated, which may cause the mining algorithms to become inefficient or even run out of memory. On the other hand, if min_util is set too high, no high utility itemset will be found. Setting an appropriate minimum utility threshold by trial and error is a tedious process for users. In this paper, we address this problem by proposing a new framework named top-k high utility itemset mining, where k is the desired number of high utility itemsets to be mined. An efficient algorithm named TKU (Top-K Utility itemsets mining) is proposed for mining such itemsets without setting min_util. Several features were designed in TKU to solve the new challenges raised by this problem, such as the absence of an anti-monotone property and the requirement of lossless results. Moreover, TKU incorporates several novel strategies for pruning the search space to achieve high efficiency. Results on real and synthetic datasets show that TKU has excellent performance and scalability.
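To make the problem statement concrete, the following is a minimal brute-force sketch of top-k high utility itemset mining. It is not the TKU algorithm and uses none of its pruning strategies. It assumes the usual definition of utility from the high utility itemset literature (purchase quantity times unit profit, summed over the transactions that contain the whole itemset), which the abstract does not spell out. The function names, the toy transactions, and the profit table are all hypothetical. The min-heap plays the role of an internal threshold that rises as better itemsets are found, so the user supplies k instead of min_util.

```python
import heapq
from itertools import combinations


def itemset_utility(itemset, transactions, unit_profit):
    """Utility of `itemset`: quantity * unit profit of its items, summed over
    every transaction that contains all items of the itemset (assumed definition)."""
    total = 0
    for tx in transactions:
        if all(item in tx for item in itemset):
            total += sum(tx[item] * unit_profit[item] for item in itemset)
    return total


def top_k_high_utility_itemsets(transactions, unit_profit, k):
    """Enumerate every itemset and keep the k with the highest utility.

    Exponential in the number of distinct items; for illustration only.
    """
    items = sorted({item for tx in transactions for item in tx})
    heap = []  # min-heap of (utility, itemset); heap[0] is the current k-th best
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            utility = itemset_utility(itemset, transactions, unit_profit)
            if utility <= 0:
                continue  # itemset never occurs, or contributes nothing
            if len(heap) < k:
                heapq.heappush(heap, (utility, itemset))
            elif utility > heap[0][0]:
                # Beats the current k-th best: the internal threshold rises.
                heapq.heapreplace(heap, (utility, itemset))
    return sorted(heap, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical toy data: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 3},
]
unit_profit = {"a": 5, "b": 2, "c": 1}

result = top_k_high_utility_itemsets(transactions, unit_profit, k=3)
print(result)  # [(20, ('a',)), (19, ('a', 'c')), (16, ('a', 'b'))]
```

In this toy data the itemset {b} has utility 6 while its superset {a, b} has utility 16, and {a} has utility 20, which is higher than {a, b}. Utility therefore neither grows nor shrinks consistently as items are added, which illustrates the absence of an anti-monotone property mentioned above and why exhaustive enumeration is the only safe strategy in such a naive sketch.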