Proceedings of the 2010 SIAM International Conference on Data Mining

Mining Top-K Patterns from Binary Datasets in presence of Noise


The discovery of patterns in binary datasets has many applications, e.g. in electronic commerce, TCP/IP networking, and Web usage logging. Still, this is a very challenging task in many respects: patterns may or may not overlap, the data may contain noise, and only the most important patterns should be extracted.
In this paper we formalize the problem of discovering the Top-K patterns from binary datasets in the presence of noise as the minimization of a novel cost function. In accordance with the Minimum Description Length principle, the proposed cost function favors succinct pattern sets that approximately describe the input data.
We propose a greedy algorithm for the discovery of Patterns in Noisy Datasets, named PaNDa, and show that it outperforms related techniques on both synthetic and real-world data.
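
The abstract does not state the exact form of the cost function. As a minimal sketch of what an MDL-style cost for approximate pattern sets could look like, the Python below assumes that a pattern is a pair (set of item columns, set of transaction rows), that a pattern costs the number of items plus the number of transactions it names, and that every cell where the data disagrees with the union of the patterns counts as one unit of noise. The names `Pattern` and `description_cost` and the toy matrix are illustrative assumptions, not taken from the paper.

```python
from typing import List, Set, Tuple

# Hypothetical representation: (item columns, transaction rows).
Pattern = Tuple[Set[int], Set[int]]


def description_cost(data: List[List[int]], patterns: List[Pattern]) -> int:
    """Assumed MDL-style cost: size of the pattern set plus the noise it leaves.

    This is an illustrative cost, not necessarily the one defined in the paper.
    """
    n_rows = len(data)
    n_cols = len(data[0]) if data else 0

    # Cells set to 1 by at least one pattern (patterns may overlap).
    covered = [[0] * n_cols for _ in range(n_rows)]
    for items, rows in patterns:
        for r in rows:
            for c in items:
                covered[r][c] = 1

    # Cost of writing down the patterns themselves.
    pattern_cost = sum(len(items) + len(rows) for items, rows in patterns)

    # Noise: cells where the data and the pattern cover disagree,
    # in either direction (missed 1s and wrongly covered 0s).
    noise_cost = sum(
        1
        for r in range(n_rows)
        for c in range(n_cols)
        if data[r][c] != covered[r][c]
    )
    return pattern_cost + noise_cost


# Toy binary dataset: rows are transactions, columns are items.
data = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]

no_patterns = description_cost(data, [])
one_pattern = description_cost(data, [({0, 1, 2}, {0, 1, 2})])
print(no_patterns, one_pattern)
```

Under these assumptions, describing the toy data with no patterns costs one unit per 1-cell, while a single approximate pattern over the first three rows costs less even though it wrongly covers one 0-cell and misses one 1-cell. That is the trade-off the abstract describes: a succinct pattern set is preferred when it describes the data approximately at lower total cost.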

Published In

Proceedings of the 2010 SIAM International Conference on Data Mining
Pages: 165 - 176
Editors: Srinivasan Parthasarathy, The Ohio State University, Columbus, Ohio; Bing Liu, University of Illinois – Chicago, Chicago, Illinois; Bart Goethals, University of Antwerp, Antwerpen, Belgium; Jian Pei, Simon Fraser University, Burnaby, British Columbia, Canada; and Chandrika Kamath, Lawrence Livermore National Laboratory, Livermore, California
ISBN (Print): 978-0-898717-03-7
ISBN (Online): 978-1-61197-280-1


Published online: 18 December 2013



