Track: Data Mining: Algorithms
Efficient Similarity Joins for Near Duplicate Detection
- Chuan Xiao (University of New South Wales)
- Wei Wang (University of New South Wales)
- Xuemin Lin (University of New South Wales)
- Jeffrey Xu Yu (Chinese University of Hong Kong)
With the increasing amount of data and the need to integrate data from multiple sources, finding near duplicate records efficiently is a challenging problem. In this paper, we focus on efficient algorithms for finding all pairs of records whose similarity is above a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques that exploit the ordering information; integrated into the existing methods, they drastically reduce the number of candidate pairs and hence improve efficiency. Experimental results show that our proposed algorithms achieve speed-ups of 2.6x to 5x over previous algorithms on several real datasets and provide alternative solutions to the near duplicate Web page detection problem.
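For readers unfamiliar with the prefix filtering principle mentioned above, the following is a minimal sketch of the baseline idea only; it does not implement the new filtering techniques proposed in the paper. It assumes Jaccard similarity over token sets and a global token order by increasing frequency, neither of which is stated in the abstract, and the names `jaccard` and `prefix_filter_join` are hypothetical. Under those assumptions, two records can reach the threshold only if their short sorted prefixes share at least one token, so only such pairs need to be verified.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token collections (assumed similarity function)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_filter_join(records, threshold):
    """Return (i, j, sim) with i < j and Jaccard similarity >= threshold.

    Illustrative sketch of prefix filtering; threshold must lie in (0, 1].
    """
    # Global order (an assumption): rarer tokens first, ties broken by token.
    freq = Counter(tok for rec in records for tok in set(rec))
    canon = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of records with that token in their prefix
    results = []
    for rid, toks in enumerate(canon):
        if not toks:
            continue  # empty records are skipped in this sketch
        # Prefix length |x| - ceil(t * |x|) + 1; the epsilon guards against
        # floating-point round-up that would shorten the prefix too much.
        plen = len(toks) - math.ceil(threshold * len(toks) - 1e-9) + 1
        prefix = toks[:plen]

        # Candidates are earlier records sharing at least one prefix token.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verify each candidate with the exact similarity.
        for cid in sorted(candidates):
            sim = jaccard(toks, canon[cid])
            if sim >= threshold:
                results.append((cid, rid, sim))

        # Index the current record's prefix for later probes.
        for tok in prefix:
            index[tok].append(rid)
    return results


if __name__ == "__main__":
    recs = [["a", "b", "c", "d"], ["a", "b", "c", "e"],
            ["x", "y", "z"], ["a", "b", "c", "d"]]
    for i, j, sim in prefix_filter_join(recs, 0.6):
        print(i, j, round(sim, 2))
```

Ordering tokens from rare to frequent keeps the inverted lists probed by each prefix short, which is what makes the candidate set small relative to comparing all pairs.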