Almost Optimal Algorithms for Detecting Near-Duplicates in Domain-Independent Big Data

In this chapter, we propose Merge-Filter Representative-based Clustering (Merge-Filter-RC), a general, domain-independent method for finding near-duplicate records within and across data sources. We then develop three classes of nearly optimal algorithms, collectively called the All-Three algorithms: constant threshold (CT), variable threshold (VT), and function threshold (FT). Merge-Filter-RC and All-Three form the backbone of this…
March 12, 2022