Abstract
In data mining, duplicate detection is an important step in data integration, and most state-of-the-art methods are based on supervised learning. Existing record matching methods include record linkage techniques, SVM, OSVM, PEBL, and Christen's method. In the Web database scenario, the records to match are highly query-dependent, so a pretrained approach is not applicable: the set of records in each query’s results is a biased subset of the full data set. Even if such an approach could be applied, the field weights should probably change for each new query, depending on the results returned, which makes supervised-learning-based methods even less applicable. This paper proposes an unsupervised, online approach, UDD, for detecting duplicates over the query results of multiple Web databases, combined with a similarity metric called SimString. UDD can use any string similarity calculation method; here, SimString, a similarity (relevance) metric used for efficient Web mining, is used. Two classifiers, WCSS and SVM, are used cooperatively in the convergence step of record matching to iteratively identify the duplicate pairs among all potential duplicate pairs.
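As a rough illustration of the weighted field-similarity idea behind this kind of record matching, the following minimal Python sketch scores a pair of records by a weighted sum of per-field string similarities. It is not the UDD implementation and does not use the SimString library: the character n-gram cosine measure, the function names, the field weights, and the 0.8 threshold are all assumptions chosen for illustration, and the iterative cooperation between the two classifiers is omitted.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Return the multiset of character n-grams of a lowercased, space-padded string."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_cosine(a, b, n=3):
    """Cosine similarity between the n-gram multisets of two strings, in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * grams_b[gram] for gram, count in grams_a.items())
    norm_a = sqrt(sum(c * c for c in grams_a.values()))
    norm_b = sqrt(sum(c * c for c in grams_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_similarity(rec_a, rec_b, weights):
    """Weighted sum of per-field similarities (weights are assumed to sum to 1)."""
    return sum(w * ngram_cosine(rec_a[field], rec_b[field])
               for field, w in weights.items())


def is_duplicate(rec_a, rec_b, weights, threshold=0.8):
    """Flag a record pair as a duplicate when its weighted similarity reaches the threshold."""
    return weighted_similarity(rec_a, rec_b, weights) >= threshold


# Hypothetical query-result records and field weights, for illustration only.
weights = {"title": 0.7, "author": 0.3}
r1 = {"title": "Data Mining", "author": "J. Han"}
r2 = {"title": "Data Mining", "author": "J. Han"}
r3 = {"title": "xyz", "author": "qqq"}

print(is_duplicate(r1, r2, weights))
print(is_duplicate(r1, r3, weights))
```

In an unsupervised setting such as the one described above, the field weights would be re-estimated for each query's results instead of being fixed in advance as they are in this sketch.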