Presently, there are two major approaches to duplicate record detection. Research in databases emphasizes relatively simple and fast duplicate detection techniques that can be applied to databases with millions of records; such techniques typically do not rely on the existence of training data and favor efficiency over effectiveness. Research in machine learning and statistics, on the other hand, aims to develop more sophisticated matching techniques that rely on probabilistic models. An interesting direction for future research is to develop techniques that combine the best of both worlds. Most of the duplicate detection systems available today offer various algorithmic approaches for speeding up the duplicate detection process. The varying nature of the duplicate detection task also calls for adaptive methods that detect different duplication patterns and adjust themselves automatically over time.
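To make the contrast concrete, the following is a minimal sketch (not taken from any particular system) of the two styles: a fast, fixed-threshold similarity rule in the spirit of database-oriented techniques, and a toy probabilistic log-likelihood score in the spirit of Fellegi-Sunter-style models. The field names, weights, and thresholds are illustrative assumptions.

```python
# Illustrative sketch: heuristic vs. probabilistic record matching.
# All parameter values (thresholds, m/u probabilities) are assumptions.
from difflib import SequenceMatcher
import math

def field_similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def heuristic_match(rec1: dict, rec2: dict, threshold: float = 0.85) -> bool:
    """Database-style rule: average field similarity above a fixed threshold."""
    fields = rec1.keys() & rec2.keys()
    avg = sum(field_similarity(rec1[f], rec2[f]) for f in fields) / len(fields)
    return avg >= threshold

def probabilistic_score(rec1: dict, rec2: dict, m: float = 0.9, u: float = 0.1) -> float:
    """Toy Fellegi-Sunter-style score: each agreeing field contributes
    log(m/u), each disagreeing field contributes log((1-m)/(1-u))."""
    score = 0.0
    for f in rec1.keys() & rec2.keys():
        agree = field_similarity(rec1[f], rec2[f]) >= 0.9
        score += math.log(m / u) if agree else math.log((1 - m) / (1 - u))
    return score

if __name__ == "__main__":
    a = {"name": "John Smith", "city": "New York"}
    b = {"name": "Jon Smith",  "city": "New York"}
    print(heuristic_match(a, b))        # fast rule-of-thumb decision
    print(probabilistic_score(a, b))    # higher score = more likely a match
```

A hybrid system in the spirit of the combined approach above would use the cheap heuristic to prune clearly non-matching pairs and reserve the probabilistic model for the ambiguous ones.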
Finally, a huge amount of structured information is now derived from unstructured text and from the web. This information is typically imprecise and noisy, and duplicate record detection techniques are essential for improving the quality of the extracted data. The increasing popularity of information extraction techniques will make this issue more common in the future, highlighting the need for robust and scalable solutions. This only adds to the evidence that more research is needed in duplicate record detection and, more generally, in data cleaning and information quality. We conclude with coverage of existing tools and a brief discussion of the open problems in duplicate record detection.
The removal of duplicate entries and the transformation of data into a suitable format fall under data preprocessing (i.e., data cleaning methods).
There are several data cleaning methods, and existing tools (such as WEKA and RapidMiner) allow users to remove duplicate entries through filters.
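As an analogous illustration of such a filter (not the WEKA or RapidMiner API), the sketch below removes exact duplicate rows from a CSV file as a preprocessing step; the file names and the normalization rule are assumptions.

```python
# Minimal sketch of exact-duplicate removal as a data cleaning step.
# File names and the lowercase/strip normalization are illustrative.
import csv

def remove_exact_duplicates(in_path: str, out_path: str) -> int:
    """Copy a CSV file, keeping only the first occurrence of each row.
    Returns the number of duplicate rows dropped."""
    seen = set()
    removed = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            key = tuple(cell.strip().lower() for cell in row)  # normalize before comparing
            if key in seen:
                removed += 1
                continue
            seen.add(key)
            writer.writerow(row)
    return removed

# Usage (illustrative file names):
# dropped = remove_exact_duplicates("customers.csv", "customers_clean.csv")
```

Tools such as WEKA and RapidMiner apply the same idea to loaded datasets rather than raw files, typically exposing it as a configurable filter or operator.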
Data mining techniques can then be used to extract interesting and valuable information from the cleaned dataset.