Instead of doing a simulation, I need a dataset with hundreds or thousands of records distributed over multiple data sources. I need it for data integration purposes where the same entity may have multiple records in different data sources.
Very good question. I'm looking for such datasets myself. I have tried making them, but the main problem is always providing ground truth: an indication of which records are duplicates is doable, but providing the best fusion result is really difficult. Anyway, the closest I have come is the datasets you can find here
I can see that some of the links provided still point to simulated datasets. I think it would be equally appropriate to simulate a dataset and maybe add Gaussian noise to make it a plausible stand-in for a real-life situation, then go ahead and test whatever algorithms you wish to test. You can reduce doubt about your results by carrying out a consistency analysis. I believe that way you can disseminate your findings with some degree of confidence.
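For concreteness, here is a minimal sketch of what "simulate a dataset and add Gaussian noise" could look like. The column names, sizes and noise scale are purely illustrative assumptions, not taken from any real benchmark.

    # Minimal sketch: simulate a small table with numeric attributes and
    # perturb them with zero-mean Gaussian noise. The untouched "id" column
    # serves as the ground-truth key for later evaluation.
    import numpy as np

    rng = np.random.default_rng(42)
    n = 1000

    records = {
        "id": np.arange(n),
        "age": rng.integers(18, 90, size=n).astype(float),
        "income": rng.lognormal(mean=10.5, sigma=0.6, size=n),
    }

    def add_gaussian_noise(values, relative_sd=0.05):
        """Add Gaussian noise scaled to the column's standard deviation."""
        sd = relative_sd * np.std(values)
        return values + rng.normal(0.0, sd, size=values.shape)

    noisy = {
        "id": records["id"],                       # keep the ground truth intact
        "age": add_gaussian_noise(records["age"]),
        "income": add_gaussian_noise(records["income"]),
    }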
Hi Cliff, a problem with making such a simulated dataset is that not much is known about how to introduce "noise" in a realistic fashion. Most researchers agree that the size of a cluster of duplicates is Zipf distributed, and some have proposed models for introducing typographical errors, but beyond that it is guesswork what a realistic error model looks like: think of abbreviations, multi-valued attributes, subjectiveness (e.g. the musical genre of a CD)... And of course, if there is no realistic error model, a simulated dataset tends to be biased toward working well for the very algorithm you want to test.
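To illustrate the part that is reasonably agreed upon, here is a minimal sketch of such an error model: duplicate-cluster sizes drawn from a truncated Zipf distribution and simple typographical edits (substitution, deletion, transposition) applied to string values. Every parameter here is a guess, which is exactly the problem; abbreviations, multi-valued attributes and subjective values are not modelled at all.

    import random
    import string
    import numpy as np

    random.seed(0)
    np.random.seed(0)

    def zipf_cluster_size(a=2.0, max_size=10):
        """Draw a duplicate-cluster size from a Zipf(a) law, truncated at max_size."""
        while True:
            k = int(np.random.zipf(a))
            if k <= max_size:
                return k

    def add_typo(s):
        """Apply one random typographical edit: substitution, deletion or transposition."""
        if len(s) < 2:
            return s
        i = random.randrange(len(s) - 1)
        op = random.choice(["substitute", "delete", "transpose"])
        if op == "substitute":
            return s[:i] + random.choice(string.ascii_lowercase) + s[i + 1:]
        if op == "delete":
            return s[:i] + s[i + 1:]
        return s[:i] + s[i + 1] + s[i] + s[i + 2:]   # swap two adjacent characters

    clean_names = ["john smith", "maria garcia", "wei zhang"]
    corpus = []
    for entity_id, name in enumerate(clean_names):
        for _ in range(zipf_cluster_size()):
            corpus.append((entity_id, add_typo(name)))   # entity_id is the ground truth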
By "powerful" I mean more trusted datasets that have multiple records and attributes, with a range of distinct values within each attribute and a good percentage of duplicates between datasets. In other words, datasets that could be used for duplicate detection between records.
Antoon, I can't disagree with you at all; however, the difficulty should not lie in how we introduce the error, but in the size of the error we ought to introduce and, perhaps in line with what you have pointed out, the distribution of that error, if not Gaussian. Your answer reminds me that we statisticians are always faced with challenges in tasks like estimation and forecasting, which would never be possible without making strong assumptions about the datasets used. In that regard, I believe we can make do with the noised simulated dataset until a "better" dataset is obtained, if one is ever obtained at all!