
Authors
Melanie Weis, Felix Naumann, Franziska Brosy
Description
On this page we present a collection of generated test data. The paper describes a mechanism for evaluating XML duplicate detection algorithms using several metrics. Related work is listed in the last section of this page.
Abstract
Duplicate detection, which is an important subtask of data cleaning, is the task of identifying multiple representations of the same real-world object. Numerous approaches exist for both relational and XML data. Their goals are either to improve the quality of the detected duplicates (effectiveness) or to save computation time (efficiency). For the first goal in particular, the "goodness" of an approach is usually evaluated in experimental studies. Although some methods and data sets have gained popularity, it is still difficult to compare different approaches or to assess the quality of one's own approach. This difficulty is mainly due to a lack of documentation of the algorithms and of the data, software, and hardware used, and/or to limited resources that do not allow rebuilding systems described by others.
In this paper, we propose a benchmark for duplicate detection, specialized to XML, which can be part of a broader duplicate detection or even data cleansing benchmark. We discuss all elements necessary to make up a benchmark: data provisioning, clearly defined operations (the benchmark workload), and metrics to evaluate the quality. The proposed benchmark is a step toward representative comparisons of duplicate detection algorithms. We note that this benchmark is yet to be implemented and that this paper is meant to be a starting point for discussion.
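To make the notion of effectiveness metrics more concrete, the following is a minimal sketch of how detected duplicates could be scored against a known gold standard. It assumes pairwise precision, recall, and F1, which are commonly used for this purpose; this page does not state which metrics the paper actually defines. The function names and the element IDs (`m1`, `m2`, ...) are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """Expand clusters of object IDs into the set of unordered duplicate pairs."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair one canonical orientation.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(detected_clusters, gold_clusters):
    """Return (precision, recall, f1) computed over duplicate pairs.

    Assumption: effectiveness is measured pairwise; the benchmark
    itself may define its metrics differently.
    """
    detected = duplicate_pairs(detected_clusters)
    gold = duplicate_pairs(gold_clusters)
    true_positives = len(detected & gold)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: each ID stands for one XML element.
gold = [{"m1", "m2", "m3"}, {"m4", "m5"}]
detected = [{"m1", "m2"}, {"m3", "m4", "m5"}]
precision, recall, f1 = pairwise_scores(detected, gold)
print(precision, recall, f1)
```

Generated test data is convenient for this kind of scoring because the true duplicate clusters are known by construction, so no manual labeling is needed to obtain the gold standard.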
Test data
The datasets that we used for testing the algorithms are listed below:
Related work from our Information Systems group
Here you can find several scientific works that also deal with duplicate detection in XML data:


