
Prof. Dr. Felix Naumann
Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Str. 2-3
D-14482 Potsdam, Germany
Project Members: Sven Puhlmann, Felix Naumann, Melanie Weis
Web site: www.hpi.uni-potsdam.de/~naumann/projekte/completed_projects/dirtyxml.html
Whenever data from several sources is integrated, cleansing algorithms are applied to the integrated data. Testing these algorithms requires "dirty" sample data. The Dirty XML Data Generator is a Java tool that creates a dirty XML data file from a clean XML document and a set of parameters. Depending on the parameter set, the generated data can contain errors of different types, such as duplicates and misspellings. It is used to benchmark algorithms that clean nested, integrated XML data.
The Dirty XML Data Generator was implemented by Sven Puhlmann in the context of a student research project.
1. Main Features
- Flexible and fast generation of dirty XML data
- Extensible implementation: algorithms that pollute character data in a specific way can be added by implementing a very simple Java interface.
- Clearly arranged parameter definition in an XML file with reusable components: the parameterised algorithm specifications
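The extension point mentioned in the feature list can be pictured roughly as follows. This is only a minimal sketch: the interface name `PollutionAlgorithm`, its method signature, and the class `SwapAdjacentChars` are hypothetical stand-ins chosen for illustration, not the generator's actual API (see the Detailed Documentation for that).

```java
import java.util.Random;

final class PollutionSketch {

    /** Hypothetical stand-in for the generator's algorithm interface (name and signature assumed). */
    interface PollutionAlgorithm {
        String pollute(String cleanValue, Random random);
    }

    /** Example algorithm: swaps one randomly chosen pair of adjacent characters. */
    static final class SwapAdjacentChars implements PollutionAlgorithm {
        @Override
        public String pollute(String cleanValue, Random random) {
            if (cleanValue.length() < 2) {
                return cleanValue; // nothing to swap
            }
            int i = random.nextInt(cleanValue.length() - 1); // index of the left character
            char[] chars = cleanValue.toCharArray();
            char tmp = chars[i];
            chars[i] = chars[i + 1];
            chars[i + 1] = tmp;
            return new String(chars);
        }
    }

    public static void main(String[] args) {
        PollutionAlgorithm swap = new SwapAdjacentChars();
        // A fixed seed makes the polluted value reproducible between runs.
        System.out.println(swap.pollute("Potsdam", new Random(42)));
    }
}
```

Passing the `Random` instance in, rather than creating it inside the algorithm, is a design choice of this sketch: it lets a benchmark regenerate exactly the same dirty data from a seed.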
2. Sample of dirty XML data generation
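Clean XML file (only the first line of the listing survives):
<?xml version="1.0"?>
Parameter XML file (only the first line of the listing survives):
<?xml version="1.0"?>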
Some examples: The element person is never deleted, but it is duplicated every time it occurs in the source file. The maximum number of duplicates generated is 1. The attribute named sex belongs to the person element. Its string values are deleted with a probability of 0.3 and, if not deleted, changed with a probability of 0.75 using the swap1 algorithm defined above. Because swap1 is the only algorithm used to pollute this attribute, its usage probability is 1. The character data of the element address, in contrast, is polluted by two different algorithms (swap2 and del1), which are used with probabilities of 0.8 and 0.2, respectively. Note that the usage probabilities must add up to 1 (that is, 100%).
For a detailed explanation of the parameters please have a look at the Detailed Documentation.
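The interplay of the deletion probability, the change probability, and the usage probabilities described above can be sketched as follows. This is an illustrative reading of the parameters, not the generator's actual code: the class and method names are hypothetical, and the two stand-in algorithms only borrow the names swap2 and del1, whose real behaviour is defined in the parameter file.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;
import java.util.function.UnaryOperator;

final class ValuePollutionSketch {

    /** Picks one key according to usage probabilities, which must add up to 1. */
    static <T> T pickWeighted(Map<T, Double> usage, Random random) {
        double sum = 0.0;
        for (double p : usage.values()) {
            sum += p;
        }
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("usage probabilities must add up to 1, got " + sum);
        }
        double r = random.nextDouble(); // uniform in [0, 1)
        double cumulative = 0.0;
        T chosen = null;
        for (Map.Entry<T, Double> entry : usage.entrySet()) {
            cumulative += entry.getValue();
            chosen = entry.getKey();
            if (r < cumulative) {
                break;
            }
        }
        return chosen; // the last entry is the fallback if rounding leaves a gap
    }

    /** Returns null if the value is deleted; otherwise the clean or the polluted value. */
    static String polluteValue(String value, double deleteProb, double changeProb,
                               Map<UnaryOperator<String>, Double> algorithms, Random random) {
        if (random.nextDouble() < deleteProb) {
            return null; // value deleted
        }
        if (random.nextDouble() < changeProb) {
            return pickWeighted(algorithms, random).apply(value);
        }
        return value; // kept clean
    }

    // Stand-in algorithms (assumed behaviour, for illustration only).
    static String swapFirstTwo(String v) {
        return v.length() < 2 ? v : "" + v.charAt(1) + v.charAt(0) + v.substring(2);
    }

    static String dropLast(String v) {
        return v.isEmpty() ? v : v.substring(0, v.length() - 1);
    }

    public static void main(String[] args) {
        Map<UnaryOperator<String>, Double> addressAlgorithms = new LinkedHashMap<>();
        addressAlgorithms.put(ValuePollutionSketch::swapFirstTwo, 0.8); // plays the role of swap2
        addressAlgorithms.put(ValuePollutionSketch::dropLast, 0.2);     // plays the role of del1
        Random random = new Random(7);
        // Deletion probability 0.3 and change probability 0.75, as in the sex example above.
        System.out.println(polluteValue("Potsdam", 0.3, 0.75, addressAlgorithms, random));
    }
}
```

The change probability is applied only to values that survived the deletion step, which matches the "if not deleted" wording of the example.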
Executing the Dirty XML Data Generator with the clean XML file, the parameter XML file, and the name of the dirty XML file (here: persons_dirty.xml) as input leads to the following result:
1 | <?xml version="1.0" encoding="UTF-8"?> |
The first 17 lines contain the person elements of the source file, which have not been polluted (as requested with the errorsInAncestors attribute in the root element of the parameter file). Lines 18 to 35 contain the same elements in polluted form, because we defined a duplication probability of 1 and at most one duplicate. These duplicates carry the dirty data according to the parameters.
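The layout of this result, clean originals first and polluted duplicates after them, can be reproduced with a small sketch. It uses the JDK's built-in DOM API rather than JDOM so that it runs without extra libraries, and the class name, method names, and placeholder pollution step are assumptions for illustration, not the generator's implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DuplicationSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Placeholder pollution: swaps the first two characters of every text node. */
    static void polluteText(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String v = node.getNodeValue();
            if (v.length() >= 2) {
                node.setNodeValue("" + v.charAt(1) + v.charAt(0) + v.substring(2));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            polluteText(child);
        }
    }

    /** Leaves originals untouched and appends at most one polluted copy per element. */
    static int duplicate(Document doc, String elementName, double duplicationProb, Random random) {
        NodeList live = doc.getElementsByTagName(elementName);
        List<Element> originals = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            originals.add((Element) live.item(i)); // snapshot, because the NodeList is live
        }
        int created = 0;
        for (Element original : originals) {
            if (random.nextDouble() < duplicationProb) {
                Element copy = (Element) original.cloneNode(true); // deep copy
                polluteText(copy);
                original.getParentNode().appendChild(copy); // duplicates follow the originals
                created++;
            }
        }
        return created;
    }

    public static void main(String[] args) {
        Document doc = parse("<persons><person><name>Anna</name></person></persons>");
        System.out.println(duplicate(doc, "person", 1.0, new Random(1)) + " duplicate(s) created");
    }
}
```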
3. Terms of use
The software is free for academic purposes. We would very much
4. Download
Two packages are available for download:
- the complete distribution, containing the JAR file, the required JDOM library, an example, and the full Technical Report of the student research project (in German), and
- the JAR file only. In this case you need to download the JDOM library as well and add it to your classpath.


