
Prof. Dr. Felix Naumann
Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Str. 2-3
D-14482 Potsdam, Germany
Project members
DAQS (DAta Quality as a WebService) is a comprehensive data cleansing project. Its goal is to provide the full duplicate detection workflow as a web service with as little manual interaction as possible. The challenge is to enable the computer to make the decisions that a human expert usually makes.
Workflow
The workflow consists of three main steps.
- In the problem classification phase, the available dataset is analyzed and the degree of missing information is estimated. (see below)
- If the semantics (a.k.a. fine-grained data types) of the dataset are unclear, classes have to be assigned to the attributes.
- The actual duplicate detection is performed with the similarity measures derived from the classes.
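The third step, deriving similarity measures from attribute classes, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the class labels `name` and `zip_code`, the choice of measures, the averaging, and the threshold of 0.85 are not taken from DAQS.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    # Case-insensitive edit-based similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_similarity(a, b):
    # Codes such as ZIP codes are compared for equality only.
    return 1.0 if a == b else 0.0


# Hypothetical mapping from attribute class to similarity measure.
MEASURES = {"name": string_similarity, "zip_code": exact_similarity}


def record_similarity(rec_a, rec_b, classes):
    # classes maps attribute name -> class label for the compared attributes.
    scores = [MEASURES[cls](rec_a[attr], rec_b[attr])
              for attr, cls in classes.items()]
    return sum(scores) / len(scores) if scores else 0.0


def is_duplicate(rec_a, rec_b, classes, threshold=0.85):
    # The threshold is an illustrative value, not a tuned one.
    return record_similarity(rec_a, rec_b, classes) >= threshold
```

Once every attribute carries a class, no further manual choice of measures is needed in this sketch: the class label alone selects how two values are compared.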
1. Problem Classification
There are four types of datasets that can occur.
- Datasets can have semantic annotations for the attributes (and a mapping/a separator). Consequently, the duplicate detection task can be performed nearly automatically.
- The datasets only have a mapping (and a separator). Thus, it is clear which attributes to compare, but not how.
- In datasets without a mapping, only the tuples and attributes are distinguishable; it is not clear which attributes to compare with which other attributes.
- In the case of unstructured documents, not even tuples/attributes can be recognized. They have to be extracted first.
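The four cases above form a simple decision cascade, which the following sketch makes explicit. The flag names and the `DatasetInfo` type are hypothetical; how DAQS actually detects each property is not described here.

```python
from dataclasses import dataclass
from enum import Enum


class DatasetType(Enum):
    ANNOTATED = 1      # semantic annotations plus mapping/separator
    MAPPED = 2         # mapping and separator only
    UNMAPPED = 3       # tuples and attributes distinguishable, no mapping
    UNSTRUCTURED = 4   # tuples/attributes must be extracted first


@dataclass
class DatasetInfo:
    # Hypothetical flags describing what is known about a dataset.
    has_structure: bool
    has_mapping: bool
    has_annotations: bool


def classify_problem(info: DatasetInfo) -> DatasetType:
    # Check from the least to the most available information.
    if not info.has_structure:
        return DatasetType.UNSTRUCTURED
    if not info.has_mapping:
        return DatasetType.UNMAPPED
    if not info.has_annotations:
        return DatasetType.MAPPED
    return DatasetType.ANNOTATED
```

The returned type indicates how much work remains before duplicate detection can start: an `ANNOTATED` dataset can proceed almost directly, while an `UNSTRUCTURED` one first needs tuple and attribute extraction.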
2. Attribute Classification
If no semantics are assigned to the dataset's attributes, the service has to assign them. In DAQS, we use an instance-based as well as a machine learning classification approach to do that.
The figure shows the instance-based and machine learning classification classes.
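As a rough illustration of what instance-based classification can look like, the sketch below assigns a class to an attribute by matching its values against patterns and taking a majority vote. The class labels, the regular expressions, and the `min_share` cutoff are assumptions chosen for the example; they are not the classes or rules used in DAQS.

```python
import re
from collections import Counter

# Hypothetical classes with deliberately simple patterns.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "zip_code": re.compile(r"\d{5}"),
}


def classify_attribute(values, min_share=0.8):
    # Ignore null and empty values; they carry no evidence.
    non_null = [v.strip() for v in values if v and v.strip()]
    if not non_null:
        return None
    votes = Counter()
    for value in non_null:
        for label, pattern in PATTERNS.items():
            if pattern.fullmatch(value):
                votes[label] += 1
                break
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    # Assign a class only if most non-null values agree with it.
    return label if count / len(non_null) >= min_share else None
```

Returning `None` when the vote is inconclusive leaves room for a second approach, such as a trained classifier, to decide the remaining attributes.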
Datasets
| Usage | Dataset | Description | Source of the original dataset |
|---|---|---|---|
| Instance-based classification | | This is the (only) instance-based dataset. Some attributes have been removed for confidentiality, so the classification results will be slightly worse. | |
| Machine learning classification | | This is the training dataset for machine learning. It is a mix drawn from the different sources listed here, without overlapping any of the other datasets. (Usually, only the first 500 tuples are used.) | |
| Machine learning classification | | This is a file of voters in the Cerlmont county in the USA. | http://www.clarkcountynv.gov/Depts/election/Pages/VoterDataFiles.aspx |
| Machine learning classification | | This is a generated dataset with all available attributes from fakenamegenerator.com. | |
| Machine learning classification | | This is a dataset generated by students during an earlier information integration lecture. The data were mostly crawled from Wikipedia and other sources. | |
| Machine learning classification | | This is a dataset generated by students during an earlier information integration lecture. The data were mostly crawled from Wikipedia and other sources. | |
| Machine learning classification | | This is a dataset generated by students during an earlier information integration lecture. The data were mostly crawled from Wikipedia and other sources. | |
| Machine learning classification | | This dataset comes from an information integration assignment of the University of Arkansas at Little Rock. | |
| Machine learning classification | | This dataset comes from an information integration assignment of the University of Arkansas at Little Rock. | |
| Machine learning classification | | This dataset was crawled from Deutschland-API. | |
| Machine learning classification | | This dataset is an overview of some mines in the USA. | |
The original datasets often contained many null values. Since we used only 500 tuples per dataset in our experiments, we randomly selected these 500 tuples from among the best-filled tuples of each dataset.
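One possible reading of this selection is sketched below: rank tuples by the number of non-null attributes, keep a pool of the best-filled ones, and draw the sample from that pool at random. The pool size (`pool_factor`) and the treatment of empty strings as nulls are assumptions; the original selection procedure is not specified in more detail.

```python
import random


def fill_count(row):
    # Number of attributes that are neither None nor an empty string.
    return sum(1 for v in row.values() if v not in (None, ""))


def sample_best_filled(rows, sample_size=500, pool_factor=2, seed=None):
    # Rank tuples from best-filled to worst-filled.
    ranked = sorted(rows, key=fill_count, reverse=True)
    # Keep only the top of the ranking as the candidate pool (assumed size).
    pool = ranked[: sample_size * pool_factor]
    rng = random.Random(seed)
    # Draw without replacement; shrink the sample if the pool is small.
    return rng.sample(pool, min(sample_size, len(pool)))
```

Passing a fixed `seed` makes the selection reproducible, which helps when the same 500 tuples should be reused across experiments.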
Demo
A small demo was created for the wheelmap.org project.
Wheelmap-Service-Endpoint (RESTful)


