
Roughly every third Wikipedia article contains an infobox - a table that displays important facts about the subject in attribute-value form. The schema of an infobox, i.e., the attributes that can be expressed for a concept, is defined by an infobox template. Often, authors do not specify all template attributes, resulting in incomplete infoboxes.
With iPopulator, we introduce a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. In contrast to prior work, iPopulator detects and exploits the structure of attribute values to independently extract value parts. We have tested iPopulator on the entire set of infobox templates and provide a detailed analysis of its effectiveness. For instance, we achieve an average extraction precision of 91% for 1,727 distinct infobox template attributes.

Extracted Data
We ran iPopulator on the complete Wikipedia dump (as of December 2010) and successfully extracted many new infobox attribute values. We provide the extracted data in three formats:
- Raw data: A list of tab-separated extraction records (article name, attribute, value) with raw values in MediaWiki syntax
- CSV: A list of comma-separated triples (article name as subject, attribute as predicate, extracted value as object)
- N3: The same triples (article name as subject, attribute as predicate, extracted value as object) in N3/Turtle syntax (a readable serialization format for RDF)
Note that while the raw data contains multi-values (e.g., a list of names as the value of the attribute key_people in infobox_company), these values have been split up into several triples for CSV and N3. For these two formats, corrupted links have been removed, and all subjects, properties, and links in values have been transformed into resources or properties. In general, we use DBpedia resource and property URIs for our dataset. For clarity, however, all additionally extracted resources and properties that are not part of DBpedia use the namespace http://hpi-web.de/naumann/ipopulator.
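The step from a raw record to several triples can be illustrated with a minimal Python sketch. It is not the code used to produce the dataset. The function names, the separators used to split multi-values (`<br />` tags and commas), and the handling of wiki links are assumptions made for illustration. The sketch also maps every subject and attribute to the standard DBpedia resource and property namespaces and does not model the fallback to the iPopulator namespace.

```python
import re

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
DBPEDIA_PROPERTY = "http://dbpedia.org/property/"

# Assumption: multi-values are separated by <br> tags or commas.
_SEPARATOR = re.compile(r"\s*(?:<br\s*/?>|,)\s*")
# Matches a value part that consists of exactly one wiki link, e.g. [[Target|label]].
_WIKILINK = re.compile(r"^\[\[([^\]|]+)(?:\|[^\]]*)?\]\]$")


def to_triples(article, attribute, raw_value):
    """Split one raw (article, attribute, value) record into triples."""
    subject = "<" + DBPEDIA_RESOURCE + article.replace(" ", "_") + ">"
    predicate = "<" + DBPEDIA_PROPERTY + attribute + ">"
    triples = []
    for part in _SEPARATOR.split(raw_value.strip()):
        if not part:
            continue  # skip empty parts left over from splitting
        link = _WIKILINK.match(part)
        if link:
            # A wiki link becomes a resource URI.
            target = link.group(1).strip().replace(" ", "_")
            obj = "<" + DBPEDIA_RESOURCE + target + ">"
        else:
            # Anything else becomes a plain string literal.
            escaped = part.replace("\\", "\\\\").replace('"', '\\"')
            obj = '"' + escaped + '"'
        triples.append((subject, predicate, obj))
    return triples


def parse_raw_line(line):
    """Parse one tab-separated line of the raw data format."""
    article, attribute, value = line.rstrip("\n").split("\t", 2)
    return to_triples(article, attribute, value)
```

For a hypothetical raw line such as `Acme Corp<TAB>key_people<TAB>[[Jane Doe]]<br />John Smith`, `parse_raw_line` returns two triples: one whose object is a resource URI and one whose object is a string literal.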
iPopulator automatically evaluates its extraction performance using existing infobox attribute values as test data. This allows us to extract new values only for promising infobox attributes. We provide extracted data with three different levels of minimum extraction precision (based on the test data).
Extraction precision | # extracted values | # triples generated from extracted values
>= 80% | 259,892 | 307,700
>= 90% | 149,150 | 198,529
>= 95% | 109,345 | 158,115
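The idea of extracting new values only for attributes that meet a minimum precision can be sketched as follows. This is an illustrative sketch, not iPopulator's actual evaluation procedure. The function names and the input layout are hypothetical, and comparing an extracted value with the existing value by exact equality is a simplifying assumption.

```python
from collections import defaultdict


def attribute_precision(test_results):
    """Compute per-attribute precision.

    test_results: iterable of (attribute, extracted_value, existing_value),
    where extracted_value is None if nothing was extracted.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for attribute, extracted, existing in test_results:
        if extracted is None:
            continue  # no extraction made, so precision is unaffected
        total[attribute] += 1
        # Assumption: an extraction is correct only on exact match.
        if extracted == existing:
            correct[attribute] += 1
    return {a: correct[a] / total[a] for a in total}


def promising_attributes(test_results, min_precision):
    """Return the attributes whose test precision reaches the threshold."""
    precision = attribute_precision(test_results)
    return {a for a, p in precision.items() if p >= min_precision}
```

Calling `promising_attributes(results, 0.8)`, `promising_attributes(results, 0.9)`, and `promising_attributes(results, 0.95)` on the same test results would yield progressively smaller attribute sets, which matches the shrinking number of extracted values in the table above.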
The extracted data is provided for free use in any application. If you would like to use the data, we would be glad to hear about it. If you would like to cite our work, please refer to our CIKM paper [1].
Contact
If you have any questions or comments, please contact Dustin Lange.
Publications
[1] Dustin Lange, Christoph Böhm, and Felix Naumann. Extracting structured information from Wikipedia articles to populate infoboxes. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), pages 1661-1664, Toronto, Canada, 2010.
[2] Dustin Lange, Christoph Böhm, and Felix Naumann. Extracting structured information from Wikipedia articles to populate infoboxes. Technical Report 38, Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam, 2010. ISBN 978-3-86956-081-6, ISSN 1613-5652.


