
Prof. Dr. Felix Naumann
Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Str. 2-3
D-14482 Potsdam, Germany
Daniel Rinser wins award for his master's thesis
IQ Best Master Degree competition of the Deutsche Gesellschaft für Informations- und Datenqualität e....
HPI TV releases video about GovWILD
See the new video about our Government Data Integration platform GovWILD.
Tool voidGen released
As part of our winning submission at the 2010 Billion Triple Challenge at the International...
ICDE Paper Accepted
28th IEEE International Conference on Data Engineering (ICDE) Washington, DC, USA Adaptive...
GovWILD in LOD cloud
The GovWILD team is happy to announce that the latest version of the LOD cloud (September 2011)...
CoopIS Paper Accepted
The 19th International Conference on Cooperative Information Systems (CoopIS) Crete, Greece...
ICSOC Paper Accepted
Revealing Hidden Relations among Web Services Using Business Process Knowledge ... Mohammed...
5 Papers Accepted at CIKM 2011 / 1 Paper Accepted at the co-located SMER Workshop
Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow,...
The research goal of the Information Systems Group is the efficient and effective management of heterogeneous information in large, autonomous systems. This includes methods for data profiling, data cleansing, search, and metadata management. Please also see our welcome video.
Research topics
An article in the Data Engineering Bulletin gives a good overview of some of our research topics: "Data Fusion in Three Steps: Resolving Schema, Tuple, and Value Inconsistencies" (2006). Further details can be found on our project and publications pages. In addition, we maintain a repeatability site to publish code and data.
- Data quality / information quality: The quality of data is measured in many different dimensions. Quality values can be aggregated along data operations, for instance to calculate the quality of query results.
Links: ICIQ 2009
German: "Datenqualität" (data quality) keyword article in Informatik Spektrum
- Duplicate detection: Duplicates are multiple, different representations of the same real-world object, for instance, multiple records of a customer in a CRM database. Research on duplicate detection tries to build systems that find such duplicates efficiently and effectively in large data sets.
Links: Synthesis lecture, repeatability, DuDe
German: Duplicate detection explained for a general audience
- Linked Open Data (LOD): More and more sources provide data in RDF form as linked open data. Such data serves as a use case in a variety of projects.
Links: HPI's open data activities, ProLOD
- Service-oriented Computing (SOC): SOC has been a popular approach to enterprise and distributed applications. It is typically achieved through Web Services. As the number of Web Services offered on the web has grown, their usability has suffered. In our research, we aim to increase the usability of public Web Services through information integration techniques, such as web crawling, annotation extraction, and classification.
Links: PoSR, Depot (online demo)
- Similarity Search: Queries often do not exactly match desired objects in the data store. To also find similar matches for a query, a similarity measure as well as a similarity-aware index structure are necessary.
Links: Similarity search research project, Similarity Search Algorithms seminar (German)
- Data profiling: When integrating heterogeneous sources, details of the schema, such as foreign key dependencies, are often unknown. We are developing data profiling methods to automatically detect these and other dependencies in very large databases. In the context of the Aladin project, these methods are applied to life sciences databases.
German: ProLOD seminar
- ETL Management: ETL processes are defined to integrate heterogeneous data into a data warehouse. ETL management is the systematic, semi-automatic management of large sets of such processes. It includes several simple operators, such as IMPORT and SEARCH, and more complex operators, such as MATCH, MERGE, or INVERT.
Links: Bachelor projects GrETL and Moritz, METL
- Data Fusion: Data fusion is the process of fusing multiple records representing the same real-world object, i.e., duplicates, into a single, consistent, and clean representation. Challenges are scalability over large data volumes and conflict resolution of contradictory values.
Links: FuSem, Hummer, ACM computing survey, VLDB tutorial
- Schema matching: Schema matching is the (semi-automatic) process of detecting attribute correspondences between two heterogeneous schemata. These correspondences can subsequently be used to create a schema mapping to be used for data transformation or data exchange.
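To make two of the topics above more concrete, the following is a minimal sketch of duplicate detection followed by data fusion. It is not taken from any of the group's tools: the customer records, the similarity measure (average string similarity over name and city), the threshold of 0.85, and the conflict-resolution rule (prefer non-null values, then the longer one) are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records; records 2 and 3 describe the same person.
records = [
    {"id": 1, "name": "Anna Schmidt", "city": "Potsdam", "phone": None},
    {"id": 2, "name": "Jon Doe", "city": "Berlin", "phone": "030-1234"},
    {"id": 3, "name": "John Doe", "city": "Berlin", "phone": None},
]


def similarity(a, b, fields=("name", "city")):
    """Average case-insensitive string similarity over the given fields.

    This measure is an assumption for the sketch, not a recommended one.
    """
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def find_duplicates(recs, threshold=0.85):
    """Compare all record pairs and return the id pairs above the threshold.

    The all-pairs comparison is quadratic in the number of records, so it
    only illustrates the idea and would not scale to large data sets.
    """
    return [
        (a["id"], b["id"])
        for a, b in combinations(recs, 2)
        if similarity(a, b) >= threshold
    ]


def fuse(a, b):
    """Fuse two duplicate records into one representation.

    Conflicts are resolved per attribute: non-null values win over nulls,
    and among two non-null values the longer one is kept (an assumed rule).
    """
    fused = {}
    for key in a:
        if key == "id":
            continue  # the fused record gets no id in this sketch
        values = [v for v in (a[key], b[key]) if v is not None]
        fused[key] = max(values, key=len) if values else None
    return fused


duplicate_pairs = find_duplicates(records)
fused_record = fuse(records[1], records[2])
```

In practice, the pairwise comparison would be replaced by a method that avoids comparing every pair, and the conflict-resolution rule would be chosen per attribute.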
Teaching
- Bachelor: We offer regular German-language lectures on database systems, namely Datenbanksysteme I (DBS I) and Datenbanksysteme II (DBS II). In addition, we offer the regular seminar "Beauty is our Business" and many other project-oriented seminars.
One-year bachelor projects with 6-8 students conclude the bachelor studies at HPI. Our group offers one or two such projects per year in cooperation with external partners.
- Master: We alternately offer the bi-annual courses "Information Integration" and "Search Engines". In addition, we offer diverse specialized seminars, some theoretical, some project-oriented.
Library
Our group library catalog is online. Books can be borrowed.
The following word cloud was generated at http://www.wordle.net using this article.



