Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

13.12.2023

Once again, we are very happy to announce that two papers were accepted in the past few days for presentation at the EDBT 2024 conference. A great success! Please find more information below.

TASHEEH: Repairing Row-Structure in Raw CSV Files

AUTHORS
Mazhar Hameed (Hasso Plattner Institute)
Felix Naumann (Hasso Plattner Institute)
Fabian Panse (Hasso Plattner Institute)
Gerardo Vitagliano (Hasso Plattner Institute)

ABSTRACT
Comma-separated value (CSV) files follow a useful and widespread format for data exchange due to their flexible standard. However, due to this flexibility and plain text format, such files often have structural issues, such as unescaped quote characters within quoted fields, columns containing different value formats, rows with different numbers of cells, etc. We refer to rows that contain such structural inconsistencies as ill-formed. Consequently, ingesting them into a host system, such as a database or an analytics platform, often requires prior data preparation steps. Traditionally, data scientists write custom code to clean ill-formed rows, even before they can use data cleaning tools and libraries, which assume all data to be properly loaded. These tasks are tedious and time-consuming, requiring expertise and frequent human intervention. To automate this process, we propose Tasheeh, a system that automatically detects ill-formed rows containing data and then standardizes their structure into a uniform format based on the structure of well-formed rows. Of 200 351 manually annotated rows from four different sources, Tasheeh was able to correctly detect 95.53% of data rows and accurately generate transformations for 87.83% of them.
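
To make one of the structural issues named in the abstract concrete (rows with different numbers of cells), here is a minimal Python sketch. It is not Tasheeh's method, which the abstract does not detail; it assumes, purely for illustration, that the most frequent cell count in a file marks the well-formed rows. The function name `flag_rows_with_unusual_cell_count` and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def flag_rows_with_unusual_cell_count(text, delimiter=","):
    """Return the indices of rows whose cell count differs from the majority.

    Assumption (not from the paper): the most common cell count in the file
    is treated as the well-formed row structure.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    non_empty = [row for row in rows if row]  # csv.reader yields [] for blank lines
    if not non_empty:
        return []
    expected = Counter(len(row) for row in non_empty).most_common(1)[0][0]
    return [i for i, row in enumerate(rows) if row and len(row) != expected]


# Hypothetical sample: row 2 has too few cells, row 3 has too many.
sample = (
    "id,name,city\n"
    "1,Ann,Berlin\n"
    "2,Bob\n"
    "3,Cy,Potsdam,extra\n"
    "4,Di,Kiel\n"
)
flagged = flag_rows_with_unusual_cell_count(sample)
print(flagged)
```

A check like this only flags candidate rows; repairing them into the structure of the well-formed rows is the harder step that the paper addresses.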


Efficient Discovery of Temporal Inclusion Dependencies in Wikipedia Tables

AUTHORS
Leon Bornemann (Hasso Plattner Institute)
Tobias Bleifuß (Hasso Plattner Institute)
Dmitri V. Kalashnikov (Unaffiliated)
Fatemeh Nargesian (University of Rochester)
Felix Naumann (Hasso Plattner Institute)
Divesh Srivastava (AT&T Chief Data Office)

ABSTRACT
Inclusion dependencies (INDs) demand that the value set that appears in one attribute is contained in the value set that appears in a different attribute. The automatic discovery of INDs in static data is a well-researched topic with many use-cases, such as foreign key discovery. However, data is usually not static; in fact, data changes frequently, especially on Wikipedia. The availability of change data allows us to take a fresh look at the discovery of INDs in Wikipedia tables, by taking into account not only the current state of a dataset, but also its past versions. In this work, we formally define the concept of temporal INDs (tINDs) and introduce several relaxations, allowing for the discovery of tINDs in dirty data. We present an efficient index structure for unary tIND search that returns all valid tINDs for a user query in 63 milliseconds on average, allowing users to interactively explore tIND relationships in Wikipedia tables. Furthermore, we can use our index to discover the set of all valid tINDs between 1.3 million attributes from Wikipedia tables in less than three hours. Finally, we show empirically that tIND discovery can help to find genuine INDs much more reliably than IND discovery on static data.
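
The IND condition stated at the start of the abstract can be sketched in a few lines of Python. The first function follows that definition directly (value-set containment). The second is only an illustrative assumption about what taking past versions into account could look like: it requires containment in every aligned snapshot. It is not the paper's formal tIND definition, nor one of its relaxations, and all names and sample values are hypothetical.

```python
def ind_holds(dependent, referenced):
    """Unary IND: every value of `dependent` also appears in `referenced`."""
    return set(dependent) <= set(referenced)


def ind_holds_in_all_versions(dependent_versions, referenced_versions):
    """Strict check over aligned snapshots (an assumption, not the paper's tIND).

    Both arguments are lists of column snapshots, where position i in each
    list is assumed to describe the same point in time.
    """
    if len(dependent_versions) != len(referenced_versions):
        raise ValueError("snapshot lists must be aligned and of equal length")
    return all(
        ind_holds(dep, ref)
        for dep, ref in zip(dependent_versions, referenced_versions)
    )


# Hypothetical snapshots of two columns at three points in time.
dependent_history = [["DE"], ["DE", "FR"], ["DE", "FR"]]
referenced_history = [["DE", "IT"], ["DE", "IT"], ["DE", "FR", "IT"]]

# Holds in the latest snapshot, but not in the middle one ("FR" is missing).
print(ind_holds(dependent_history[-1], referenced_history[-1]))
print(ind_holds_in_all_versions(dependent_history, referenced_history))
```

The example shows why history matters: a containment that holds in the current state may have been violated in an earlier version, and such a strict check would reject it, which motivates the relaxations for dirty data mentioned in the abstract.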