Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

TASHEEH - Cleaning ill-formed Records in CSV Files

For self-service data preparation, we propose TASHEEH, a system built on our state-of-the-art error detection system, SURAGH. TASHEEH aims to improve the processing of raw CSV files by standardizing the structure of ill-formed records into a uniform format based on the structure of well-formed records. It leverages syntax-based patterns both to understand the ill- and well-formedness of individual records in a file and to clean up their structure for uniformity.

Wanted and Unwanted Records

Ill-formed records are often table titles, footnotes, or empty rows (unwanted). Many, however, contain payload data (wanted) together with additional structural or formatting information, and possibly additional attributes, which is what makes them ill-formed in the first place. Some are just as structurally valid as well-formed records but are classified as ill-formed because their structure is not the dominant one. To determine whether an ill-formed record contains payload data, we classify it as wanted or unwanted. We then clean the structure of the wanted records and delete the unwanted ones.
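The wanted/unwanted distinction can be illustrated with a small Python sketch. It is not TASHEEH's actual classifier: the coarse cell classes, the function names `cell_class` and `is_wanted`, the `min_overlap` threshold, and the toy records are all assumptions made for illustration. The idea shown is only that a record whose cells largely line up, in order, with the pattern of the well-formed records probably carries payload data.

```python
from difflib import SequenceMatcher


def cell_class(cell: str) -> str:
    """Map a cell to a coarse syntactic class (hypothetical abstraction)."""
    cell = cell.strip()
    if not cell:
        return "EMPTY"
    try:
        float(cell)
        return "NUM"
    except ValueError:
        return "TEXT"


def is_wanted(record, dominant, min_overlap=0.5):
    """Guess whether an ill-formed record carries payload data.

    Hypothetical rule: the record is wanted if at least `min_overlap`
    of the dominant pattern's columns can be aligned, in order, with
    cells of the same class in the record.
    """
    classes = [cell_class(c) for c in record]
    matcher = SequenceMatcher(None, dominant, classes, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(dominant) >= min_overlap


# Assumed pattern of the well-formed records, e.g. "Berlin,3645000,891.7".
DOMINANT = ["TEXT", "NUM", "NUM"]

# Made-up ill-formed records for illustration only.
ill_formed = [
    ["Table 1: City statistics"],                  # table title
    ["", "", ""],                                  # empty row
    ["Hamburg", "1841000", "755.2", "estimate"],   # payload plus extra attribute
    ["Munich", "", "1472000", "310.7"],            # payload with a stray empty cell
]

for rec in ill_formed:
    label = "wanted" if is_wanted(rec, DOMINANT) else "unwanted"
    print(label, rec)
```

In this toy setting, the title and the empty row are labeled unwanted, while the two records that contain a city name followed by two numbers are labeled wanted even though their column counts differ from the dominant structure.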

The workflow of TASHEEH

The workflow of TASHEEH consists of three phases. In the first phase, TASHEEH uses the output of its predecessor SURAGH, which relies on dominant row patterns, to classify the records of the input file as ill-formed or well-formed. It then runs SURAGH incrementally on the ill-formed records to obtain row patterns specifically for those rows; we call these potential dominant row patterns, because the corresponding ill-formed data rows can possibly be transformed into well-formed data rows. TASHEEH repeats this incremental pattern generation until the remaining ill-formed records yield no further dominant patterns. At the end of the first phase, TASHEEH has obtained dominant patterns for the well-formed records and potential dominant patterns for the ill-formed records. The second phase uses these patterns to classify the ill-formed records as wanted or unwanted. In the final phase, TASHEEH collects the wanted records, the well-formed records, and their patterns from the previous phases and removes the unwanted records. It then uses the pattern transformation grammar to transform the wanted records into well-formed ones.
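The incremental pattern generation of the first phase can be sketched roughly as follows. This is a simplified stand-in, not SURAGH's pattern language or dominance criterion: the three cell classes, the `min_share` and `min_count` thresholds, the function names, and the example records are all hypothetical.

```python
from collections import Counter


def row_pattern(record):
    """Abstract a record into a coarse row pattern (assumed classes):
    D = digits-only cell, T = any other text, E = empty cell."""
    def cls(cell):
        cell = cell.strip()
        if not cell:
            return "E"
        return "D" if cell.isdigit() else "T"
    return tuple(cls(c) for c in record)


def dominant_patterns(records, min_share=0.3, min_count=2):
    """Patterns shared by enough records (hypothetical dominance criterion)."""
    if not records:
        return set()
    counts = Counter(row_pattern(r) for r in records)
    return {p for p, n in counts.items()
            if n >= min_count and n / len(records) >= min_share}


def phase_one(records):
    """Return (dominant, potential dominant, leftover ill-formed records)."""
    dominant = dominant_patterns(records)
    ill = [r for r in records if row_pattern(r) not in dominant]
    potential = set()
    while ill:
        new = dominant_patterns(ill)   # rerun detection on ill-formed rows only
        if not new:
            break                      # no further dominant pattern emerges
        potential |= new
        ill = [r for r in ill if row_pattern(r) not in new]
    return dominant, potential, ill


# Made-up records for illustration only.
records = [
    ["Population report", "", ""],
    ["Berlin", "3645000"], ["Hamburg", "1841000"], ["Munich", "1472000"],
    ["Cologne", "1086000"], ["Dresden", "556000"],
    ["Frankfurt", "753000", "est."], ["Stuttgart", "635000", "est."],
    ["Source: example"],
    ["", ""],
]

dominant, potential, leftover = phase_one(records)
print(dominant)    # {('T', 'D')}
print(potential)   # {('T', 'D', 'T')}
print(leftover)    # the title, the source line, and the empty row
```

In this sketch, the two records with a trailing `est.` cell share a pattern that only becomes dominant once the well-formed rows are set aside, which mirrors the idea of potential dominant patterns. Classifying the affected records as wanted or unwanted and transforming the wanted ones are left to the later phases and are not modeled here.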

Resources

Code & Datasets

TASHEEH is an open-source project. Its code, along with the datasets and annotations, is available on the project page on GitHub.

Contacts

The research project is conducted by the Data Preparation group. If you have any questions, ideas, or requests, please contact Mazhar Hameed.