Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Magritte: Structural embedding of data files

Data files = content + structure

Data is often encoded, distributed, and stored in data files. To access the content of these files, it is necessary to parse their structure. For some formats, such as JSON or XML, the file structure is well defined by strict standards.

For others, such as CSV and CSV-like files, the structure is more loosely defined and is often the result of custom decisions and ad-hoc adaptations of the RFC 4180 standard to fit a given use case. Therefore, users are often required to know the structure beforehand, for example to parse delimiters and quotation characters, remove comment or metadata lines, or extract multiple tables.
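As a minimal illustration of why the structure must be known beforehand, the following sketch parses a small CSV-like file with Python's standard csv module. The file content, the "#" comment convention, and the helper name load_rows are assumptions made up for this example; they are not part of MaGRiTTE.

```python
import csv

# Hypothetical CSV-like file: two comment lines, ';' as delimiter,
# and single quotes as quotation characters.
RAW = """# exported 2022-11-01
# source: sensor A
id;name;value
1;'Smith; John';3.5
2;'Doe; Jane';4.0
"""


def load_rows(text, delimiter, quotechar, comment_prefix):
    """Parse `text` using structural parameters the user must already know."""
    # Drop comment lines before handing the rest to the CSV parser.
    lines = [ln for ln in text.splitlines() if not ln.startswith(comment_prefix)]
    return list(csv.reader(lines, delimiter=delimiter, quotechar=quotechar))


# With the correct structural parameters, the table is recovered.
rows = load_rows(RAW, delimiter=";", quotechar="'", comment_prefix="#")

# With the default dialect (',' and '"'), every line collapses into one cell.
naive = list(csv.reader(RAW.splitlines()))
```

With the right parameters, rows holds a header and two records of three cells each. The naive call keeps the comment lines and returns a single cell per line, which is the kind of silent failure that structural preparation has to prevent.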

These tasks are all necessary steps at the syntactic level: they prepare the data file so that its content can be loaded properly and semantic tasks, whether further cleaning or downstream tasks, can be performed.

Currently, structural data preparation is the elephant in the room. Every data practitioner has to deal with structural issues, yet little attention is devoted to making these efforts systematic, documented, or reproducible.

As a foundational step towards systematic data preparation, we envision an unambiguous representation of the structure of files, which could be used to automate different structural tasks, to assess the level of preparation necessary to load a given dataset, or as metadata for structural indexing of data files.

To serve these purposes, we propose MaGRiTTE: a Machine Generated Representation of Tabular files using Transformer Encodings.

The MaGRiTTE architecture

MaGRiTTE is an automated approach to learn embeddings that capture the structural information of files.

Its core architecture is composed of two stages based on two popular architectures used in representation learning: transformer encoders and convolutional autoencoders.

First, we encode file rows using a transformer encoder architecture specifically pre-trained to learn structural features.

As their success in natural language processing tasks demonstrates, transformer models are effective at learning the inherent structure of sentences and languages. Therefore, we specifically train a transformer model to learn the "language" of data files, i.e., their structure, with file rows corresponding to the "sentences" of such a language.

But file structure is not limited to rows, as columns and combinations of rows can also have structural significance (for example, in files containing different tables or sequences of preamble lines). For this reason, the second stage of MaGRiTTE is designed to learn a file-wise structural encoding.

To do so, we use a variational autoencoder architecture composed of several convolutional layers, trained to reconstruct the stack of row-wise embeddings from a fixed-length vector that serves as the file-wise representation.

The intuition behind the use of convolutional autoencoding is that file-wise structure can be understood through the occurrence of two-dimensional, local row-column features, which convolutional filters have proven successful in detecting.

The overall MaGRiTTE architecture is summarized in the image below.

Pipeline of MaGRiTTE

Pretraining phase

To pre-train the transformer architecture on structural features, we propose two core adaptations: a novel tokenization stage and specialized training objectives.

To abstract away the data content of a file, we introduce “pattern tokenization”: assuming that structural properties are identifiable through special characters, we reduce all alphanumeric characters to a small set of general patterns. After tokenization, the input files are split into rows on newline characters, and a percentage of the special character tokens is masked before the rows are fed to the row encoder model. The row-transformer model is then trained on two objectives: reconstructing the masked tokens, and identifying whether pairs of rows belong to the same file.
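The tokenization and masking steps described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the two pattern classes ("D" for digit runs, "A" for letter runs), the "[MASK]" symbol, the masking ratio, and the function names pattern_tokenize and mask_special_tokens are all hypothetical choices made for this example.

```python
import random
import re

# Assumed pattern classes (not taken from the MaGRiTTE code base):
# runs of ASCII digits -> "D", runs of ASCII letters -> "A";
# every other character is kept as its own token.
TOKEN_RE = re.compile(
    r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.DOTALL
)
PATTERNS = {"digits": "D", "letters": "A"}
MASK = "[MASK]"  # hypothetical mask symbol


def pattern_tokenize(row):
    """Replace alphanumeric content with patterns, keep special characters."""
    tokens = []
    for match in TOKEN_RE.finditer(row):
        # lastgroup names the alternative that matched this piece.
        tokens.append(PATTERNS.get(match.lastgroup, match.group()))
    return tokens


def mask_special_tokens(tokens, mask_ratio, rng):
    """Mask a share of the special-character tokens of one row.

    Returns the masked row and a dict {position: original token} that a
    model would be trained to reconstruct.
    """
    special = [i for i, tok in enumerate(tokens) if tok not in PATTERNS.values()]
    n_mask = round(len(special) * mask_ratio)
    chosen = set(rng.sample(special, n_mask))
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in chosen}
    return masked, labels


row = '1;"Smith, John";3.5'
tokens = pattern_tokenize(row)
# ['D', ';', '"', 'A', ',', ' ', 'A', '"', ';', 'D', '.', 'D']
masked, labels = mask_special_tokens(tokens, mask_ratio=0.3, rng=random.Random(0))
```

Only delimiters, quotes, and other special characters are candidates for masking, so the reconstruction objective forces the model to predict structure rather than content. The second objective, deciding whether two rows come from the same file, is not covered by this sketch.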

Pattern tokenization

The row embeddings produced by the first stage are then used as the input for the file embedding stage of MaGRiTTE. In this stage, the encoder and decoder models are trained using the reconstruction loss on the row-embedding feature maps.

Due to the convolutional nature of the second stage, we fix the number of rows in the input files to a given value (selected as a result of hyperparameter tuning). If an input file has more rows, we truncate the excess; if it has fewer, we pad it by randomly replicating its existing rows.
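The row-count normalization can be sketched in a few lines. The function name fix_row_count, the target value used below, and the choice to append the replicated rows at the end of the file are assumptions for illustration; the text above only specifies truncation and padding by random replication.

```python
import random


def fix_row_count(rows, target, rng):
    """Return exactly `target` rows by truncating or padding `rows`."""
    if not rows:
        raise ValueError("cannot pad a file without rows")
    if len(rows) >= target:
        # Too many rows: drop everything past the target length.
        return rows[:target]
    # Too few rows: replicate randomly chosen existing rows.
    # Appending them at the end is an assumption of this sketch.
    padding = [rng.choice(rows) for _ in range(target - len(rows))]
    return rows + padding


rng = random.Random(0)
truncated = fix_row_count(["r1", "r2", "r3", "r4", "r5", "r6"], target=4, rng=rng)
padded = fix_row_count(["r1", "r2"], target=4, rng=rng)
```

Padding with replicated rows, rather than with empty rows, keeps the padded region structurally similar to the rest of the file.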

After training, the file-wise embedding vector is obtained as the innermost fixed-length feature vector of the autoencoder, i.e., the output of its encoder stage, which serves as the input of its decoder stage.

Fine-tuning for downstream tasks

We evaluate the effectiveness of the learned structural representations on three tasks to analyze unseen data files:

  1. Fine-grained dialect detection, i.e., identifying the structural role of characters within rows,
  2. Line and cell classification, i.e., identifying metadata, comments, and data within a file,
  3. Table extraction, i.e., identifying the boundaries of tabular regions.

We compare the use of MaGRiTTE encodings with state-of-the-art approaches that were specifically designed for these tasks. In future work, we aim to use MaGRiTTE embeddings in a generative fashion to perform structural data preparation, e.g., changing file dialects, or as metadata for structural indexing of files in data lakes.

An example of a task addressed with the structural embeddings of MaGRiTTE.

Code

MaGRiTTE is an open-source project. Its code is available on its project page on GitHub.

Publications

  • G. Vitagliano, M. Hameed, A. Sierra-Mùnera, F. Naumann: Embedding File Structure for Data Preparation. Under submission.
  • G. Vitagliano, M. Hameed, F. Naumann: Structural Embedding of Data Files with MaGRiTTE. Table Representation Workshop at NeurIPS. 2022.

Contact

This research project is conducted in the Data Preparation group. If you have any questions, interesting ideas, or requests, do not hesitate to contact Gerardo Vitagliano.