Hasso-Plattner-Institut für Softwaresystemtechnik
CD Datasets

Prof. Dr. Felix Naumann

Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Str. 2-3
D-14482 Potsdam, Germany

Dataset 1

This dataset contains 9763 CDs randomly extracted from FreeDB.

  • Dataset
    The data was converted from plain text to XML and packed into a ZIP archive.
  • Duplicates (298 objects)
    A list of all duplicates in the dataset.

  • Schema of the dataset
    The schema of the dataset, provided as a PDF file.
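
As a rough illustration of how the zipped XML data might be read, the following Python sketch counts record elements across all XML files in a ZIP archive. The element name "disc", the file name "cds.xml", and the helper names are assumptions made for this example only; the actual element names are defined in the schema PDF and should be substituted.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def count_records(archive, record_tag="disc"):
    """Count <record_tag> elements across all XML files in a ZIP archive.

    `archive` is a path or a file-like object. `record_tag` is a
    hypothetical element name; check the schema PDF for the real one.
    """
    total = 0
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip anything that is not an XML file
            with zf.open(name) as fh:
                # iterparse streams the file instead of loading it whole
                for _event, elem in ET.iterparse(fh, events=("end",)):
                    if elem.tag == record_tag:
                        total += 1
                        elem.clear()  # release the processed record
    return total


def make_sample_archive():
    """Build a tiny in-memory archive (made-up data) to try the function."""
    xml = "<discs><disc><title>A</title></disc><disc><title>B</title></disc></discs>"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("cds.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(count_records(make_sample_archive()))
```

With the real archive, pass its path instead of the in-memory sample, for example count_records("path/to/archive.zip", record_tag=...), using the record element name given in the schema.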

Dataset 2

This dataset consists of 500 clean CD objects extracted from the FreeDB database and 500 duplicates artificially generated with the Dirty XML Data Generator (one duplicate for each CD).