Hasso Plattner Institut

DuDe - Details

The bib-item deduplication service is powered by the DuDe toolkit, which supports duplicate detection on various types of input data. The toolkit offers a range of algorithms from the literature, several output formats, and utility classes, e.g., for gathering statistics or generating transitive closures. It is developed and maintained by the Information Systems Group at the Hasso Plattner Institute, Potsdam. Please visit the project's homepage for further details.

The following Java listing shows how to detect duplicates with DuDe. The backend of this service uses a slightly modified version of this snippet.

// ...

// initializes the data source: "bibtex" is the source id,
// bibtexFile represents the uploaded BibTeX file
BibtexSource source = new BibtexSource("bibtex", bibtexFile);
source.addIdAttributes(BibtexSource.KEY_ATTRIBUTE); // specifies the id attribute

// initializes the algorithm and enables in-memory processing
// the sorting key defines the sorting order within the Sorted Neighborhood Method (SNM)
SortingKey sortingKey = new SortingKey(new TextBasedSubkey("title"));
// the window size defines the search range within the sorted order
int windowSize = 30;
SortedNeighborhoodMethod algorithm = new SortedNeighborhoodMethod(sortingKey, windowSize);
algorithm.enableInMemoryProcessing();
algorithm.addDataSource(source);

// instantiates the used similarity function
SimilarityFunction similarityFunction = new BibtexSimilarityFunction();

// all duplicates are collected
List<DuDeObjectPair> result = new ArrayList<DuDeObjectPair>();

// duplicate detection: pairs with a similarity above the threshold of 0.9 are reported
for (DuDeObjectPair pair : algorithm) {
   if (similarityFunction.getSimilarity(pair) > 0.9) {
      result.add(pair);
   }
}

// clean up
algorithm.cleanUp();

// ...
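
To illustrate what the sorting key and the window size control, the following self-contained sketch generates candidate pairs in the style of a sorted neighborhood approach: records are sorted by a key, and each record is paired only with the records that follow it inside a sliding window. This is not DuDe's implementation. The class SnmSketch, the method candidatePairs, and the choice of the lower-cased title as the sorting key are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration class; it is not part of the DuDe toolkit.
class SnmSketch {

    /**
     * Sorts the titles and pairs each one with the next (windowSize - 1)
     * titles in sorted order. Only these pairs would later be passed to a
     * similarity function.
     */
    static List<String[]> candidatePairs(List<String> titles, int windowSize) {
        // ASSUMPTION: the sorting key is simply the lower-cased title.
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(Comparator.comparing(String::toLowerCase));

        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            // The window starting at position i covers positions i .. i + windowSize - 1.
            int end = Math.min(sorted.size(), i + windowSize);
            for (int j = i + 1; j < end; j++) {
                pairs.add(new String[] {sorted.get(i), sorted.get(j)});
            }
        }
        return pairs;
    }
}
```

In this sketch a larger window yields more candidate pairs (and more comparisons), while a window at least as large as the number of records degenerates to comparing every record with every other one.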

The bib-item Similarity Function

The similarity of two entries in a BibTeX file is computed as a weighted average of attribute-level similarity functions. Two attribute values are considered equal if their similarity is 1.0. The following table describes each of the similarity functions:

Attribute     Similarity Function                                           Weight
author        Jaccard Similarity (including recognition of abbreviations)   2
pages [1][2]  Levenshtein Distance 0 or 1:                                  2
                1 - Normalized Levenshtein Distance
              otherwise: 0.0
title         1 - Normalized Levenshtein Distance                           2
type          types equal: 1.0                                              2
              types are article and inproceedings: 0.5
              otherwise: 0.0
year [1]      years equal: 1.0                                              1
              difference of 1: 0.8
              otherwise: 0.0

[1] These similarity functions are ignored if at least one of the attribute's values is missing.
[2] All spaces and hyphens are removed before the similarity is calculated.
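
The rules in the table above can be sketched in plain Java as follows. This is a minimal illustration, not the code of BibtexSimilarityFunction. The class and method names are hypothetical, and two points are assumptions: the Levenshtein distance is normalized by the length of the longer string, and an "ignored" function is dropped from both the weighted sum and the total weight. The author similarity (Jaccard with abbreviation recognition) is not reimplemented here; its score is passed in by the caller.

```java
// Hypothetical illustration class; it is not part of the DuDe toolkit.
class BibSimilaritySketch {

    /** Classic dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    /** title: 1 - normalized Levenshtein distance (ASSUMPTION: normalized by the longer length). */
    static double levenshteinSimilarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    /** pages: spaces and hyphens are removed first; only distances 0 or 1 score above 0.0. */
    static double pagesSimilarity(String a, String b) {
        String x = a.replaceAll("[\\s-]", "");
        String y = b.replaceAll("[\\s-]", "");
        return levenshtein(x, y) <= 1 ? levenshteinSimilarity(x, y) : 0.0;
    }

    /** type: equal types score 1.0, article vs. inproceedings 0.5, anything else 0.0. */
    static double typeSimilarity(String a, String b) {
        if (a.equals(b)) return 1.0;
        boolean mixed = (a.equals("article") && b.equals("inproceedings"))
                || (a.equals("inproceedings") && b.equals("article"));
        return mixed ? 0.5 : 0.0;
    }

    /** year: equal years score 1.0, a difference of one year 0.8, anything else 0.0. */
    static double yearSimilarity(int a, int b) {
        int diff = Math.abs(a - b);
        return diff == 0 ? 1.0 : (diff == 1 ? 0.8 : 0.0);
    }

    /** Weighted average; a null score marks an ignored function (ASSUMPTION: its weight is excluded). */
    static double weightedAverage(Double[] scores, double[] weights) {
        double sum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] == null) continue;
            sum += weights[i] * scores[i];
            totalWeight += weights[i];
        }
        return totalWeight == 0.0 ? 0.0 : sum / totalWeight;
    }
}
```

For example, with the weights from the table in the order author, pages, title, type, year (2, 2, 2, 2, 1), an entry pair with identical authors and titles, a missing pages value, the types article and inproceedings, and years one apart would be scored by passing `{1.0, null, 1.0, 0.5, 0.8}` to `weightedAverage`.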