Friday, November 2, 2007

Detecting duplicates in BibSonomy

One feature we added recently is the detection of duplicate references in a user's publication list. During the design of the system we discussed how to find links between references of different users when the entries are not identical. This meant solving two problems: first, we have to find the duplicate entries, and second, the lookup has to be fast, since nearly all pages check for duplicate entries to provide smooth browsing.

The solution we came up with was hash keys. The system can handle four different hash keys. Currently we use two of them, the intrahash and the interhash.

The intrahash avoids duplicates in a user's library and matches only entries that are almost identical. To compute this hash we use the title, type, author, editor, year, journal, booktitle, volume, and number fields with only minor normalization. This hash also ensures that a user can have a given publication only once in their library, but two entries have to be nearly 100% identical to count as the same publication.

The interhash key was designed to find as many similar publications as possible, to support browsing within the system and to point users to other users with similar interests. The hash key is therefore based only on heavily normalized title, year, and author/editor information. In this way we can also identify entries that differ in spelling, e.g. of author names.

The new duplicate detection feature uses the interhash to detect duplicates in a user's library. As the intrahash key reacts to nearly every change in an entry, it allows very similar entries to be stored, e.g. ones that differ only by a small change in the booktitle. The interhash key is able to detect those similar entries and list all publications that appear at least twice within the user's publication list. By checking this list you can remove unwanted duplicates and clean up your publication list.
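The listing step can be sketched as a simple grouping by key. Again, this is only an illustration in Python under stated assumptions: `find_duplicates` and `simple_key` are hypothetical names, and `simple_key` is a crude stand-in for the real interhash (lowercased alphanumeric title plus year).

```python
from collections import defaultdict


def simple_key(entry):
    # Stand-in for the interhash (assumption): normalized title plus year.
    title = "".join(ch for ch in entry.get("title", "").lower() if ch.isalnum())
    return (title, str(entry.get("year", "")))


def find_duplicates(entries, key=simple_key):
    # Group one user's entries by key and keep the groups that
    # contain at least two entries.
    groups = defaultdict(list)
    for entry in entries:
        groups[key(entry)].append(entry)
    return [group for group in groups.values() if len(group) > 1]


library = [
    {"title": "Detecting Duplicates", "year": "2007", "booktitle": "Proc. A"},
    {"title": "Detecting duplicates.", "year": "2007", "booktitle": "Proceedings A"},
    {"title": "Another Paper", "year": "2006", "booktitle": "Proc. B"},
]
duplicates = find_duplicates(library)
```

Here `duplicates` holds one group with the two "Detecting Duplicates" entries, which differ only in capitalization, punctuation, and booktitle. Because each entry needs just one key computation and one dictionary lookup, the check stays cheap enough to run on nearly every page.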

We hope this feature is helpful. Have fun!

Andreas