Friday, October 7, 2011

Feature of the week: person name normalization in "Last, First" form

As announced earlier, the last release changed the format of person names in the author and editor fields of publication posts from First Last to Last, First. Since this has quite a few implications, I would like to discuss the changes in a bit more detail in this feature of the week.

Why "Last, First"?


The change from the First Last format (e.g., "D.E. Knuth") to Last, First ("Knuth, D.E.") was an overdue step that has been requested by many users of BibSonomy (see also the comments on our blog post).

The old format did not allow our users to correctly store person names that contain two last names. For example, in the name of our colleague Beate Navarro Bullock, the first name was erroneously detected as "Beate Navarro", although it really is only "Beate". When such data was exported from BibSonomy, other systems could not repair it. For instance, in a literature list the reference would have been sorted into the wrong position (under "B" instead of "N" in our example). Furthermore, name-based citations like [Navarro Bullock et al., 2009] would have used the wrong last name ("Bullock" instead of "Navarro Bullock").

With the new Last, First format (Navarro Bullock, Beate), the name is correctly stored and recognized by other applications and by BibSonomy itself (e.g., for the author pages). BibSonomy no longer "destroys" correctly entered names!

By the way, the new format also correctly handles the lineage (e.g., "Jr.") of persons. You can enter it in the form Last, Jr, First and you will get it back in the same format.
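To make the difference between the two formats concrete, here is a minimal Python sketch of how a name string could be split. It is not BibSonomy's actual parser: the names PersonName, parse_person_name, and to_last_first are made up for this illustration, and the rule "the final word is the last name" for First Last input is only an assumed heuristic that reproduces the problem described above.

```python
from typing import NamedTuple


class PersonName(NamedTuple):
    """Hypothetical container for the parts of a person name."""
    first: str
    last: str
    lineage: str = ""


def parse_person_name(raw: str) -> PersonName:
    """Split 'First Last', 'Last, First', or 'Last, Jr, First'."""
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) == 1:
        # "First Last": the boundary is not given, so we have to guess.
        # Assumed heuristic: the final word is the last name.
        tokens = parts[0].split()
        if not tokens:
            raise ValueError("empty name")
        return PersonName(first=" ".join(tokens[:-1]), last=tokens[-1])
    if len(parts) == 2:
        # "Last, First": the comma marks the boundary explicitly.
        last, first = parts
        return PersonName(first=first, last=last)
    if len(parts) == 3:
        # "Last, Jr, First": the middle part is the lineage.
        last, lineage, first = parts
        return PersonName(first=first, last=last, lineage=lineage)
    raise ValueError(f"too many commas in name: {raw!r}")


def to_last_first(name: PersonName) -> str:
    """Render a name as 'Last, First' or 'Last, Jr, First'."""
    pieces = [name.last, name.lineage, name.first]
    return ", ".join(p for p in pieces if p)
```

With this sketch, parse_person_name("Beate Navarro Bullock") guesses the last name "Bullock", whereas parse_person_name("Navarro Bullock, Beate") keeps "Navarro Bullock" intact, because the comma removes the need to guess.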

Interaction with BibSonomy


Naturally, this change is of great importance to all users and applications that import data into BibSonomy or export data from it. Since BibSonomy supports a wide variety of import and export formats, I will briefly explain here how each format is handled.

  • In general, each input format supports both First Last and Last, First. This includes XML and JSON via BibSonomy's REST-API as well as the BibTeX and EndNote import in the web interface.
  • When editing a publication reference, you can basically use whichever format you like. However, after you save the post, the names are normalized, and the next time you edit it they will be in Last, First order.
    This has the advantage that you can see whether both parts of the name were correctly identified.
  • Both the XML (format=xml) and the JSON (format=json) output of BibSonomy's REST-API use Last, First for the author and editor attributes of the bibtex element.
  • Nothing has changed in BibSonomy's "regular" JSON export. Person names are returned in "First Last" form:
    "author": [ 
        "Douglas Crockford"
      ],
    
    After the next release (scheduled for October 26th), the JSON output will additionally include the fields "authors" and "editors" (notice the "s" at the end) with separated "first" and "last" parts:
    "authors": [ 
        {"first" : "Douglas", "last" : "Crockford"}
      ],
    
  • The BibTeX export now returns authors and editors in Last, First form. You can change this to First Last by adding the parameter firstLastNames=on to the URL. Alternatively, you can use our new export dialog, which is triggered by moving the mouse over the BibTeX export link on each page.
  • The EndNoteRIS (EndNote) and RIS (ReferenceManager) exports use the Last, First format.
  • For all other export formats we did not change the person name format (to the best of my knowledge).
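For scripts that fetch the BibTeX export, the firstLastNames=on parameter mentioned above can be appended programmatically. The following Python sketch uses only the standard library; the function name with_first_last_names and the example URL are placeholders chosen for this illustration, not actual BibSonomy addresses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_first_last_names(export_url: str) -> str:
    """Return export_url with firstLastNames=on set in its query string."""
    parts = urlsplit(export_url)
    # Keep existing parameters, but drop any earlier firstLastNames value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "firstLastNames"
    ]
    query.append(("firstLastNames", "on"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL, used only to show the effect.
print(with_first_last_names("https://example.org/bib/user/someone?items=50"))
```

Parameters that are already present in the URL are preserved, so the helper can be applied to any export link regardless of its other options.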
Please note that most applications (like BibTeX, JabRef, Citavi, or EndNote) support both formats anyway, so Last, First should not cause you any problems. Rather, it makes it possible to correctly represent more types of person names than First Last.
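A client of the "regular" JSON export could prefer the separated "authors" field once it is available and fall back to the plain "author" strings otherwise. The Python sketch below assumes a post looks like the JSON fragments shown in the list above; the function name author_display_names is invented for this example, and the surrounding structure of the real export may differ.

```python
import json


def author_display_names(post: dict) -> list:
    """Return author names, as 'Last, First' where the parts are known."""
    if "authors" in post:
        # Separated parts are available, so no guessing is needed.
        return [f'{a["last"]}, {a["first"]}' for a in post["authors"]]
    # Only "First Last" strings are available. They are returned
    # unchanged, because splitting them would be guesswork.
    return list(post.get("author", []))


# Sample data modelled on the fragments shown above.
sample = json.loads("""
{
  "author": ["Douglas Crockford"],
  "authors": [{"first": "Douglas", "last": "Crockford"}]
}
""")
print(author_display_names(sample))
```

Leaving the fallback strings untouched is a deliberate choice in this sketch: a client cannot reliably tell where the first name ends, which is exactly the problem the Last, First format solves.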

What else changed?

During our intensive tests of the new format, we realized that BibSonomy contains quite a lot of data that is, to be honest, broken, dirty, and inconsistent. Examples are author fields like "A. Einstein and" or "Knuth, D.E., Kleinberg, J.". In short: strings for which there is often no hope that we can clean them automatically and correctly.
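The two kinds of breakage in these examples can at least be detected with simple checks. The Python sketch below is not the check BibSonomy uses; it assumes BibTeX-style author fields in which persons are separated by the word "and" and a single person contains at most two commas (as in Last, Jr, First). The names split_persons and find_name_problems are made up for this illustration.

```python
def split_persons(author_field: str) -> list:
    """Split an author field into persons at the separator word 'and'."""
    persons, current = [], []
    for token in author_field.split():
        if token == "and":
            persons.append(" ".join(current))
            current = []
        else:
            current.append(token)
    persons.append(" ".join(current))
    return persons


def find_name_problems(author_field: str) -> list:
    """Return human-readable descriptions of suspicious name parts."""
    problems = []
    for person in split_persons(author_field):
        if not person:
            # Happens for a leading, trailing, or doubled "and".
            problems.append("empty name next to 'and'")
        elif person.count(",") > 2:
            # More commas than 'Last, Jr, First' allows: probably
            # several persons joined by commas instead of 'and'.
            problems.append(f"too many commas in {person!r}")
    return problems


for field in ["A. Einstein and", "Knuth, D.E., Kleinberg, J."]:
    print(field, "->", find_name_problems(field))
```

Detecting such strings is the easy part. Repairing them is not: in "Knuth, D.E., Kleinberg, J." a program cannot know for certain which comma separates two persons and which separates last name from first name.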
More important, however, was the fact that the new normalization changed the hashes of some posts, i.e., their unique identifiers. When we realized this, we thought about the implications and finally looked at the numbers: of 2795609 posts, only 20215 (less than one percent) changed. And, as said, almost all of these posts had broken or "dirty" person names and clearly stemmed from broken batch imports.
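Why a change in normalization changes a hash can be shown with a toy example in Python. Everything in it is an assumption made for illustration: the two normalization rules, the choice of MD5, and the way author and title are combined are not BibSonomy's actual hash definition. The point is only that an identifier computed from normalized text changes whenever the normalized text does, and stays the same for clean names.

```python
import hashlib


def normalize_old(author: str) -> str:
    # Hypothetical "old" rule: collapse whitespace and lowercase.
    return " ".join(author.split()).lower()


def normalize_new(author: str) -> str:
    # Hypothetical "new" rule: as above, but drop a dangling "and".
    tokens = author.split()
    while tokens and tokens[-1] == "and":
        tokens.pop()
    return " ".join(tokens).lower()


def post_hash(author: str, title: str, normalize) -> str:
    """Toy identifier: MD5 over the normalized author and title."""
    key = normalize(author) + "|" + " ".join(title.split()).lower()
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def hash_changed(author: str, title: str) -> bool:
    old = post_hash(author, title, normalize_old)
    new = post_hash(author, title, normalize_new)
    return old != new


print(hash_changed("Knuth, D.E.", "Example title"))      # clean name
print(hash_changed("A. Einstein and", "Example title"))  # dirty name

# Share of changed posts, using the numbers given above.
changed_share = 20215 / 2795609
print(f"{changed_share:.2%}")
```

The last line reproduces the "less than one percent" figure from the numbers in the text: 20215 of 2795609 posts is roughly 0.72 percent.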