Thursday, November 29, 2007

Feature of the Week: Character encoding of imported files

For this week's feature of the week, I'll first briefly discuss what a "character encoding" is and then explain why it matters during BibTeX import.

On a very low level, computers only understand zeros and ones. Hence, a mechanism is needed to encode symbols like letters and numbers as sequences of zeros and ones. A "table" which assigns to each symbol its corresponding zero-one sequence is called a character encoding (or character set). This table allows a computer to interpret the data in a file and show the correct symbol on the screen or in print. Unfortunately, several such character encodings exist. Depending on the chosen character encoding, the same sequence of ones and zeros might stand for different symbols. To display a piece of data correctly, the computer must know how to interpret it, that is, it must know its character encoding.
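To make this concrete, here is a small Python sketch showing how the same bytes turn into different symbols depending on the encoding used to read them. The name "Müller" is just a made-up example value, not something taken from BibSonomy.

```python
# The same byte sequence, read with two different character encodings.
# "Müller" is a hypothetical example name chosen for its umlaut.
data = "Müller".encode("utf-8")   # in UTF-8, "ü" is stored as two bytes

as_utf8 = data.decode("utf-8")         # correct interpretation
as_latin1 = data.decode("iso-8859-1")  # wrong interpretation, but no error

print(len(data))   # 7 bytes for 6 characters
print(as_utf8)     # Müller
print(as_latin1)   # MÃ¼ller
```

Note that the wrong interpretation does not raise an error. The bytes are perfectly valid ISO-8859-1; they just stand for different symbols there.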

When uploading a BibTeX (or EndNote) file to BibSonomy, we face the same problem: we have to interpret the file with the correct character encoding. Typically, it's not possible to guess the encoding reliably, because the file is just a sequence of bytes and more than one interpretation can be valid. Therefore, there is an option on the post_bibtex page which allows you to specify the character encoding of the file to upload. A click on the options link reveals a dropdown list with a choice of typical character encodings. The default is "UTF-8", which is becoming more and more common. However, older files might have a different encoding like "ISO-8859-1" (also known as "latin1"), which is very common in Europe. If you're unsure about your data, UTF-8 is a good choice. If this gives you errors during import or strange-looking characters afterwards, try another encoding, starting with "ISO-8859-1".
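If you want to check a file yourself before uploading, the "try UTF-8 first, then another encoding" advice can be sketched in a few lines of Python. This is only an illustration, not how BibSonomy itself handles imports; the function name decode_with_fallback and the list of candidate encodings are assumptions made for this example.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes,
    encodings: Sequence[str] = ("utf-8", "iso-8859-1"),
) -> Tuple[str, str]:
    """Try each candidate encoding in order; return (text, encoding used).

    Hypothetical helper for illustration only. A successful decode does
    not prove the encoding is the right one; it only means the bytes are
    valid in that encoding.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding, try the next
    raise ValueError("none of the candidate encodings could decode the data")


# A BibTeX-like snippet saved as ISO-8859-1 is not valid UTF-8,
# so the helper falls back to the second candidate.
old_file = "author = {Müller}".encode("iso-8859-1")
text, used = decode_with_fallback(old_file)
print(used)
print(text)
```

Since every byte sequence is valid ISO-8859-1, the fallback in this sketch never fails. That is convenient, but it also means you should still look at the result for strange characters.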
