Maybe you've asked yourself from time to time, "What are these BibSonomy developers doing all day?" Of course the first answer is simple - we develop BibSonomy :) - but apart from that, most of us are researchers, running experiments, discussing results, writing papers - and the latter is sometimes rewarded: Our work "Stop thinking, start tagging - Tag Semantics emerge from Collaborative Verbosity" got accepted at this year's WWW conference in Raleigh, USA!
As you can guess from the title, the paper is basically concerned with emergent semantics. This term is often used to describe semantic structures that "grow" in a bottom-up and uncontrolled manner within collaborative tagging systems. For the case of emergent tag semantics, this means that although people are free to choose arbitrary tags (which leads to typical language-related phenomena like homonymy, polysemy, ...), one can successfully extract meaningful tag relations from the aggregated mass of tagged content. As an example, different people might use different tags to describe the web2.0 paradigm, possibly "web2.0", "web-2.0", "webtwo", "web20", "web.2.0", and many others. By using appropriate tag relatedness measures, one can identify those cases and extract a semantic "concept" web2.0 which all these users are talking about.
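To make the idea of a tag relatedness measure a bit more concrete, here is a minimal sketch. It assumes one common choice, cosine similarity between tag co-occurrence vectors, which is not necessarily the measure used in the paper. The toy `posts` data and the function names `cooccurrence_vectors` and `cosine` are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations
from math import sqrt

# Hypothetical toy data: each post is the set of tags
# one user assigned to one resource.
posts = [
    {"web2.0", "ajax", "social"},
    {"web20", "ajax", "social"},
    {"web-2.0", "ajax", "blog"},
    {"python", "programming", "code"},
    {"java", "programming", "code"},
]


def cooccurrence_vectors(posts):
    """Map each tag to a Counter of the tags it co-occurs with in a post."""
    vectors = defaultdict(Counter)
    for tags in posts:
        for a, b in permutations(tags, 2):
            vectors[a][b] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(posts)

# Spelling variants never appear together in one post, but they share
# the same neighbours, so their context vectors are similar.
for other in ("web20", "web-2.0", "python"):
    print(other, round(cosine(vectors["web2.0"], vectors[other]), 2))
```

In this toy example the spelling variants of web2.0 score high against each other because they share neighbouring tags like "ajax", while an unrelated tag like "python" scores zero. On real folksonomy data the vectors are of course much larger and noisier.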
Up to this point, there's nothing too new - the question we asked ourselves then was how the characteristics of individual users influence the quality of the learned semantic structures. One possibility is to distinguish users according to their tagging motivation into "Categorizers" and "Describers" - the first group uses a small and systematic vocabulary, whereas the latter uses a wealth of different keywords for annotation. Simply put, describers can be seen as the "verbose" users tagging with many keywords. So we split the whole folksonomy dataset into several partitions containing different mixtures of categorizers and describers. And here is an interesting thing we found:
On the x-axis, you see the percentage of included users. The y-axis depicts the quality of the inferred semantic tag relations (measured by grounding against a thesaurus; as we used the JCN distance, smaller values indicate better quality). The green line depicts the semantic quality obtained from the full dataset. The interesting thing is that with just 40% of the "talkative" describers, one can already reach the semantic precision of the full dataset! The best quality is found for 70% of describers. So the claim that "mass matters" holds only partially - a crucial aspect seems to be which kind of users the mass is composed of. The collaborative verbosity of describers seems to have a positive effect on the emergent semantics. On a more general level, this points to a causal link between tagging pragmatics (how people tag) and tag semantics (what tags mean). If you're interested in further details, we'd be happy to discuss them with you at WWW2010!
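For those who want to play with the idea: the partitioning step described above (rank users from describer to categorizer, then keep only the top x%) can be sketched roughly as follows. This is only an illustration under stated assumptions. The verbosity score used here, distinct tags per tagged resource, is a simple stand-in and not necessarily the measure from the paper, and the `assignments` data and the function names `verbosity_scores` and `top_describers` are hypothetical.

```python
from collections import defaultdict

# Hypothetical tag assignments: (user, resource, tag)
assignments = [
    ("alice", "r1", "web2.0"), ("alice", "r1", "ajax"), ("alice", "r1", "social"),
    ("alice", "r2", "blog"), ("alice", "r2", "news"), ("alice", "r2", "media"),
    ("bob", "r1", "web"), ("bob", "r2", "web"), ("bob", "r3", "web"),
    ("carol", "r1", "web2.0"), ("carol", "r1", "ajax"), ("carol", "r3", "code"),
    ("dave", "r4", "misc"), ("dave", "r5", "misc"),
]


def verbosity_scores(assignments):
    """Distinct tags per tagged resource for each user (an assumed proxy)."""
    tags = defaultdict(set)
    resources = defaultdict(set)
    for user, resource, tag in assignments:
        tags[user].add(tag)
        resources[user].add(resource)
    return {user: len(tags[user]) / len(resources[user]) for user in tags}


def top_describers(assignments, fraction):
    """Keep only the assignments of the most verbose `fraction` of users."""
    scores = verbosity_scores(assignments)
    # Highest score first; ties are broken by user name for determinism.
    ranked = sorted(scores, key=lambda user: (-scores[user], user))
    keep = set(ranked[:max(1, round(fraction * len(ranked)))])
    return [a for a in assignments if a[0] in keep]


# Build one partition per cut-off; each partition would then be fed
# into a tag relatedness measure and evaluated separately.
for fraction in (0.25, 0.5, 1.0):
    subset = top_describers(assignments, fraction)
    print(fraction, sorted({user for user, _, _ in subset}))
```

Each partition is itself a smaller folksonomy, so the same semantic quality evaluation can be run on it and compared against the full dataset.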