Friday, February 19, 2010

Digging up Resources - Fulltext search in BibSonomy

Background:
For a while now we have been redesigning BibSonomy's full text search backend, and we have now decided that it is mature enough to handle all of BibSonomy's search requests.

Our old backend was based on MySQL, using the MyISAM storage engine. But with all your posts enlarging the search index each day, we had nearly reached our server's capacity. Looking for a more efficient way of implementing full text search, we came across Lucene, a highly optimized search engine library that has been part of the Apache Jakarta project family since September 2001.

Now all of BibSonomy's full text search queries are handled by two redundant Lucene indexes, which are updated alternately every 5 minutes.
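
To illustrate the idea of two alternately updated indexes, here is a minimal sketch in Python. It is not BibSonomy's actual Lucene code: the class `AlternatingIndexes`, its methods, and the assumption that searches are always served by the most recently updated copy are all hypothetical and only meant to show the principle.

```python
class AlternatingIndexes:
    """Toy model of two redundant indexes that are updated in turns (hypothetical)."""

    def __init__(self):
        self.indexes = [set(), set()]   # two redundant copies of the index
        self.pending = [[], []]         # new posts not yet written to each copy
        self.active = 0                 # copy that currently answers searches

    def add_post(self, title):
        # A new post has to reach both copies eventually.
        for queue in self.pending:
            queue.append(title)

    def update_cycle(self):
        # Called periodically (every 5 minutes in the setup described above).
        # Only the standby copy is written; searches keep using the other one.
        standby = 1 - self.active
        self.indexes[standby].update(self.pending[standby])
        self.pending[standby].clear()
        self.active = standby           # the freshly updated copy takes over

    def search(self, term):
        term = term.lower()
        return sorted(t for t in self.indexes[self.active] if term in t.lower())
```

In this sketch a new post becomes searchable after the next update cycle, and the second copy catches up one cycle later, so no copy is ever searched while it is being written.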

Impact on your daily "BibSonomy-Experience":
First of all, switching to Lucene was an important step in preparing our servers for even more users joining the BibSonomy community, as the search task is now separated and can be distributed among several independent machines. Secondly, we hope to decrease BibSonomy's already small response time. Finally, we now support more sophisticated search queries like "collaborative AND (b*marking OR ressource*)".
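
To give a feeling for what the example query means, here is a small Python sketch that evaluates exactly this one query against a piece of text. It is only an approximation under simple assumptions (whitespace tokenization, lowercasing, shell-style wildcards via `fnmatch`); Lucene's real query parser and analyzers work differently, and the function names are made up for illustration.

```python
from fnmatch import fnmatchcase


def has_term(tokens, pattern):
    """True if any token matches the pattern ('*' stands for any characters)."""
    return any(fnmatchcase(token, pattern) for token in tokens)


def matches_example_query(text):
    """Evaluate: collaborative AND (b*marking OR ressource*)"""
    tokens = text.lower().split()
    return has_term(tokens, "collaborative") and (
        has_term(tokens, "b*marking") or has_term(tokens, "ressource*")
    )
```

A text such as "Collaborative bookmarking systems" matches, because "bookmarking" fits the pattern "b*marking", while "collaborative tagging" does not, because neither of the two alternatives in the parentheses is present.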

If you have any suggestions or encounter any problems, please contact us.

Happy Tagging!

Friday, February 5, 2010

Stop thinking, start tagging - Tag Semantics emerge from Collaborative Verbosity

Maybe you've asked yourself from time to time "What are these BibSonomy developers doing all day?" Of course the first answer is simple - we develop BibSonomy :) - but apart from that, most of us are researchers, running experiments, discussing results, writing papers - and the latter is sometimes rewarded: Our work "Stop thinking, start tagging - Tag Semantics emerge from Collaborative Verbosity" got accepted at this year's WWW conference in Raleigh, USA!

As you can guess from the title, the paper is basically concerned with emergent semantics. This term is often used to describe semantic structures that "grow" in a bottom-up and uncontrolled manner within collaborative tagging systems. For the case of emergent tag semantics this means that although people are free to choose arbitrary tags (which leads to typical language-related phenomena like homonymy, polysemy, ...), one can successfully extract meaningful tag relations from the aggregated mass of tagged content. As an example, different people might use different tags to describe the web2.0 paradigm, possibly "web2.0", "web-2.0", "webtwo", "web20", "web.2.0", and many others. By using appropriate tag relatedness measures, one can identify those cases and extract a semantic "concept" web2.0 which all these users are talking about.
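
As a rough illustration of what a tag relatedness measure can look like, the following Python sketch compares tags by the cosine similarity of their co-occurrence vectors. This is one common choice and an assumption on our side for this example; it is not necessarily the measure used in the paper, and the tiny set of posts is invented.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(posts):
    """For every tag, count how often each other tag appears in the same post."""
    vectors = defaultdict(Counter)
    for tags in posts:
        for tag in tags:
            for other in tags:
                if other != tag:
                    vectors[tag][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Invented example posts; each post is the set of tags one user assigned.
posts = [
    {"web2.0", "ajax", "social"},
    {"web20", "ajax", "social"},
    {"web2.0", "blog"},
    {"cooking", "recipe"},
]
vectors = context_vectors(posts)
```

Here "web2.0" and "web20" never appear in the same post, yet they share the contexts "ajax" and "social", so `cosine(vectors["web2.0"], vectors["web20"])` is high, whereas the similarity between "web2.0" and "cooking" is zero.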

Up to this point, there's nothing too new - the question we asked ourselves then was how the characteristics of individual users influence the quality of the learned semantic structures. One possibility is to distinguish users according to their tagging motivation into "Categorizers" and "Describers" - the first group uses a small and systematic vocabulary, whereas the latter uses a wealth of different keywords for annotation. Simply put, describers can be seen as the "verbose" users tagging with many keywords. So we split the whole folksonomy dataset into several partitions containing different mixtures of categorizers and describers. And here is an interesting thing we found:
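
The following Python sketch shows one simple way such a partition could be built. It is an illustration only: the verbosity score (distinct tags per post) is an assumed stand-in for the measures of tagging motivation used in the paper, and the function names and user data are invented.

```python
def verbosity(posts):
    """Assumed proxy for 'describer-ness': distinct tags per post."""
    if not posts:
        return 0.0
    vocabulary = set().union(*posts)
    return len(vocabulary) / len(posts)


def top_describers(user_posts, fraction):
    """Return the given fraction of users with the highest verbosity score."""
    ranked = sorted(user_posts, key=lambda u: verbosity(user_posts[u]), reverse=True)
    keep = round(len(ranked) * fraction)
    return ranked[:keep]


# Invented users; each post is the set of tags assigned to one resource.
user_posts = {
    "alice": [{"python", "code", "tutorial", "web"}, {"lucene", "search", "index"}],
    "bob": [{"work"}, {"work"}, {"read"}],
    "carol": [{"semantic", "web", "rdf"}, {"ontology", "owl"}],
    "dave": [{"todo"}, {"todo"}],
}
```

With this data, `top_describers(user_posts, 0.5)` selects alice and carol, the two users with the richest vocabulary per post, while bob and dave with their small, repetitive vocabulary would count as categorizers.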

In the plot, the x-axis shows the percentage of included users. The y-axis depicts the quality of the inferred semantic tag relations (measured by grounding against a thesaurus; as we used the JCN distance, smaller values indicate better quality). The green line depicts the semantic quality obtained from the full dataset. The interesting thing is that with only 40% of the "talkative" describers, one can already reach the semantic precision of the full dataset! The best quality is found for 70% of describers. So the claim that "mass matters" holds only partially - a crucial aspect seems to be what kind of users the mass is composed of. The collaborative verbosity of describers seems to have a positive effect on the emergent semantics. On a more general level, this exhibits a causal link between tagging pragmatics (how people tag) and tag semantics (what tags mean). If you're interested in further details, we'd be happy to discuss them with you at WWW2010!
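
For readers wondering why smaller values are better: assuming the standard definition of the Jiang-Conrath (JCN) distance, it is computed from the information content (IC) of two concepts and of their least common subsumer (lcs) in the thesaurus hierarchy. The symbols below follow that standard definition and are not taken from the paper itself.

```latex
\mathrm{IC}(c) = -\log p(c), \qquad
d_{\mathrm{JCN}}(c_1, c_2) = \mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2\,\mathrm{IC}\bigl(\mathrm{lcs}(c_1, c_2)\bigr)
```

Here p(c) is the probability of encountering concept c in a reference corpus. Two identical concepts have distance 0, and the distance grows the less specific the closest concept that subsumes both of them is, so a small average distance between a tag and its inferred related tags indicates semantically close pairs.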