Monday, June 20, 2011

Recommendation with Mahout

The recommendation engine just gave its first recommendations :)

I converted the taste dump Simon (my mentor) sent me into a Mahout data source and ran a 10-recommendation query for simon@buddycloud.com.

I used a boolean-preference, user-based recommender built on a log-likelihood user similarity strategy.
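For reference, here is a minimal sketch of how that wiring looks with Mahout's Taste API. The file name, neighborhood size, and numeric IDs are illustrative, not the project's actual code (Taste only handles numeric IDs, so JIDs and channel names have to be mapped to longs first, e.g. with MemoryIDMigrator):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ChannelRecommender {
    public static void main(String[] args) throws Exception {
        // taste.csv holds boolean preferences: one "userID,itemID" pair
        // per line, with no preference value.
        DataModel model = new FileDataModel(new File("taste.csv"));

        // Log-likelihood similarity suits boolean (follow/no-follow) data,
        // since it ignores preference values entirely.
        UserSimilarity similarity = new LogLikelihoodSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);

        long userId = 12345L; // hypothetical mapped ID, e.g. for simon@buddycloud.com
        List<RecommendedItem> recommendations = recommender.recommend(userId, 10);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}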

The query results follow on: [/channel/beer, /channel/buddycloudscocktailbar, /channel/football, /channel/cebit, /channel/bbc-news, /channel/hangover, /channel/welovemusic, /channel/lyrics, /channel/heavymetal, /user/blugeni@buddycloud.com/channel]

And, according to Simon:

Simon Tennant: hello. Yes, those suggestions seem about right!
Simon Tennant: wow!

For a data set of ~10k users, the query responses are pretty much real-time.

The next step consists of writing a channel similarity query handler, so new users can get recommendations based on a given channel.
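Mahout should make that part straightforward: its item-based recommender exposes a most-similar-items query over the same log-likelihood similarity. A rough sketch of how the handler's core could look (file name and IDs are again illustrative):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SimilarChannels {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("taste.csv"));

        // LogLikelihoodSimilarity doubles as an ItemSimilarity, so the
        // same strategy can compare channels instead of users.
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);

        long channelId = 42L; // hypothetical mapped ID, e.g. for /channel/beer
        List<RecommendedItem> similar = recommender.mostSimilarItems(channelId, 10);
        for (RecommendedItem item : similar) {
            System.out.println(item.getItemID());
        }
    }
}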

Monday, June 13, 2011

Pub-sub search engine: Progress report

Things are moving very fast this year. It's June already, and lots of code has been written :)

For this year's GSoC I am implementing a Pub-sub search engine for XMPP, with search, recommendation, and crawling capabilities. I've been working with the buddycloud guys and it's been great.

We have weekly meetings via a MUC (seehaus@channels.buddycloud.com).
Project details can be viewed at github (https://github.com/buddycloud/channel-directory) and at the buddycloud wiki (http://buddycloud.org/wiki/Main_Page).
And now I am also using this blog for progress reports.

This project can be split into three main phases: the search engine, the recommendation engine, and the crawler. So far I have finished the first phase, which consists of implementing an XMPP component that enables a client to search for nearby channels, channels' metadata, and posts' content. For this part of the project, we used the Apache Solr search engine (http://lucene.apache.org/solr/), which indexes documents in a well-defined XML format. All I had to do was set up two Solr cores, one for channels and another for posts. Then the crawler gathers new/updated content from the channel server firehose and commits it to Solr.
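For the curious, committing a document to one of the cores through SolrJ boils down to something like this. This is just a sketch: the core URL and the field names are illustrative and would have to match the core's schema.xml:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PostIndexer {
    public static void main(String[] args) throws Exception {
        // Each core gets its own endpoint; here, the "posts" core.
        SolrServer posts = new CommonsHttpSolrServer("http://localhost:8983/solr/posts");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "some-post-id");
        doc.addField("content", "Anyone up for a beer tonight?");
        doc.addField("parent", "/channel/beer");

        posts.add(doc);
        posts.commit(); // make the document visible to searches
    }
}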

An architecture diagram may help:

[architecture diagram]
As a nice side effect of writing the code that commits documents to Solr, part of the crawler mechanism is already implemented.

The prototype for the search engine is fully functional and can be used for client testing, as long as you have some way to commit documents into Solr. Now I am studying Mahout for the recommendation engine. It seems we are on the right track to have it running sooner than expected.