Monday, July 25, 2011

Crawling

Just finished writing the pub-sub crawler. Difficult? No. Actually, it was pretty easy with the Smackx PubSub API.

How does it work? Well, first the user declares a list of pubsub servers to be crawled in the properties file. Then the crawler sends a disco#info packet to each of these servers, asking for its public channels.

After getting that channel list, the crawler invokes several NodeCrawlers, which crawl information from each node. Currently we crawl node metadata, items (posts) and subscribers (followers). This data will feed our Solr engine and a PostgreSQL database that Mahout will use for user recommendation.
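As a rough illustration of the first step described above (and not the actual crawler, which is written against Smack), here is a minimal Python sketch that reads a server list from properties-style text and builds one disco#info request per server. The `crawler.servers` key, the `crawl-N` stanza ids and the function names are all hypothetical; only the disco#info namespace comes from XEP-0030.

```python
import xml.etree.ElementTree as ET

DISCO_INFO_NS = "http://jabber.org/protocol/disco#info"  # XEP-0030 namespace


def parse_servers(properties_text, key="crawler.servers"):
    """Read a comma-separated server list from Java-style properties text.

    The key name is hypothetical; the real properties file may differ.
    """
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blanks, comments and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() == key:
            return [s.strip() for s in value.split(",") if s.strip()]
    return []


def build_disco_info_iq(server, iq_id):
    """Build the disco#info request stanza addressed to one pubsub server."""
    iq = ET.Element("iq", {"id": iq_id, "to": server, "type": "get"})
    ET.SubElement(iq, "query", {"xmlns": DISCO_INFO_NS})
    return ET.tostring(iq, encoding="unicode")


def build_requests(properties_text):
    """One request per configured server, with sequential (made-up) ids."""
    servers = parse_servers(properties_text)
    return [build_disco_info_iq(s, f"crawl-{i}") for i, s in enumerate(servers)]
```

Calling `build_requests("crawler.servers = a.example.org, b.example.org")` would yield two stanzas, one per server. Sending them and handling the replies is left out on purpose.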

The crawler is ready, and now I am looking forward to testing it against the new buddycloud channel server!

Friday, July 8, 2011

buddycloud's search engine is up and running!

As the title says, we have put the search engine to work :)

The prototype returns data provided by the buddycloud team, which is a snapshot of their database. Available queries cover nearby channels, channel metadata and post content, as well as channel recommendations for users and channel similarity. The syntax of these queries (and their responses) can be found at the Channel Directory protocol page.

The XMPP component is running at search.buddycloud.com. Please try it! Some example queries (and their semantics) follow:

Querying for the 10 nearest channels from lat/lon (45.4, 12.3):

<iq id="a65d24" to="search.buddycloud.com" from="abmar@abmar-pc" type="get">
<query xmlns="http://buddycloud.com/channel_directory/nearby_query">
<point lat="45.4" lon="12.3"/>
<set xmlns="http://jabber.org/protocol/rsm">
<max>10</max>
</set>
</query>
</iq>

Querying for the first 10 channels that contain 'Beer' in their metadata:

<iq id='oseq987e' to='search.buddycloud.com' from='abmar@abmar-pc' type='get'>
<query xmlns='http://buddycloud.com/channel_directory/metadata_query'>
<search>Beer</search>
<set xmlns='http://jabber.org/protocol/rsm'>
<max>10</max>
</set>
</query>
</iq>

Querying for the last 5 posts that contain 'Justin Bieber':

<iq id='oss09dje' to='search.buddycloud.com' from='abmar@abmar-pc' type='get'>
<query xmlns='http://buddycloud.com/channel_directory/content_query'>
<search>Justin bieber</search>
<set xmlns='http://jabber.org/protocol/rsm'>
<max>5</max>
</set>
</query>
</iq>

Asking for 5 channel recommendations for 'simon@buddycloud.com':

<iq id='oseq987e' to='search.buddycloud.com' from='abmar@abmar-pc' type='get'>
<query xmlns='http://buddycloud.com/channel_directory/recommendation_query'>
<user-jid>simon@buddycloud.com</user-jid>
<set xmlns='http://jabber.org/protocol/rsm'>
<max>5</max>
</set>
</query>
</iq>

Asking for 10 similar channels to '/channel/food':

<iq id='oseq987e' to='search.buddycloud.com' from='abmar@abmar-pc' type='get'>
<query xmlns='http://buddycloud.com/channel_directory/similar_channels'>
<channel-jid>/channel/food</channel-jid>
<set xmlns='http://jabber.org/protocol/rsm'>
<max>10</max>
</set>
</query>
</iq>

Monday, June 20, 2011

Recommendation with Mahout

The recommendation engine just gave its first recommendations :)

I converted the taste dump that Simon (my mentor) sent me into a Mahout data source and ran a 10-recommendation query for simon@buddycloud.com.

I used a boolean preference user-based recommender, based on a log-likelihood user similarity strategy.

The query results follow: [/channel/beer, /channel/buddycloudscocktailbar, /channel/football, /channel/cebit, /channel/bbc-news, /channel/hangover, /channel/welovemusic, /channel/lyrics, /channel/heavymetal, /user/blugeni@buddycloud.com/channel]
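To give an idea of what a log-likelihood similarity over boolean preferences measures, here is a small Python sketch of Dunning's log-likelihood ratio applied to two users' sets of followed channels. It is only an illustration of the statistic, not Mahout's code; the function names are made up, and mapping the ratio into [0, 1) with 1 - 1/(1 + LLR) is an assumption about how such a score might be normalized.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a group of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's LLR for a 2x2 contingency table.

    k11: channels both users follow, k12: only the first user,
    k21: only the second user, k22: channels neither follows.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def user_similarity(channels_a, channels_b, num_channels):
    """Map the LLR of two users' channel sets into [0, 1) (assumed mapping)."""
    both = len(channels_a & channels_b)
    only_a = len(channels_a) - both
    only_b = len(channels_b) - both
    neither = num_channels - both - only_a - only_b
    llr = log_likelihood_ratio(both, only_a, only_b, neither)
    return 1.0 - 1.0 / (1.0 + llr)
```

A user-based recommender would then pick the most similar users under such a score and suggest channels they follow that the target user does not.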

And, according to Simon:

Simon Tennant: hello. Yes, those suggestions seem about right!
Simon Tennant: wow!

For a data source of ~10k users, the query responses come back in near real time.

The next step consists of writing a channel similarity query handler, so new users can get recommendations based on a given channel.

Monday, June 13, 2011

Pub-sub search engine: Progress report

Things are moving very fast this year. It's June already, and lots of code has been written :)

For this year's GSoC I am implementing a pub-sub search engine for XMPP, which has search, recommendation and crawling capabilities. I've been working with the buddycloud guys and it's been great.

We have weekly meetings via a MUC (seehaus@channels.buddycloud.com).
Project details can be viewed on GitHub (https://github.com/buddycloud/channel-directory) and on the buddycloud wiki (http://buddycloud.org/wiki/Main_Page).
And now I am also using this blog for progress reports.

This project can be split into three main phases: the search engine, the recommendation engine and the crawler. So far, I've finished the first phase, which consists of implementing an XMPP component that enables a client to search for nearby channels, channels' metadata and posts' content. For this part of the project, we used the Apache Solr search engine (http://lucene.apache.org/solr/), which indexes documents in a well-defined XML format. All I had to do was set up two Solr cores, one for channels and another for posts. Then the crawler gathers new/updated content from the channel server firehose and commits it to Solr.
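For readers unfamiliar with Solr's XML format, the sketch below builds an `<add>` update message for a single channel document in Python. The `<add>/<doc>/<field name="...">` shape is Solr's standard XML update syntax, but the field names used in the usage note (`jid`, `title`, `description`) and the function name are hypothetical and not taken from the project's schema.

```python
import xml.etree.ElementTree as ET


def build_solr_add(doc_fields):
    """Serialize one document as a Solr XML <add> update message.

    doc_fields maps field names to values; the names must match
    whatever the target core's schema defines.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")
```

For example, `build_solr_add({"jid": "/channel/food", "title": "Food", "description": "Recipes & more"})` produces a message that could be posted to the channels core's update handler, followed by a commit. A posts core would receive the same kind of message with its own fields.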

An architecture diagram may help:





As a nice side effect of writing the code that commits documents to Solr, we already have part of the crawler mechanism implemented.

The prototype for the search engine is fully functional, and can be used for client testing, as long as you have some way to commit documents into Solr. Now I am studying Mahout for the recommendation engine. It seems we are on the right track to have it running sooner than expected.

Wednesday, March 30, 2011

Absence

Been out for a while, right?
Anyways, this blog was intended to report my GSoC 2010 project updates.
Since my mentor asked me to report my progress in another fashion, this thing was kind of abandoned :(

For those interested in how last year's project went, here are a few links:

As you can see, it was hard work indeed (and it still is: my patch is currently under review).

But I am coming back! I am thinking of applying to XSF this time. They posted a great idea that really caught my eye: the Open Social Location Server (http://wiki.xmpp.org/web/Summer_of_Code_2011_Project_Ideas).

This project involves XMPP, Geolocation and DB scaling techniques.

Now it is time to write the proposal.
Wish me luck!

Thursday, April 29, 2010

GSoC2010: Accepted!

ZooKeeper Failure Detector Model proposal accepted!

There was some press coverage of my GSoC acceptance at my university:
http://bit.ly/c91XfW (in Portuguese)

Thursday, April 1, 2010

GSoC 2010: ZooKeeper Failure Detector Model

I had almost given up on finding a GSoC project that really made me say "Oh God, I really want to do that". However, a lab colleague told me ZooKeeper was looking for someone to abstract the failure detector model in their code and to implement a better-tuned failure detector.

After reading the issue in the Apache issue tracker, I became convinced that it would be a very, very interesting project, since I have worked on similar things (also implemented in Java) in the OurGrid code.

This week I got in touch with Henry Robinson -- the possible mentor for this project -- and he pointed me to some classes in the ZooKeeper code to look at first.

The next step is to write the project proposal. I hope I get accepted =)