Monday, July 25, 2011

Crawling

I just finished writing the pubsub crawler. Was it difficult? No. Actually, it was pretty easy with the Smackx PubSub API.

How does it work? First, the user declares the list of pubsub servers to be crawled in the properties file. The crawler then sends a disco#info packet to each of these servers, asking for its public channels.

After getting the channel list, the crawler invokes several NodeCrawlers, which crawl information from each node. Currently we crawl node metadata, items (posts) and subscribers (followers). This data will feed our Solr engine and a PostgreSQL database that Mahout will use for user recommendation.
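The flow described above can be sketched roughly as follows. This is only a minimal Python illustration, not the actual crawler (which uses the Smackx PubSub API): `crawl`, `CrawlResult`, `discover_nodes` and the per-kind crawler callables are hypothetical names, and the network calls are left as pluggable functions so the sketch stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class CrawlResult:
    """Everything collected for one node on one pubsub server."""
    server: str
    node: str
    data: Dict[str, object] = field(default_factory=dict)


# Hypothetical shape of a node crawler: (server, node) -> whatever it fetched.
NodeCrawler = Callable[[str, str], object]


def crawl(
    servers: Iterable[str],
    discover_nodes: Callable[[str], Iterable[str]],
    node_crawlers: Dict[str, NodeCrawler],
) -> List[CrawlResult]:
    """Discover the nodes of each server, then run every crawler on each node."""
    results: List[CrawlResult] = []
    for server in servers:
        # Assumption: discover_nodes wraps the discovery request to the server.
        for node in discover_nodes(server):
            result = CrawlResult(server=server, node=node)
            for kind, crawler in node_crawlers.items():
                # e.g. kind is "metadata", "items" or "subscribers"
                result.data[kind] = crawler(server, node)
            results.append(result)
    return results
```

In a real setup, each collected `CrawlResult` would then be handed to whatever indexes it in Solr and stores it in PostgreSQL; that part is left out here.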

The crawler is ready, and now I am looking forward to testing it against the new buddycloud channel server!

Friday, July 8, 2011

buddycloud's search engine is up and running!

As the title says, we have put the search engine to work :)

The prototype returns data from a snapshot of the buddycloud database, provided by the buddycloud team. Available queries cover nearby channels, channel metadata and post content, as well as channel recommendation for users and channel similarity. The syntax of these queries (and their responses) can be found on the Channel Directory protocol page.

The XMPP component is running at search.buddycloud.com. Please try it! Some example queries (and their semantics) follow:

Querying for the 10 nearest channels from lat/lon (45.4, 12.3):

<iq id="a65d24" to="search.buddycloud.com" from="abmar@abmar-pc" type="get">
  <query xmlns="http://buddycloud.com/channel_directory/nearby_query">
    <point lat="45.4" lon="12.3"/>
    <set xmlns="http://jabber.org/protocol/rsm">
      <max>10</max>
    </set>
  </query>
</iq>

Querying for the first 10 channels that contain 'Beer' in their metadata:

<iq id='oseq987e' to='search.buddycloud.com' from='abmar@abmar-pc' type='get'>
  <query xmlns='http://buddycloud.com/channel_directory/metadata_query'>
    <search>Beer</search>
    <set xmlns='http://jabber.org/protocol/rsm'>
      <max>10</max>
    </set>
  </query>
</iq>

Querying for the last 5 posts that contain 'Justin Bieber':

<iq id='oss09dje' to='search.buddycloud.com' from='abmar@abmar-pc' type='get'>
  <query xmlns='http://buddycloud.com/channel_directory/content_query'>
    <search>Justin bieber</search>
    <set xmlns='http://jabber.org/protocol/rsm'>
      <max>5</max>
    </set>
  </query>
</iq>

Asking for 5 channel recommendations for 'simon@buddycloud.com':

<iq id='oseq987e' to='search.buddycloud.com' from='abmar@abmar-pc' type='get'>
  <query xmlns='http://buddycloud.com/channel_directory/recommendation_query'>
    <user-jid>simon@buddycloud.com</user-jid>
    <set xmlns='http://jabber.org/protocol/rsm'>
      <max>5</max>
    </set>
  </query>
</iq>

Asking for 10 similar channels to '/channel/food':

<iq id='oseq987e' to='search.buddycloud.com' from='abmar@abmar-pc' type='get'>
  <query xmlns='http://buddycloud.com/channel_directory/similar_channels'>
    <channel-jid>/channel/food</channel-jid>
    <set xmlns='http://jabber.org/protocol/rsm'>
      <max>10</max>
    </set>
  </query>
</iq>
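
All five queries above share the same shape: an iq of type "get", a query element whose namespace selects the kind of search, one child carrying the search argument, and an RSM set with the maximum number of results. As a rough sketch, such stanzas could be generated with a small Python helper. `build_directory_query` and its parameters are hypothetical names made up for this illustration; the namespaces and element names are copied from the examples above, and actually sending the stanza over an XMPP connection is not covered.

```python
import xml.etree.ElementTree as ET

# Namespace prefix and RSM namespace as they appear in the example stanzas.
DIRECTORY_NS = "http://buddycloud.com/channel_directory/"
RSM_NS = "http://jabber.org/protocol/rsm"


def build_directory_query(iq_id, sender, query_type, child_tag, max_results,
                          child_text=None, child_attrib=None,
                          to="search.buddycloud.com"):
    """Return a search query stanza as a string (hypothetical helper)."""
    iq = ET.Element("iq", {"id": iq_id, "to": to, "from": sender, "type": "get"})
    # query_type is the last part of the namespace, e.g. "nearby_query".
    query = ET.SubElement(iq, "query", {"xmlns": DIRECTORY_NS + query_type})
    # The single argument element: <point/>, <search>, <user-jid> or <channel-jid>.
    child = ET.SubElement(query, child_tag, dict(child_attrib or {}))
    if child_text is not None:
        child.text = child_text
    # Result Set Management block limiting the number of results.
    rsm = ET.SubElement(query, "set", {"xmlns": RSM_NS})
    ET.SubElement(rsm, "max").text = str(max_results)
    return ET.tostring(iq, encoding="unicode")
```

For instance, `build_directory_query("a65d24", "abmar@abmar-pc", "nearby_query", "point", 10, child_attrib={"lat": "45.4", "lon": "12.3"})` should produce a stanza equivalent to the nearby query, and passing `"metadata_query"`, `"search"` and `child_text="Beer"` gives the metadata one.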