How does it work? First, the user declares a list of pubsub servers to crawl in the properties file. The crawler then sends a disco#info packet to each of those servers, asking for their public channels.
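To make the first step concrete, here is a minimal sketch of what such a service-discovery IQ stanza could look like on the wire. The JIDs are hypothetical placeholders (the real crawler reads its server list from the properties file), and this is only an illustration of the stanza shape, not the crawler's actual code:

```python
import xml.etree.ElementTree as ET

def build_disco_query(from_jid, to_server):
    """Build a service-discovery IQ stanza addressed to a pubsub server.

    from_jid and to_server are placeholder values for illustration only.
    """
    iq = ET.Element("iq", {"type": "get", "from": from_jid,
                           "to": to_server, "id": "disco1"})
    # The disco#info namespace identifies this as a service-discovery query.
    ET.SubElement(iq, "query",
                  {"xmlns": "http://jabber.org/protocol/disco#info"})
    return ET.tostring(iq, encoding="unicode")

stanza = build_disco_query("crawler@example.com", "channels.example.com")
print(stanza)
```

The server's reply to a query like this is what gives the crawler its list of channels to hand off to the next stage.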
After getting that channel list, the crawler invokes several NodeCrawlers, each of which extracts information from a node. Currently we crawl node metadata, items (posts), and subscribers (followers). This data feeds our Solr engine and a PostgreSQL database that Mahout will use for user recommendation.
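The fan-out described above can be sketched roughly as follows. This is a simplified model with hypothetical class and channel names, not the crawler's real implementation: each stub returns canned data where the real NodeCrawlers would query the pubsub server over XMPP and write the results to Solr and PostgreSQL:

```python
class NodeCrawler:
    """Base class: each subclass extracts one kind of data from a node."""
    def crawl(self, node):
        raise NotImplementedError

class MetadataCrawler(NodeCrawler):
    def crawl(self, node):
        # Stub: the real crawler would fetch the node's metadata here.
        return {"title": f"{node} title"}

class ItemsCrawler(NodeCrawler):
    def crawl(self, node):
        # Stub: the real crawler would retrieve the node's posts here.
        return [f"{node}/post-1", f"{node}/post-2"]

class SubscribersCrawler(NodeCrawler):
    def crawl(self, node):
        # Stub: the real crawler would fetch the node's followers here.
        return [f"alice@{node}"]

def crawl_channels(channels):
    """Run every NodeCrawler against every discovered channel."""
    crawlers = {"metadata": MetadataCrawler(),
                "items": ItemsCrawler(),
                "subscribers": SubscribersCrawler()}
    return {ch: {name: c.crawl(ch) for name, c in crawlers.items()}
            for ch in channels}

result = crawl_channels(["topics.example.com"])
```

Splitting the work into one crawler per data type keeps each concern separate, so adding a new kind of extracted data only means adding another NodeCrawler.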
The crawler is ready, and I am now looking forward to testing it against the new buddycloud channel server!