Announcing RDig
RDig aims to be an easy-to-use tool for building and searching a full-text index of the contents of a web site. It consists of an HTTP crawler and facilities for extracting textual content from HTML pages, which is then indexed using the great Ferret full-text search engine.
I initially wrote this to implement the site search feature of a website where most of the content consists of static HTML pages generated by a CMS, and some dynamic features of the site are implemented in Rails.
Basically, RDig takes a start URL and a number of host names to limit the crawling to, and then starts crawling the site. It comes with an executable that can be used to regularly rebuild the index, e.g. triggered by cron.
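To make the idea of a host-limited crawl concrete, here is a minimal sketch in plain Ruby. It is not RDig's actual code: the `crawl` method and the `fetch_links` callable are hypothetical names made up for this illustration, and a real crawler would also fetch pages over HTTP and extract the links and text from them.

```ruby
require 'set'
require 'uri'

# Hypothetical helper, not part of RDig: breadth-first crawl that only
# visits URLs whose host is in allowed_hosts.
# fetch_links is any callable returning the link URLs found on a page.
def crawl(start_url, allowed_hosts, fetch_links)
  allowed = allowed_hosts.to_set
  visited = Set.new
  queue = [start_url]

  until queue.empty?
    url = queue.shift
    next if visited.include?(url)

    # Unparseable URLs get a nil host and are skipped below.
    host = begin
      URI.parse(url).host
    rescue URI::InvalidURIError
      nil
    end
    next unless allowed.include?(host)

    visited << url
    fetch_links.call(url).each do |link|
      queue << link unless visited.include?(link)
    end
  end

  visited.to_a
end
```

Passing the link extraction in as a callable keeps the sketch independent of any HTTP library, so it can be tried against an in-memory hash of pages before pointing it at a real site.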
Searches are then executed with a simple
RDig.searcher.search('your query string here')
from within the web app.
Have a look at the RDocs for further information. Installation should be as simple as
gem install rdig
once the gem has propagated through RubyForge’s mirrors. This is the first piece of software I have released as a gem, so please notify me of any problems you encounter.