RDig 0.3.5

February 26, 2008 | rdig, ferret, ruby

RDig is a tiny web and file system crawler built on top of the Ferret search engine. It’s one of my less active side projects and, from what I can tell, doesn’t have a very large user base. However, there are some people out there who actually use it, and some of those people even tell me so and suggest new features from time to time :-)

Limit crawling depth

You can now configure a maximum crawling depth so that RDig only indexes pages up to that level. For example, setting cfg.crawler.max_depth = 1 makes RDig index only the configured start pages and the pages those start pages link to directly. You get the picture, I guess.

This option is especially useful if restricting RDig to a pre-defined set of hosts is not an option for your use case, but you still don’t want it to crawl the whole web.
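To make the depth semantics concrete, here is a minimal sketch of a depth-limited breadth-first crawl over a made-up in-memory link graph. It is not RDig's actual implementation; the LINKS hash and the crawl method are hypothetical names used only for illustration, with start pages counted as depth 0.

```ruby
require 'set'

# Hypothetical link graph standing in for real pages (not part of RDig).
LINKS = {
  "start" => ["a", "b"],
  "a"     => ["c"],
  "b"     => [],
  "c"     => ["d"],
  "d"     => []
}

# Breadth-first crawl that stops following links once max_depth is reached.
# Start pages are at depth 0; a max_depth of nil means "no limit".
def crawl(start_urls, links, max_depth)
  seen    = Set.new(start_urls)
  queue   = start_urls.map { |url| [url, 0] }
  indexed = []

  until queue.empty?
    url, depth = queue.shift
    indexed << url

    # Pages at the maximum depth are indexed, but their links are not followed.
    next if max_depth && depth >= max_depth

    links.fetch(url, []).each do |target|
      next if seen.include?(target)
      seen << target
      queue << [target, depth + 1]
    end
  end

  indexed
end

p crawl(["start"], LINKS, 1)
```

With a maximum depth of 1, only the start page and the two pages it links to directly are visited; "c" and "d" are never reached.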

HTTP proxy auth support

If you are behind a proxy that requires HTTP Basic Authentication, you can specify the proxy URL, user name, and password:

cfg.crawler.http_proxy = "http://yourproxy:8080"
cfg.crawler.http_proxy_user = "username"
cfg.crawler.http_proxy_pass = "secret"
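For readers wondering what these three settings amount to at the HTTP level, the following sketch shows how a proxy URL plus credentials can be handed to Ruby's standard Net::HTTP. This is an assumption about the general technique, not a copy of RDig's internals; the http_via_proxy helper and the example.com target are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a Net::HTTP object that talks to target_host
# through an authenticating proxy. No connection is opened here.
def http_via_proxy(target_host, target_port, proxy_url, proxy_user, proxy_pass)
  proxy = URI.parse(proxy_url)
  # Net::HTTP.new accepts the proxy address, port, user and password
  # as its third to sixth arguments.
  Net::HTTP.new(target_host, target_port,
                proxy.host, proxy.port, proxy_user, proxy_pass)
end

http = http_via_proxy("example.com", 80,
                      "http://yourproxy:8080", "username", "secret")

puts http.proxy?         # whether requests will go through a proxy
puts http.proxy_address  # host part of the proxy URL
puts http.proxy_port     # port part of the proxy URL
```

Requests made through such an object carry the credentials to the proxy as Basic authentication, which is the same effect the three configuration lines are meant to have.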

Under the hood

I put some work into refactoring parts of RDig in order to make integration with acts_as_ferret easier. I’ll write more about that in another post.

Get it!

RDig is available as a gem via RubyForge (gem install rdig).