RDig is a tiny web and file system crawler built on top of the Ferret search engine. It’s one of my less active side projects and from what I can tell doesn’t have a very large user base. However there are some people out there who actually use it, and some of those people even tell me so and suggest new features from time to time :-)
Limit crawling depth
You can now configure a maximum crawling depth to restrict RDig to only index pages up to this level. For example, setting
config.crawler.max_depth = 1 will make RDig only index the configured start pages, and pages the start pages directly link to. You get the picture I guess.
This option is especially useful if restricting RDig to a pre-defined number of hosts is not an option for your use case, but you still don’t intend to have it crawl the whole web.
HTTP proxy auth support
If you are behind a proxy and have to use HTTP Basic Authentication with it to get through, you can specify proxy url, user name and password:
cfg.crawler.http_proxy = "http://yourproxy:8080"
cfg.crawler.http_proxy_user = "username"
cfg.crawler.http_proxy_pass = "secret"
Under the hood
I put some work into refactoring parts of RDig in order to make integration with acts_as_ferret easier. I’ll write more about that in another post.
RDig is available as a gem via Rubyforge.