Harvest-NG is a collection of Perl modules and scripts which provide a powerful web crawling and summarising agent. The code aims to provide an open source, standards-compliant tool for fetching content from a wide variety of information sources, summarising it into a set of resource descriptions, and storing these in an easily accessible database from which search services can be built and statistical information compiled.
I-Spy is a Perl script which identifies new files on various remote FTP and Web sites. It fetches the contents of FTP directories and web pages and compares them. It then compiles a report and either sends it via e-mail or saves it as a web page. You may also request both forms of delivery.
For e-mail reports, you may request plain text or HTML. I-Spy logs its activity as it runs. You may specify the log directory, or I-Spy will try to find one automatically. For web page reports, I-Spy will attempt to store the log in a location where it can be referenced by the report and served by the web server.
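The comparison step described for I-Spy can be illustrated with a minimal sketch. This is not I-Spy's code (I-Spy is a Perl script); it is a Python illustration that assumes each site listing is already available as a list of file names. The function names find_new_files and format_report and the site name ftp.example.org are hypothetical.

```python
def find_new_files(previous, current):
    """Return names in the current listing that were absent from the previous one."""
    # Set difference keeps only names seen for the first time; sort for a stable report.
    return sorted(set(current) - set(previous))


def format_report(site, new_files):
    """Build a plain-text report for one site (hypothetical layout)."""
    lines = [f"New files on {site}:"]
    # Fall back to a placeholder line when nothing new was found.
    lines += [f"  {name}" for name in new_files] or ["  (none)"]
    return "\n".join(lines)


if __name__ == "__main__":
    old_listing = ["README", "tool-1.0.tar.gz"]
    new_listing = ["README", "tool-1.0.tar.gz", "tool-1.1.tar.gz"]
    print(format_report("ftp.example.org", find_new_files(old_listing, new_listing)))
```

A real tool would also have to fetch the listings and persist the previous snapshot between runs; both are left out here.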
This is a proof-of-concept of a tool to automate web browsing and data collection. It works like AWK, except that instead of operating on files and lines it operates on HTML pages and hyperlinks. It is meant to be run as a command-line script and makes the following values available: base_url (the URL the script was initially invoked on), base_path (the root of the saved data tree), url (the current URL being processed), linked_from (the parent of the current URL), and content (the actual data corresponding to the current URL).
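The AWK-like model described above, in which a rule runs once per page rather than once per line, might look roughly like the following Python sketch. It is not the tool's own code. To stay runnable offline it assumes the pages are supplied as an in-memory mapping of URL to HTML instead of being fetched, and the names LinkCollector, extract_links, and crawl are hypothetical. Only the five record fields come from the description.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(url, content):
    """Return the absolute URLs of all hyperlinks found in content."""
    parser = LinkCollector()
    parser.feed(content)
    return [urljoin(url, href) for href in parser.hrefs]


def crawl(pages, base_url, base_path, rule):
    """Call rule(record) once per reachable page, breadth-first.

    pages is an assumed stand-in for fetching: a dict mapping URL to HTML.
    """
    seen = {base_url}
    queue = deque([(base_url, None)])
    while queue:
        url, linked_from = queue.popleft()
        content = pages.get(url)
        if content is None:
            continue  # page not available in this offline sketch
        rule({
            "base_url": base_url,
            "base_path": base_path,
            "url": url,
            "linked_from": linked_from,
            "content": content,
        })
        for link in extract_links(url, content):
            if link not in seen:
                seen.add(link)
                queue.append((link, url))
```

A rule is then an ordinary function of the record, for example `lambda r: print(r["url"], "<-", r["linked_from"])`, much as an AWK action sees the current line through its built-in variables.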