Web Robot Filter

Overview

A web robot is a program that recursively downloads web pages for the purpose of processing the downloaded pages in some useful way.  For example, Google has many computers running robots that constantly download web pages and index the keywords that appear on those pages.  Although services such as Google are extremely useful, from the perspective of a web site administrator, robots can cause problems.  Robots can significantly increase the load placed on a web site while its pages are being automatically downloaded (programs are much faster at downloading pages than people are), which may decrease the web site’s performance for regular users.  In addition, it might be undesirable to include some parts of a web site in a keyword index such as Google.  For reasons such as these, web site administrators need control over whether, when, and how robots access their web sites.

To control robot behavior, a web site administrator can put a special file named robots.txt on their web site.  The robots.txt file contains instructions to robots that restrict which parts of the web site a robot may access.  Well-behaved robots are required to download a web site’s robots.txt file before accessing the site, and comply with all restrictions expressed therein.  (Please read the robots.txt specification for full details on the contents of the robots.txt file.)
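For illustration, a small exclusion file might look like the following (the robot name and paths are hypothetical).  The first record applies only to a robot named ExampleBot, while the second record (User-agent: *) applies to all other robots.

# robots.txt for a hypothetical web site
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/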

Project Description

For this project you will write a program named robot-filter that can determine whether a web robot is allowed to download a particular URL (a simple yes/no question).  Your program should support the following command-line syntax:

robot-filter <exclusion-file> <url-path>

robot-filter is the name of the executable program.

exclusion-file is the name of a local file that conforms to the robots.txt specification.  This could be a robots.txt file that you downloaded from an actual web site, or one that you created by hand.

url-path is the URL in question.  Is a robot allowed to download this URL, or not?  It is called a “URL path” because it only includes the part of the URL that comes after the http://hostname:port part.  This is convenient because URL prefixes in robots.txt files are represented in this same format.
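For example, given the hypothetical exclusion file shown in the Overview, the following invocation asks whether a robot may download the page at /cgi-bin/search:

robot-filter robots.txt /cgi-bin/search

Because /cgi-bin/search begins with the excluded prefix /cgi-bin/ in the default record, the answer is negative and the program would print NO.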

To determine if the URL may be downloaded, your program should use the exclusions specified in the default record of the exclusion file.  The default record is the record containing User-agent: *.

Your program should perform the following steps (a sketch of this overall flow appears after the list):

1) Retrieve the exclusion file name and URL path from the command-line

   a. If the command-line arguments are invalid (too many or too few), display an appropriate error message and exit

2) Open the exclusion file

   a. If the file cannot be opened, display an appropriate error message and exit

3) Parse the exclusion file, looking for the default record

   a. If the exclusion file does not contain a default record, all URLs should be considered accessible

4) Parse all of the exclusions in the default record and store them in a data structure

5) Search the exclusion list to determine if the specified URL path may be accessed by a robot

   a. If the answer is affirmative, display “YES” and return 0 from main

   b. If the answer is negative, display “NO” and return -1 from main

6) If the program fails for any reason (invalid command-line arguments, can’t open exclusion file, etc.), return a non-zero value other than -1 from main to indicate failure
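The following is a minimal sketch of this overall flow.  The startsWith helper is hypothetical, and the parsing of the exclusion file (steps 3 and 4) is left as a placeholder; it only illustrates the argument checking, output, and exit codes described above.

#include <iostream>
#include <fstream>
#include <cstring>
using namespace std;

// Hypothetical helper: a Disallow value is a plain prefix of a URL path,
// so the accessibility test in step 5 is a prefix comparison.
bool startsWith(const char * urlPath, const char * prefix) {
    return strncmp(urlPath, prefix, strlen(prefix)) == 0;
}

int main(int argc, char * argv[]) {
    // Step 1: exactly two arguments are expected.
    if (argc != 3) {
        cerr << "Usage: robot-filter <exclusion-file> <url-path>" << endl;
        return 1;                  // failure value other than -1 (step 6)
    }

    // Step 2: open the exclusion file.
    ifstream input(argv[1]);
    if (!input) {
        cerr << "Could not open " << argv[1] << endl;
        return 1;                  // failure value other than -1 (step 6)
    }

    // Steps 3 and 4 (finding the default record and collecting its
    // Disallow prefixes) are omitted from this sketch.  A real
    // implementation would set excluded by calling startsWith(argv[2],
    // prefix) for each stored Disallow prefix.
    bool excluded = false;         // placeholder for the step 5 result

    // Step 5: report the answer with the required output and exit codes.
    if (excluded) {
        cout << "NO" << endl;
        return -1;
    }
    cout << "YES" << endl;
    return 0;
}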

Restrictions

You are not allowed to use the C++ string class or the classes and functions from the Standard Template Library (vector, list, map, set, etc.).  This means that the header files that declare them, such as <string>, <vector>, <list>, <map>, and <set>, may not be included.

Additional Notes

Many robots.txt files on web sites contain fields other than User-agent: and Disallow:.  Your program should ignore such non-standard fields.
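One way to do this is to split each line into a field name and a value, and then ignore any line whose field name is not User-agent or Disallow.  A minimal sketch of such a helper follows (the name splitField is hypothetical):

#include <cstring>
#include <cctype>

// Splits a "Field: value" line in place at the first ':'.  Returns a
// pointer to the value with leading whitespace skipped, or NULL if the
// line contains no ':'.  The caller can then ignore any field name
// other than User-agent or Disallow.
char * splitField(char * line) {
    char * colon = strchr(line, ':');
    if (colon == NULL) {
        return NULL;                     // not a field line
    }
    *colon = '\0';                       // line now holds only the field name
    char * value = colon + 1;
    while (isspace((unsigned char)*value)) {
        ++value;
    }
    return value;
}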

Some robots.txt files contain case errors in field names (e.g., User-Agent: instead of User-agent:).  Your program should handle such case errors by doing case-insensitive string comparisons.
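Since the string class is off limits, one way to do this is a small case-insensitive comparison over C strings, sketched below (the name equalsIgnoreCase is hypothetical):

#include <cctype>

// Returns true if the two C strings are equal ignoring case, so that
// "User-Agent:" and "user-agent:" both match the User-agent: field.
bool equalsIgnoreCase(const char * a, const char * b) {
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) {
            return false;
        }
        ++a;
        ++b;
    }
    return *a == *b;                     // both strings must end together
}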

Be sure to handle line termination precisely as described in the robots.txt specification.  A line may be terminated in any of three ways: "\r", "\r\n", or "\n".  Your program must handle all three cases correctly.
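Below is a sketch of one way to read a line under these rules into a fixed-size, caller-supplied buffer (readLine is a hypothetical helper):

#include <istream>

// Reads one line from input into buffer (at most maxLen - 1 characters),
// treating "\r", "\n", and "\r\n" each as a single line terminator.
// Returns false when no input remains.
bool readLine(std::istream & input, char * buffer, int maxLen) {
    int ch = input.get();
    if (ch < 0) {
        return false;                    // end of input
    }
    int i = 0;
    while (ch >= 0 && ch != '\r' && ch != '\n') {
        if (i < maxLen - 1) {
            buffer[i++] = (char)ch;
        }
        ch = input.get();
    }
    // A '\r' immediately followed by '\n' is one terminator, not two.
    if (ch == '\r' && input.peek() == '\n') {
        input.get();
    }
    buffer[i] = '\0';
    return true;
}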

Be sure to handle comments precisely as described in the robots.txt specification; a sketch of one way to do this follows the list below.

1) A '#' character marks the beginning of a comment.  All whitespace directly preceding the '#' and the remainder of the line should be discarded

2) Lines containing only a comment (possibly preceded by whitespace) should be discarded completely, and do not indicate a record boundary
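The following is a minimal sketch of one way to apply these two rules, modifying the line in place (stripComment is a hypothetical helper; a line that becomes empty can then be discarded by the caller):

#include <cstring>
#include <cctype>

// Removes a trailing comment, and the whitespace directly preceding the
// '#', from a line.  A comment-only line becomes an empty string, which
// the caller should discard without treating it as a record boundary.
void stripComment(char * line) {
    char * hash = strchr(line, '#');
    if (hash == NULL) {
        return;                          // no comment on this line
    }
    char * end = hash;
    while (end > line && isspace((unsigned char)*(end - 1))) {
        --end;                           // back up over preceding whitespace
    }
    *end = '\0';
}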

Implementation Advice

If you will be doing Web Crawler or Web Cloner this semester, your robot filter code will become part of your first big project.  To make your robot filter code more reusable, it is recommended that you implement your robot filter as a C++ class.  For example, you might write a class with a public interface similar to the following:

class RobotFilter {
public:
    RobotFilter() { … }
    bool LoadExclusionFile(istream & input) { … }
    bool IsUrlPathExcluded(const char * urlPath) { … }
};

Such a class can be easily incorporated into a later project.  Whatever you do, realize that your robot filter code will be reused, and doing a good job on it now will save you work later.

Passing Off

To pass off your program, find a TA in the southwest corner of the TMCB basement (1058 TMCB).  They will run a set of test cases on your program to verify that it works properly.  You may run these test cases yourself before passing off by running the following command from the directory containing your robot-filter executable:

~cs240ta/bin/robotest