Web Cloner Project Specification

 

The goal of this project is to build a Web Cloner.  A Web Cloner is a program that can be used to make a copy of a remote web site on a local file system.  The ability to copy a web site to a local file system enables users to browse the web even when they’re not connected to the Internet.  For example, before leaving on a business trip, an employee might use the Web Cloner to copy a web site to their laptop so that they can browse the site while on an airplane.

Background

This project requires that you have a basic understanding of HTML.  If you do not already know HTML, a good introduction to HTML can be found at http://www.w3c.org/MarkUp/Guide/.

Command-Line

Your program should accept the following command-line arguments:

 

cloner root-url output-directory

 

Your executable file must be named cloner (all lower-case).  The root-url is the URL of the web site that is to be copied.  The output-directory is the name of the directory on the local file system to which the web site should be copied.  For example, the following command-line would copy the CS Department web site to a local directory named /web-sites/byu-cs:

 

cloner http://www.cs.byu.edu/index.html /web-sites/byu-cs

 

After your program completes, the user should be able to point their web browser at output-directory/index.html to browse the local copy of the cloned web site.

 

The output directory should already exist before the program is invoked.  If the directory does not exist, your program should print an informative error message and terminate.

Web Cloner Operation

Your program should download the root page of the web site and store it in a file named index.html in the output directory.  Either during or after the download of the root page, you should parse its contents and find all of the links to other documents.  All of the linked documents also need to be downloaded and stored in local files.  Unlike the root page, you are free to use any file names that you want for storing the local copies of the linked documents.  In order to keep the links in the local copy of the root page valid, you will also need to modify the links in it to reference the local files instead of the original documents on the web site.

 

The same basic algorithm should then be applied to all of the files linked to from the root page.

  1. Download the file
  2. Parse out the links in the file
  3. Update the links to point at the local files that will contain the local copies of the linked documents
  4. Download the linked documents to the appropriate local files, and apply the same algorithm to each of them

 

Your program should continue until all of the documents on the web site have been downloaded to the output directory and the links between the files have been rewritten to reference the local copies.  You will also need to make sure that your program handles cycles in the links so that it does not get into an infinite loop.

 

Although it is natural to think about web cloning as a recursive process, a recursive implementation of this algorithm is probably not ideal.  Links on a web site can go arbitrarily deep, which could make a recursive implementation susceptible to runtime stack overflow.  More importantly, a recursive implementation could be very confusing to debug.  Instead of using recursion, consider using a FIFO queue to store the documents that still need to be downloaded and processed.

Local File Structure

The goal of the Web Cloner is to create a local copy of a web site that is suitable for local browsing.  Your program must copy the various files that make up the web site to the local output directory, and link the local files together so that the links between the pages are functional.  Many of the files on a web site are HTML files, but there will also be many files containing images and other kinds of documents that also need to be copied.

 

The files on the web site being copied will usually be organized in a complex directory structure that lends itself to development and maintenance of the web site.  Although the local copy of the site must behave just like the real site, there is no requirement that the files in the local copy be structured in the same way as the files on the real site.  For example, it is perfectly acceptable to store all of the files for the local copy in the output directory itself without creating additional subdirectories.  Although it would be possible to mimic the file organization of the real site in the local copy, doing so is much more difficult than simply placing all of the local files in the same directory.  Since the local copy is intended for browsing and not for web site maintenance, placing all files in the same directory is not a problem.

Finding Links

Your program needs to be able to find links in HTML documents and process them appropriately.  Links in HTML documents are stored as attribute values on various kinds of tags.  While many HTML tags can contain links to other documents, for this project we will only be concerned with the following kinds of links:

 

  • <A>, <LINK>, and <AREA> tags have an attribute named HREF that contains the URL of a linked document
  • <IMG> tags have an attribute named SRC that contains the URL of an image file
  • <FRAME> tags have an attribute named SRC that contains the URL of the document that contains the frame’s contents

 

When parsing HTML documents, your program should look for <A>, <LINK>, <AREA>, <IMG>, and <FRAME> tags that have link attributes as previously described.  All other HTML tags should be ignored.  Your program should also skip over HTML comments, which begin with <!-- and end with -->.

 

Attribute values inside HTML tags can be delimited with double-quotes, single-quotes, or no quotes at all.  Your program should handle all of these cases.  For example, all of the following are valid HTML tags:

 

<A href=http://www.cnn.com/>

<A href='http://www.cnn.com/'>

<A href="http://www.cnn.com/">

 

Your program should also properly handle whitespace characters when parsing HTML tags.  For example, all of the following are valid HTML tags:

 

<A href   =http://www.cnn.com/   >

<A href=   'http://www.cnn.com/'>

<A
   href   =
   "http://www.cnn.com/"
>

 

In HTML, the names of tags and attributes are case insensitive.  For example, the following HTML tags are equivalent:

 

<a href="http://www.cnn.com/">

<A HREF="http://www.cnn.com/">

<a HREF="http://www.cnn.com/">

Many HTML files on the Internet have small syntax errors in them.  While these files are technically incorrect, all web browsers are quite forgiving in ignoring such errors and they try to display such files the best that they can.  If your program is to be robust enough to work on real web sites, it will also need to be as forgiving as possible when it encounters files containing invalid HTML.  For the Web Cloner, this means that it would not be reasonable for the program to immediately terminate when it encounters an HTML file containing invalid syntax.  Although you might choose to skip over a file that contains invalid HTML syntax, it would be better to just skip over the invalid parts of such a file and try to process the parts that are valid.

Parsing Links

The links in HTML files are represented as URLs.  There are two kinds of URLs, absolute and relative.  Your program must handle both absolute and relative URLs, as described below.

Absolute URLs

An absolute URL fully specifies the complete address of the referenced document.  The general format of an absolute URL is:

 

<scheme>://<net_loc>/<path>;<params>?<query>#<fragment>

 

For example, consider the following absolute URL:

 

http://www.espn.com:80/basketball/nba/index.html;lang=engl?team=dallas#Roster

 

The parts of this URL are:

 

<scheme>      http

<net_loc>     www.espn.com:80

<path>        /basketball/nba/index.html

<params>      lang=engl

<query>       team=dallas

<fragment>    Roster

 

Your program must be able to parse absolute URLs according to the syntax defined above.  However, several parts of an absolute URL are optional, and will not appear in every URL.  Specifically, the <path>, <params>, <query>, and <fragment> parts might not be present.  If the <path> is missing, it is assumed to be "/".  If the <params>, <query>, and <fragment> parts are missing, they are empty.

 

The <scheme> and <net_loc> parts are case insensitive, but the <path>, <params>, <query>, and <fragment> parts are case sensitive.  This means that

 

http://www.cnn.com/index.html

HTTP://WWW.CNN.COM/index.html

 

are equivalent, but

 

http://www.cnn.com/index.html

http://www.cnn.com/INDEX.HTML

 

are not equivalent.

Relative URLs

Unlike absolute URLs, relative URLs do not fully specify the address of the referenced document.  Rather, they specify the address of the document relative to the address of the document containing the link (the base document).  Here are some examples of relative URLs:

 

<img src="/images/nasdaq.jpg">

<img src="./images/nasdaq.jpg">

<img src="../../images/nasdaq.jpg">

<a href="#HEADLINES">

<img src="images/nasdaq.jpg">

 

Before your program will be able to download a document whose address is specified by a relative URL, it will first need to convert (or resolve) the relative URL to an absolute URL that represents the document’s full address.  Resolving a relative URL is done by combining the absolute URL of the base document (the base URL) with the relative URL.

 

If the relative URL begins with "/", the absolute URL is constructed by prepending the <scheme> and the <net_loc> from the base URL to the relative URL.  For example,

 

Base URL:             http://www.cnn.com/news/financial/index.html

Relative URL:        /images/nasdaq.jpg

Resolved URL:     http://www.cnn.com/images/nasdaq.jpg

 

If the relative URL begins with "./", the URL is relative to the directory containing the base document.  For example,

 

Base URL:             http://www.cnn.com/news/financial/index.html

Relative URL:        ./images/nasdaq.jpg

Resolved URL:     http://www.cnn.com/news/financial/images/nasdaq.jpg

 

If the relative URL begins with "../", the URL is relative to the parent directory of the directory containing the base document.  For example,

 

Base URL:             http://www.cnn.com/news/financial/index.html

Relative URL:        ../images/nasdaq.jpg

Resolved URL:     http://www.cnn.com/news/images/nasdaq.jpg

 

Note that a relative URL can begin with any number of "../" prefixes, such as ../../../images/nasdaq.jpg.  Each "../" indicates that you should go up one more level in the directory hierarchy.

 

If the relative URL begins with "#", the URL is relative to the base document itself.  In this case, the fragment represents a specific location within the base document.  For example,

 

Base URL:             http://www.cnn.com/news/index.html?sym=WEST

Relative URL:        #HEADLINES

Resolved URL:     http://www.cnn.com/news/index.html?sym=WEST#HEADLINES

 

If the relative URL begins with something other than "/", "./", "../", or "#", the URL is relative to the directory containing the base document (just like when it starts with "./").  For example,

 

Base URL:             http://www.cnn.com/news/financial/index.html

Relative URL:        images/nasdaq.jpg

Resolved URL:     http://www.cnn.com/news/financial/images/nasdaq.jpg

External Links

Most web sites contain links to documents on other external web sites.  Your program should only download links that are internal to the web site specified on the command-line, and not download links that are external to that site.  The algorithm for distinguishing between internal and external links is as follows:

 

(1)  Compute the prefix of the root URL.  This is constructed by removing the file name (if any) from the original root URL.  For example, if the root URL is:

 

http://www.cnn.com/news/financial/index.html

 

the prefix of the root URL is:

 

http://www.cnn.com/news/financial/

 

(2)  If the link’s absolute URL (after resolution, if it’s relative) starts with the prefix of the root URL, then the link is internal.  Otherwise, the link is external, and should not be downloaded.  For example, given the base URL above, the following URLs would be internal:

 

http://www.cnn.com/news/financial/markets/nasdaq.html

http://www.cnn.com/news/financial/overview.html

http://www.cnn.com/news/financial/images/gifs/nyse.gif

 

and these links would be external:

 

http://www.cnn.com/news/index.html

mailto:webmaster@cnn.com

ftp://ftp.cnn.com/news/financial/index.html

 

Notice that non-HTTP URLs such as mailto: and ftp: URLs are always external.

Processing External Links

If a link is internal to the site, it should be downloaded and processed.  If the link is external, the manner in which the link should be processed depends on what kind of tag contains the link.

 

If the external link appears in a <A>, <FRAME>, or <AREA> tag, instead of downloading the document referenced by the link, your program should generate a local HTML file containing the HTML code shown below, and rewrite the external link to point to the local file.

 

<html>

   <head>

      <title>External Link</title>

   </head>

   <body>

      <h1>External Link</h1>

      <h1><a href="{external-url}">{external-url}</a></h1>

   </body>

</html>

 

where both occurrences of {external-url} should be replaced with the URL of the external link.

 

Generating these “External Link” pages ensures that the user is notified whenever they try to follow an external link, which can be very helpful when browsing off-line.  When the user lands on one of these pages, they can make an explicit decision to follow or not follow the external link.

 

If an external link appears in a <IMG> or <LINK> tag, just leave the external link as it appears in the original file.  As with all external links, there is no need to download the external document.

HTML vs. Non-HTML Documents

Your program should download all of the web site’s files, including both HTML and non-HTML documents.  However, it is only necessary to parse HTML files because they are the only files that may contain links.  Non-HTML files should be downloaded, but no further processing of these files is necessary.

 

Your program will need to distinguish between HTML and non-HTML files.  For the purposes of this project, a file is considered to be HTML if any of the following conditions hold:

 

(1)   The <path> part of the URL is a directory name (i.e., it ends with "/").  For example, http://www.espn.com/football/

 

(2)   The file name in the URL’s <path> does not end with a file extension (i.e., the file name contains no periods).  For example,  http://www.espn.com/football/scores

 

(3)   The file name in the URL’s <path> ends with one of the following extensions:  .html, .htm, .shtml, .cgi, .jsp, .asp, .aspx, .php, .pl, .cfm.  (You may add other file extensions to this list if you want.)  For example,  http://www.espn.com/football/scores/index.html

 

In the case of a relative URL, the URL should be resolved to an absolute URL before deciding whether the URL represents an HTML document.

Download Errors

Your program is likely to occasionally encounter errors when downloading documents from the web.  These errors may be due to invalid links that reference non-existent documents, or due to intermittent problems being experienced by the Internet or the web site.  Your program should deal with download errors in a robust fashion.  At a minimum it must print out a message stating that it failed to download the document at a particular URL and skip over the document.  An even better approach would be to retry the download a limited number of times with some period of time in between attempts in the hope that conditions will improve and the download will eventually succeed.

Efficiency and Data Structure Selection

Although your program will usually be run on fairly small web sites, it must be designed so that it will perform well even when it is run on large web sites.  You should assume that the internal data structures of your program could grow to be very large, and therefore you should choose data structures that support fast insert and search operations even when they become very large.

Limitations

The Web Cloner as defined in this specification will not handle all web sites found on the Internet.  We have made a number of simplifying assumptions that make it possible to complete this project within the allotted time period.  Your program need not handle anything that is not specifically mentioned in this specification.  However, you are certainly welcome to enhance your program so that it can handle additional HTML constructs and kinds of web sites.


Project Requirements

Your program must implement all of the features described in the previous sections.  In addition to the base functionality, you must also complete the following requirements.

Design Document

The first step in developing a larger program like the Web Cloner is to spend some time understanding the problem that is to be solved.  Once you understand the problem, you can start to design the classes that you will need by simulating the operation of the program in your mind, and creating classes that perform each of the functions required by the program.  For each class that you create, you should document what responsibilities the class has, and how it interacts with the other classes in your design to perform its responsibilities.  This exercise will help you determine what classes you need to write, and how those classes work together to produce a working program.  Doing this type of design work before you start coding will save time because it will help you avoid spending effort on dead-end ideas that don't work.

 

Once you've thought through your design the best that you can without writing any code, you should make a first attempt at implementing your design.  As your implementation progresses, you will probably find that your design was incomplete or faulty in some respects.  This is to be expected because some insights only come from actually implementing a design.  As you proceed, make necessary changes to your design, and incorporate them into your code.

 

To encourage you to follow this type of design process, you will be required to submit a design document for your program. Your design document must include three things:

 (1) A DETAILED description of the data structures that you will use to store the program's data.  Describe in detail what data needs to be stored and how it will be stored (e.g., binary trees, hash tables, queues, arrays, linked-lists, etc.).  Also explain why you chose the data structures that you did.

(2)  For each class in your design, document the following:

  • The name and purpose of the class
  • The name and purpose of each data member
  • The name and purpose of each method.  For each method, document each of its parameters and its return value.

The easiest way to document this information is to create a commented header file (.h file) for each of your classes using the style described in the Code Evaluation section.  Since you will have to create commented header files anyway, using this format for your design document will help you get a head start on your code.

(3)  A DETAILED description of how your program will work.  Describe how the objects in your design will work together to implement the web cloning function.  Describe the flow of your program from beginning to end, including initialization and the web cloning process itself.  Explain the core algorithms of your program, including how control will flow from one method to another.  Explain how you will handle the various error conditions.

You must turn in a hard-copy printout of your design document to the TAs before midnight on the due date.  Please make sure that your name is clearly visible on the front of your design document.  Design documents may not be submitted by email.

Web Access

Your program will obviously need to download web documents.  The CS 240 Utilities provide several classes that you are required to use to download web documents (URLConnection, InputStream, HTTPInputStream).  Be aware that these classes throw exceptions when errors occur, and your program must handle these exceptions (i.e., it can't just crash due to an unhandled exception).

Exception Handling

Your program must properly handle C++ exceptions. This means that it must not abnormally terminate due to an unhandled exception.

Memory Leaks

Your program must not have memory leaks.  All memory on the heap must be freed before the program terminates.

 

There is a Linux tool named valgrind that can be used to check a program for various kinds of memory management errors, including memory leaks.  The TAs will use valgrind to check your program for memory leaks.  You should also use valgrind while developing your program to find and remove errors.

 

To use valgrind, you should first compile and link your program using the -g flag, which tells the compiler and linker to include debugging information in your executable.  After doing that, valgrind may be executed as follows:

 

valgrind --tool=memcheck --leak-check=yes --show-reachable=yes executable param1 param2 …

 

Valgrind will print out messages describing all memory management errors that it detects in your program.  This is a valuable tool when debugging your program.   After your program completes, valgrind will print out information about any heap memory that was not deallocated before the program terminated.  You are required to remove all memory leaks before passing off.  You are not responsible for memory leaks in the standard C++ library or other libraries on which your program depends.  For example, the C++ string class allocates memory on the heap which it never deallocates.  Valgrind will report this as a memory leak, but you are not responsible for it.  You are only responsible for memory allocated directly by your program.

 

Please note that running your program with valgrind will cause your program to run much slower than usual.  Therefore, you might not want to run your program under valgrind every time.  Rather, run valgrind periodically during your program’s development to find and remove memory errors.  Do not wait until your program is completely done before running valgrind.

Standard Template Library (STL)

One of the learning objectives of this project is to give you significant experience with pointers and low-level memory management in C++.  Since the STL largely relieves the programmer of these responsibilities, you are not allowed to use the STL on this project.  Specifically, the following header files may not be used:

  • <algorithm>
  • <deque>
  • <list>
  • <map>
  • <queue>
  • <set>
  • <stack>
  • <vector>

Unit Tests

Every class/struct in your program must have a public method with the following signature

 

static bool Test(ostream & os);

 

that will automatically test the class/struct and verify that it works correctly.  The Test method on a class/struct should create one or more instances of the class/struct, call methods on those objects, and automatically check that the results returned by the methods are correct.  If all tests succeed, Test should return true.  If one or more tests fail, Test should return false.  For each test that fails, an informative error message that describes the test that failed should be written to the passed-in output stream.  This will tell you which tests failed so that you can fix the problems.

 

You must also write a test driver program that runs all of your automated test cases.  This program will be very simple, consisting only of a main function that calls all of the Test methods on your classes/structs.  Whenever you make a change to your classes/structs, you can recompile your test program and run it to make sure that the new code works and that it didn't break anything that was already there.

 

The file UnitTest.h in the CS240 Utilities contains code that is useful for creating automated test cases.

Static Library

You will be required to create a static library containing the CS 240 Utilities classes, and to link this library into your program.  In class we will teach you how to create static libraries using the Linux ar command, and how to link them into a program.  When you pass off your program, the TA will ask you to un-tar your source code and build your program.  Creating and linking the static library will be part of the build process.  Of course, you will also be asked to demonstrate that your program works.

Make File Automation

You will be required to automate the compilation and testing of your project using a make file.  Your make file should support the following functions:

  • build the static library containing the CS 240 Utilities classes
  • build the executable cloner program
  • compile and run your automated unit test cases
  • remove all of the files created by the build process

Your make file must recognize the following targets:

Target    Example         Meaning
------    -------         -------
lib       $ make lib      Compile the CS 240 Utilities classes, and package them
                          into a static library named libcs240utils.a.

bin       $ make bin      Compile and link your program.  This target depends on
                          the lib target.

test      $ make test     Compile and run your automated test program.  The test
                          program should contain a main function that simply calls
                          the Test methods on all of your classes.  If all tests
                          succeed, the program should print out a message indicating
                          success.  If one or more tests fail, the program should
                          print out a message indicating failure, and also print out
                          a message describing each test that failed.

clean     $ make clean    Remove all files and directories created by the other
                          targets.  This returns the project directory to its
                          original state.

Code Evaluation

After you have submitted and passed off your program, your source code will be graded on how well you followed the good programming practices discussed in class and in the textbook. The criteria for grading your source code can be found at the following link:


Code Evaluation Criteria

Important Advice

This is a big project. It will take the average student 50-60 hours to complete. You have 5 weeks to complete it, and you are strongly encouraged to get started immediately and work at a steady pace for the entire 5 weeks until you are done. If you procrastinate, it is highly unlikely that you will finish on time. Although you can pass it off late, every additional day you spend working on this project will be one less day that you have to work on the next project, which is approximately the same size as this one. Your success in this class will largely depend on how diligently you follow this advice.

Submitting Your Program

Create a gzip-compressed tar file containing all of your project's source files and directories. The name of the compressed tar file must have the following format:

 

firstname_lastname.tgz
 

where firstname should be replaced with your first name and lastname should be replaced with your last name. For example, if your name is Bob White, you would use the following command to create your tar file:

 

$ tar czf Bob_White.tgz webcloner-project

 

(this assumes that your project files are stored in a subdirectory named webcloner-project).

NOTE: In order to minimize the size of your tar file, please make sure that it only contains C++ source files, make files, and any data files that are needed by your test cases.  It should not include .o files, .a files, executable files, etc.  If your tar file is larger than 500kb, you will not be allowed to submit it.  If you follow these instructions, your tar file should be much smaller than 500kb.  If it is bigger than 500kb, you need to delete some files and recreate the tar file.

After you have created your tar file, click on the following link to go to the project submission web page:

Web Cloner Project Submission Page

After you provide your CS account login name and password and the name of your tar file, this page will upload your tar file to our server.

After submitting your tar file, in order to receive credit you must also pass off your project with a TA. When you pass off, you will be asked to download your tar file from our server using the project retrieval page, which is located at the following link:

Web Cloner Project Retrieval Page

After downloading your tar file, you will be asked to:

  1. un-tar your tar file
  2. compile and link your program, including the static library
  3. demonstrate that your program works (this includes showing that your program has no memory leaks)

If the TA finds problems with your program, you will need to fix your program, submit a new tar file containing your modified program, and then pass it off again with the TA.

Your tar file must be submitted before the deadline in order to receive full credit, but you may pass off your project with a TA after the deadline. If you do this, you run the risk that the TA might find problems with your program, and you will need to resubmit your tar file after the deadline. In this case, the fact that you submitted your first tar file before the deadline will not help you. The time of your last tar file submission will be used to compute your grade.