Computer Science 235

Spelling Checker


Note: Projects are to be completed by each student individually (not by groups of students).

The spelling checker reads the words in a document and outputs a list of misspelled words. For each misspelled word, the program outputs the word followed by a list of the line numbers where the word is found in the document.

The program decides which words are misspelled by using a dictionary of words. The program searches the dictionary for each word in the document. If a word from the document cannot be found in the dictionary, the word is added to the list of misspelled words.

Example Inputs

The spelling checker is given two inputs: a dictionary and a document to be checked.

Dictionary

if
is
in
of
be
bugs
the
then
them
must

Document

If debugging is the
process of removing bugs.
Then programming must be the
process of putting them in.

Example Output

The spelling checker outputs the misspelled words that were found in the document.

debugging: 1
process: 2 4
programming: 3
putting: 4
removing: 2

Testing

Here are some ideas for tests.

  1. An empty dictionary file.
  2. An empty document file.
  3. A document with no misspelled words.
  4. A misspelled word repeated multiple times on one line.
  5. A dictionary that contains upper-case letters.
  6. A document that contains hyphens or other punctuation.

Dictionary File Format

The spelling checker uses a dictionary of words. The dictionary is a text file that contains a list of words, one word per line. When comparing words in the document with words in the dictionary, use a comparison that is not case-sensitive.
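As a rough illustration of the dictionary format, the sketch below loads dictionary lines into a set of lower-cased words so that later lookups ignore case. The function name `load_dictionary` is hypothetical, and skipping blank lines is an assumption that the format description does not state.

```python
def load_dictionary(lines):
    """Build a set of lower-cased words from an iterable of dictionary lines."""
    words = set()
    for line in lines:
        # Remove the trailing newline and fold case so lookups ignore case.
        word = line.strip().lower()
        if word:  # Assumption: blank lines are ignored.
            words.add(word)
    return words


# Usage with an in-memory list; an open file object works the same way.
dictionary = load_dictionary(["If\n", "is\n", "BUGS\n", "\n"])
print(sorted(dictionary))  # ['bugs', 'if', 'is']
```

Folding every dictionary word to lower case once, at load time, means each document word only needs to be lower-cased before the lookup.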

Document Files

The document to be checked is a plain text file. The spelling checker reads the file and extracts each word from it. What's the definition of a word in the document file? A word is a sequence of letters ('a-z' and 'A-Z'); any character that is not a letter separates one word from the next. Digits, hyphens, apostrophes, and other punctuation therefore all act as separators.
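One possible way to apply this definition of a word is a regular expression that matches runs of letters. This is only a sketch; the names `WORD_PATTERN` and `extract_words` are hypothetical, and a character-by-character scan would work equally well.

```python
import re

# A word is one or more consecutive letters; everything else is a separator.
WORD_PATTERN = re.compile(r"[A-Za-z]+")


def extract_words(line):
    """Return the words found in one line of text, in order of appearance."""
    return WORD_PATTERN.findall(line)


print(extract_words("process of removing bugs."))
# ['process', 'of', 'removing', 'bugs']
print(extract_words("well-known, isn't it?"))
# ['well', 'known', 'isn', 't', 'it']
```

Note that under this definition a hyphenated word or a contraction is split into several words, each of which is checked separately.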

Each word in the document is checked against the words in the dictionary in order to detect misspelled words. If a word is not found in the dictionary, it is considered to be misspelled. When comparing words in the document with words in the dictionary, use a comparison that is not case-sensitive.

A document file might contain English prose, or it could just be a list of words. The program does the same thing in either case: 1) extracts the words in the file and 2) checks to see if they're in the dictionary.
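The two steps can be sketched as a single pass over the document lines. The function name `find_misspelled` is hypothetical. The sketch assumes that the dictionary is already a set of lower-cased words, that line numbers start at 1 (as in the example output), and that a line number is recorded only once even if the word repeats on that line; the last point is an assumption the handout does not settle.

```python
import re


def find_misspelled(dictionary, document_lines):
    """Map each misspelled word (lower case) to the line numbers where it occurs."""
    misspelled = {}
    for line_number, line in enumerate(document_lines, start=1):
        for word in re.findall(r"[A-Za-z]+", line):
            key = word.lower()
            if key in dictionary:
                continue  # Spelled correctly.
            line_numbers = misspelled.setdefault(key, [])
            # Assumption: list a line once even if the word repeats on it.
            if not line_numbers or line_numbers[-1] != line_number:
                line_numbers.append(line_number)
    return misspelled


dictionary = {"if", "is", "in", "of", "be", "bugs", "the", "then", "them", "must"}
document = [
    "If debugging is the",
    "process of removing bugs.",
    "Then programming must be the",
    "process of putting them in.",
]
result = find_misspelled(dictionary, document)
print(result["process"])  # [2, 4]
```

Because lines are visited in order, each list of line numbers is already in increasing order.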

Output Format

The output of the spelling checker is a sorted list of misspelled words, one word per line.

Each misspelled word is printed in lower case and followed by a colon and a space-separated list of the line numbers where the word is found in the document. The line numbers are given in increasing order.

The output is written to a file, not to the standard output.
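A minimal sketch of the output format follows. The names `format_report` and `write_report` are hypothetical, and the sketch assumes the misspelled words are held in a mapping from lower-cased word to a list of line numbers.

```python
def format_report(misspelled):
    """Return one output line per misspelled word, sorted by word."""
    report = []
    for word in sorted(misspelled):
        # Line numbers are written in increasing order, separated by spaces.
        numbers = " ".join(str(n) for n in sorted(misspelled[word]))
        report.append(f"{word}: {numbers}")
    return report


def write_report(misspelled, output_path):
    """Write the report to a file rather than to the standard output."""
    with open(output_path, "w") as output_file:
        for line in format_report(misspelled):
            output_file.write(line + "\n")


print(format_report({"process": [2, 4], "debugging": [1]}))
# ['debugging: 1', 'process: 2 4']
```

Keeping the formatting separate from the file writing makes the format easy to test without touching the file system.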

Implementation Requirements

  1. Store the dictionary in a Set.
  2. Store the misspelled words in a Map where the key is a word and the value is a List of line numbers.
  3. Store the line numbers for a given misspelled word in a List.
  4. Use a comparison that is not case-sensitive when comparing words in the document with words in the dictionary.
  5. Output the misspelled words in sorted order.
  6. Output the misspelled words in lower case.
  7. Your implementation needs to run on large files in a reasonable amount of time.

Command Line

The program is run with the names of the dictionary, document, and output files given on the command line, in that order. For example, the program might be run like this:

lab2 dictionary.txt document.txt misspelled.txt
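Handling the three arguments might look like the sketch below. The function name `parse_arguments` and the wording of the usage message are hypothetical; the handout specifies only the order of the three file names.

```python
def parse_arguments(argv):
    """Return (dictionary_path, document_path, output_path) from an argument list."""
    if len(argv) != 4:
        program = argv[0] if argv else "lab2"
        raise SystemExit(f"usage: {program} dictionary document output")
    return argv[1], argv[2], argv[3]


# In a real program the list would be sys.argv; a literal list is used here.
paths = parse_arguments(["lab2", "dictionary.txt", "document.txt", "misspelled.txt"])
print(paths)  # ('dictionary.txt', 'document.txt', 'misspelled.txt')
```

Exiting with a usage message when the argument count is wrong is a design choice of this sketch, not a stated requirement.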