Hash Tables

Note: Projects are to be completed by each student individually (not by groups of students).

In this project you will implement the Set abstract data type using a Hash Table.

The program is given a list of commands as input. The program runs each command on a Hash Table Set and reports the result of each command to the output.

Example Input

clear
add bob
add joe
add jim
print
find joe
remove bob
remove joe
remove jim
print
find joe

Example Output

clear
add bob
add joe
add jim
print
hash 0: joe
hash 1: bob
hash 2: jim
find joe true
remove bob
remove joe
remove jim
print
find joe false

Testing

Here are some ideas for tests.

Add that rehashes to a larger table.
Add a duplicate item.
Remove that rehashes to a smaller table.
Remove an item that is not in the table.

Commands

The following commands can be given in the input file. Each command is given on one line in the input file. Each command has an output that is written to the output file.
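One way to read a command line is to split it on whitespace into a command word and an optional item. The following is a minimal sketch under that assumption; the names 'ParsedCommand' and 'parseCommand' are hypothetical and are not required by the project.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper type: one parsed line of the input file.
struct ParsedCommand {
    std::string name;  // "clear", "add", "remove", "find", or "print"
    std::string item;  // stays empty for commands that take no parameter
};

// Split a line on whitespace: the first token is the command name and
// the second token, if present, is the item.
ParsedCommand parseCommand(const std::string& line) {
    ParsedCommand cmd;
    std::istringstream in(line);
    in >> cmd.name;
    in >> cmd.item;  // fails harmlessly when there is no second token
    return cmd;
}
```

For the line "add bob" this yields the name "add" and the item "bob". For "print" the item is left empty.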

Clear

The clear command has the form:

clear

The clear command removes all the items from the set. The set is empty after this operation completes.

The output for the clear command is the same as the input:

clear

Add

The add command has the form:

add item

The add command adds 'item' to the set if the item is not already in the set. If the set already contains the item, the add command leaves the set unchanged. However, the command is still written to the output and processing continues with the next command. The 'item' parameter is a string of non-whitespace characters; it cannot contain a space, tab, or newline because these characters are used to separate parameters.

The output for the add command is the same as the input:

add item

Remove

The remove command has the form:

remove item

The remove command removes 'item' from the set if the item is present in the set. If the set does not contain the item, the remove command leaves the set unchanged. However, the command is still written to the output and processing continues with the next command. The 'item' parameter is a string of non-whitespace characters.

The output for the remove command is the same as the input:

remove item

Find

The find command has the form:

find item

The find command searches for 'item' in the set. The 'item' parameter is a string of non-whitespace characters. The result of the find command is the string 'true' if the item is found and the string 'false' if the item is not found.

If 'item' is found in the set the output for the find command has the form:

find item true

If 'item' is not found in the set the output for the find command has the form:

find item false

Print

The print command has the form:

print

The print command outputs the items stored in the set. The items are output by traversing the hash table from the first bucket through the last bucket. The items in the first bucket are output first, followed by the items in the second bucket, then the items in the third bucket, etc. The items in a given bucket in the table are output on the same line of the output (unless there are too many items as described later). The items on a line are output in the same order that they were added to the bucket.

The output for the print command has the form:

print
hash 0: item1
hash 1: item2 item3
...

Each line of output gives the items stored in the same bucket in the Hash Table. Each line of output has the form:

hash x: item1 item2 ...

where 'x' is the index of the bucket in the table. The items stored in bucket 'x' in the table are output following the 'hash x:' prefix. Each item is separated from the previous output by a single space character. The items in a bucket are output in the same order that they were added to the bucket.

Do not output more than eight items on one line. If a bucket has more than eight items, output multiple lines for that bucket, with the first eight items on the first line, the next eight items on the next line, and so on. Repeat the 'hash x:' prefix on each line of output that is needed for a given bucket. For example, the output for 10 items in bucket 4 could look like this:

hash 4: item1 item2 item3 item4 item5 item6 item7 item8
hash 4: item9 item10
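The eight-items-per-line rule can be sketched as follows. This is an illustration only: 'formatBucket' is a hypothetical name, the items are passed in a plain array for brevity (a real solution would walk its own List), and printing the bare 'hash x:' prefix for an empty bucket is an assumption based on the table listings shown in this document.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Format the items of one bucket, at most eight per output line.
// 'items' holds the bucket's items in the order they were added.
std::string formatBucket(unsigned index, const std::string* items, std::size_t count) {
    const std::size_t kMaxPerLine = 8;
    std::ostringstream out;
    if (count == 0) {
        out << "hash " << index << ":\n";  // assumption: empty buckets print the bare prefix
        return out.str();
    }
    for (std::size_t i = 0; i < count; ++i) {
        if (i % kMaxPerLine == 0) {
            if (i != 0) out << '\n';         // finish the previous line
            out << "hash " << index << ':';  // repeat the prefix on each line
        }
        out << ' ' << items[i];              // single space before every item
    }
    out << '\n';
    return out.str();
}
```

Called with bucket index 4 and the ten items item1 through item10, this produces the two lines shown above.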

The Table

The Set must be implemented using a Hash Table that uses chaining to resolve collisions. At a minimum, a Hash Table should contain:

A pointer to the array used for the hash table.
The current size of the array used for the hash table.
The count of how many items are stored in the hash table.

The initial size of the hash table array is zero.
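The three members listed above might be declared as in the following sketch. The class name 'HashSet', the member names, and the empty 'List' stand-in are assumptions made for illustration; your own List class from the List project takes the place of the stand-in.

```cpp
#include <string>

// Stand-in for the List class from the List project (hypothetical, empty here).
template <typename T>
class List {};

template <typename T>
class HashSet {
public:
    // The table starts with no array at all: size zero, count zero.
    HashSet() : table(nullptr), tableSize(0), itemCount(0) {}
    ~HashSet() { delete[] table; }

    // Copying is disabled in this sketch to avoid deleting 'table' twice.
    HashSet(const HashSet&) = delete;
    HashSet& operator=(const HashSet&) = delete;

    unsigned capacity() const { return tableSize; }
    unsigned size() const { return itemCount; }

private:
    List<T>* table;      // pointer to the array of buckets
    unsigned tableSize;  // current number of buckets in the array
    unsigned itemCount;  // number of items stored in the set
};
```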

The Buckets

The buckets in the table must be implemented using Lists. Use your code from the List project to provide the Lists for the buckets in the hash table. You are not allowed to use any containers from the standard library.

The Hash Function

The Hash Table needs to be able to use a Hash Function that is designed specifically for the type of data stored in the table. If strings are stored in the Set, the Hash Table needs to use a Hash Function for strings. If Students are stored in the Set, the Hash Table needs to use a Hash Function for Students.

An easy way to allow the table to use the right Hash Function for the type of data in the table is to write a global 'hashCode' function for each kind of item. The functions are global because they are not defined inside of any class declaration. It is important that all the 'hashCode' functions use the same name.

The 'hashCode' function for a string could look like this:

unsigned hashCode( const std::string& s ) {

  ...

}

The 'hashCode' function for a Student could look like this:

unsigned hashCode( const Student& s ) {

  ...

}

Whenever the Hash Table needs the result of the Hash Function for an item in the table, it calls the 'hashCode' function. For example, the code could look like this:

unsigned index = hashCode(item);

If 'item' is a string this will call the 'hashCode' function that takes a string as a parameter. If 'item' is a Student this will call the 'hashCode' function that takes a Student as a parameter. For this project you only need to write a 'hashCode' function for strings (you don't need a 'hashCode' function for students).
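The following sketch shows how the compiler selects a 'hashCode' overload from the type of the argument. The 'Student' struct, the placeholder function bodies, and the 'bucketIndex' helper are all hypothetical; in particular, the bodies are not the hash function required by this project.

```cpp
#include <string>

// Hypothetical item type, used only to illustrate overloading.
struct Student {
    unsigned id;
    std::string name;
};

// Placeholder bodies: these are NOT the hash functions required by the project.
unsigned hashCode(const std::string& s) { return static_cast<unsigned>(s.size()); }
unsigned hashCode(const Student& s) { return s.id; }

// Generic code calls hashCode(item); the compiler picks the overload
// that matches the type of 'item'.
template <typename T>
unsigned bucketIndex(const T& item, unsigned tableSize) {
    return hashCode(item) % tableSize;  // caller must ensure tableSize > 0
}
```

Because the template only mentions the name 'hashCode', supporting a new item type means adding one more overload and leaving the table code unchanged.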

Hash Function for Strings

You must use the following hash function for strings.

initialize hashIndex to zero
(use 'unsigned' as the type for 'hashIndex')

for each character c in the string
  multiply hashIndex by 31
  add the ASCII code for c to hashIndex
end for

mod hashIndex with the hash table array size

For example, the lowercase letters 'b' and 'o' have ASCII codes 98 and 111 respectively. The hash function applied to the string "bob" would give:

(((0 * 31 + 98) * 31 + 111) * 31 + 98)

The formula evaluates to 97717. If the size of the hash table array is 3, the number 97717 is modded with 3 to give an index of 1.

97717 = 32572 * 3 + 1

Note that the type of 'hashIndex' must be an unsigned integer because the computation that combines all the characters in the string may cause 'hashIndex' to overflow. In C++, overflow of a signed integer is undefined behavior, and in practice it often produces a negative value. The result of a 'mod' operation on a negative value may be a negative value. You can't use a negative value as a hash table index because such an index would be out of bounds. Unsigned arithmetic wraps around on overflow instead, so an unsigned 'hashIndex' never holds a negative value.

Note that the 'hashIndex' must be a 32-bit unsigned integer type.
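The pseudocode above could be written in C++ roughly as follows. This is a sketch, not the required implementation. It assumes that the final mod step is applied by the caller, because the 'hashCode' signature shown earlier does not receive the table size.

```cpp
#include <cstdint>
#include <string>

// Combine the characters as described above. Assumption: the caller
// applies the final "mod by the array size" step.
std::uint32_t hashCode(const std::string& s) {
    std::uint32_t hashIndex = 0;
    for (unsigned char c : s) {
        hashIndex = hashIndex * 31u + c;  // unsigned arithmetic wraps around on overflow
    }
    return hashIndex;
}
```

With this sketch, hashCode("bob") returns 97717, and 97717 % 3 gives the bucket index 1 from the example above.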

Adding

If the item to be added is a duplicate, do not modify the table.

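If the table is already full, rehash to a larger size before adding the new item (see the Rehashing section below). A table of size zero that holds zero items is full by this definition, so the first add to an empty table always rehashes to size one.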

Use the Hash Function to find the index of the bucket to which the item should be added.

Add the new item to the end of the list of items that previously hashed to the selected bucket.

The 'add' operation must run in average-case O(1) time.

Rehashing

When the hash table becomes either too full or too empty, the items in the table must be rehashed into a new table of a different size. Each item needs to be 'hashed' into the new table because changing the size of the hash table array causes the hash function to change.

The items are rehashed by traversing the old hash table from the first bucket through the last bucket. The items in the first bucket are rehashed first, followed by the items in the second bucket, then the items in the third bucket, etc. The items in a given bucket are rehashed in the same order that they were added to the bucket.

For example, if the following table needs to be rehashed, "bob" would be rehashed first, "zed" would be rehashed second, and "ned" would be rehashed last.

hash 0:
hash 1: bob
hash 2: zed ned
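The traversal order can be sketched with a bare linked chain per bucket. The 'Node' struct, 'appendNode', and 'rehash' are hypothetical stand-ins for your own List code, and the sketch assumes the string hash function described earlier with the mod step applied by the caller.

```cpp
#include <cstdint>
#include <string>

// Minimal singly linked chain node (stand-in for the List project code).
struct Node {
    std::string item;
    Node* next;
};

std::uint32_t hashCode(const std::string& s) {
    std::uint32_t h = 0;
    for (unsigned char c : s) h = h * 31u + c;
    return h;
}

// Append 'node' to the end of the chain that starts at 'bucket'.
void appendNode(Node*& bucket, Node* node) {
    node->next = nullptr;
    if (bucket == nullptr) { bucket = node; return; }
    Node* tail = bucket;
    while (tail->next != nullptr) tail = tail->next;
    tail->next = node;
}

// Move every node from oldTable into a new array of newSize buckets,
// visiting buckets first to last and each chain front to back.
// Deletes the old array and returns the new one (newSize must be > 0
// whenever the old table still holds items).
Node** rehash(Node** oldTable, unsigned oldSize, unsigned newSize) {
    Node** newTable = new Node*[newSize]();  // every bucket starts as nullptr
    for (unsigned b = 0; b < oldSize; ++b) {
        Node* node = oldTable[b];
        while (node != nullptr) {
            Node* next = node->next;  // save the link before relinking the node
            appendNode(newTable[hashCode(node->item) % newSize], node);
            node = next;
        }
    }
    delete[] oldTable;
    return newTable;
}
```

If the example table above were rehashed into a single bucket, the chain in that bucket would hold "bob", "zed", "ned" in that order, which makes the traversal order visible.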

Rehashing to a Larger Table

When an item is to be added to the table and the table is already full, the items in the table must be rehashed into a larger table before the new item is added. The table is full when the number of items stored in the table is equal to the size of the hash table array. (The table is full when the load factor of the table is 1.0.)

The next larger size for the hash table array is two times the size of the current hash table array, plus one more.

new_array_size = old_array_size * 2 + 1

For example, the following table is full. The number of items in the table (three) is equal to the number of buckets in the table (three). If another item "zed" is added, the table must be rehashed before "zed" is added. The size of the new table will be seven (3 * 2 + 1).

hash 0: joe
hash 1: bob
hash 2: jim
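The "full" test and the growth rule can be expressed as two small helpers. The names 'isFull' and 'nextLargerSize' are hypothetical.

```cpp
// True when the table has no room for another item (load factor 1.0).
// A table of size zero holding zero items also counts as full.
bool isFull(unsigned itemCount, unsigned tableSize) {
    return itemCount == tableSize;
}

// Next larger array size: twice the current size, plus one.
unsigned nextLargerSize(unsigned oldSize) {
    return oldSize * 2 + 1;
}
```

Starting from the initial size of zero, repeated growth gives the sizes 0, 1, 3, 7, 15, and so on.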

Rehashing to a Smaller Table

When removing an item from the table causes the table to be less than half full, the items in the table must be rehashed into a smaller table after the item is removed.

The next smaller size for the hash table array is half the size of the current hash table array.

new_array_size = old_array_size / 2

Note that this is equivalent to subtracting one before dividing the old size in half because every nonzero table size is odd and the division is an integer division.

new_array_size = (old_array_size - 1) / 2

For example, the item "zed" was just removed from the following table leaving the table less than half full. The number of items in the table (three) is less than half the number of buckets in the table (seven). The table must be rehashed to a smaller size. The size of the new table will be three ((7 - 1) / 2).

hash 0:
hash 1:
hash 2: joe
hash 3:
hash 4: bob
hash 5:
hash 6: jim
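Take care when testing for "less than half full" with integer arithmetic. In the example above, the integer expression 7 / 2 evaluates to 3, so the comparison 3 < 7 / 2 is false even though three items is less than half of seven buckets. The sketch below avoids the problem by doubling the item count instead of halving the table size. The names 'shouldShrink' and 'nextSmallerSize' are hypothetical.

```cpp
// "Less than half full" without integer-division rounding:
// itemCount < tableSize / 2.0 is the same test as itemCount * 2 < tableSize.
bool shouldShrink(unsigned itemCount, unsigned tableSize) {
    return itemCount * 2 < tableSize;
}

// Next smaller array size, using integer division: 7 -> 3, 3 -> 1, 1 -> 0.
unsigned nextSmallerSize(unsigned oldSize) {
    return oldSize / 2;
}
```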

Removing

If the item to be removed is not in the table, do not modify the table.

Use the Hash Function to find the index of the bucket from which the item should be removed.

Remove the item from the selected bucket.

If removing the item causes the number of items in the table to be less than half the size of the hash table array, rehash the table to a smaller size.

The 'remove' operation must run in average-case O(1) time.

Finding

Use the Hash Function to find the index of the bucket in which the item may be found.

Sequentially search the selected bucket starting with the first item that was added to the bucket.

The 'find' operation must run in average-case O(1) time.
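A search over one bucket might look like the following sketch. The 'Node' struct and the 'contains' function are hypothetical stand-ins for your own List code. The check for a table size of zero is there because computing an index with a modulus of zero is undefined.

```cpp
#include <cstdint>
#include <string>

// Minimal singly linked chain node (stand-in for the List project code).
struct Node {
    std::string item;
    Node* next;
};

std::uint32_t hashCode(const std::string& s) {
    std::uint32_t h = 0;
    for (unsigned char c : s) h = h * 31u + c;
    return h;
}

// Search only the bucket selected by the hash function, front to back.
bool contains(Node* const* table, unsigned tableSize, const std::string& item) {
    if (tableSize == 0) return false;  // empty table: avoid modulo by zero
    for (const Node* n = table[hashCode(item) % tableSize]; n != nullptr; n = n->next) {
        if (n->item == item) return true;
    }
    return false;
}
```

Only one chain is examined per call, which is what keeps the average cost constant while the load factor stays at or below 1.0.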

Using the Standard Library

You must implement the Hash Table Set using only arrays, pointers, and objects. You are not allowed to use data types such as vector, list, set, map, etc. from the standard library.

Implementation Requirements

The implementation must store the items in the set in a hash table and the hash table must use chaining to resolve collisions.
Use zero for the initial size of the hash table array.
The 'clear' operation must set the hash table array size back to zero.
When rehashing to a larger size, use a new array of size:
```
old_hash_table_array_size * 2 + 1
```
When rehashing to a smaller size, use a new array of size:
```
(old_hash_table_array_size - 1) / 2
```
When printing or rehashing, iterate through the table from the first bucket to the last bucket and within each bucket from the first item added to the last item added.
When you add an item to a bucket in the hash table, add the new item to the end of the chain of items that previously hashed to that bucket.
Use your own code from the List project to provide the Lists for the buckets in the hash table.
The 'add', 'remove', and 'find' operations must run in average-case O(1) time.
You are not allowed to use data types such as vector, list, set, map, etc. from the standard library.
Operations that remove items from a bucket such as 'clear' and 'remove' must 'delete' any nodes that are no longer used in the bucket.
Operations that cause the table to rehash to a new size must 'delete' the old hash table array that is no longer used.
The program must pass a memory leak test such as valgrind.
The implementation of the set must use a C++ template so that the set is able to store objects of any valid data type.
Write your own code for the hash table. Do not use code from the book, the internet, or class notes.
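The memory requirements above can be illustrated with a sketch that releases every chain and then the array itself. The 'Node' struct and the functions 'clearChain' and 'clearTable' are hypothetical; your own List class would normally own the node cleanup.

```cpp
#include <string>

// Minimal singly linked chain node (stand-in for the List project code).
struct Node {
    std::string item;
    Node* next;
};

// Delete every node in a chain and leave the bucket empty.
void clearChain(Node*& head) {
    while (head != nullptr) {
        Node* next = head->next;  // save the link before deleting the node
        delete head;
        head = next;
    }
}

// Release all chains, then the array, and reset to the empty state
// (array size zero, item count zero).
void clearTable(Node**& table, unsigned& tableSize, unsigned& itemCount) {
    for (unsigned b = 0; b < tableSize; ++b) clearChain(table[b]);
    delete[] table;
    table = nullptr;
    tableSize = 0;
    itemCount = 0;
}
```

Running the finished program under a tool such as valgrind is still the way to confirm that nothing leaks.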

Command Line

The program is run with the names of the 'Command' and 'Output' files given on the command line. For example, the program might be run like this:

lab7 command.txt output.txt

When the program is run this way, it runs the commands given in the 'command.txt' file and writes the output from the commands to the 'output.txt' file. Note that the names given on the command line may not always be 'command.txt' and 'output.txt'.
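Opening the two files named on the command line could be sketched as follows. The function name 'openFiles' is hypothetical, and returning false when the arguments are missing is an assumption, since this document does not say how errors should be handled.

```cpp
#include <fstream>

// Returns true when both files named on the command line could be opened.
// Expected usage: program commandFile outputFile
bool openFiles(int argc, char* argv[], std::ifstream& in, std::ofstream& out) {
    if (argc != 3) return false;      // wrong number of arguments
    in.open(argv[1]);
    if (!in.is_open()) return false;  // do not create the output file if input is missing
    out.open(argv[2]);
    return out.is_open();
}
```

The file names are taken from 'argv' and are never hard-coded, so the program works with any pair of names.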

Computer Science 235