Hash Tables (9.3)

BSTs provide Set operations (add/remove/find) in O(log n) time.
Hashing can do these operations in O(1) time in some cases.

How can you find something in a large set in constant time?

Suppose you need to create a set of employee ID numbers. Each ID number is between 1 and 1000.
How can you store the numbers so you can add/remove/find in O(1) time?
  add(id)
  remove(id)
  find(id)

Could you use this approach to store larger numbers?
Suppose you are storing only 1000 different numbers, but the numbers range in value from 1 to 1000000.
Would you make an array of size 1000000?
Map the large numbers into numbers between 0 and 999.
  23456
  168421

Could you use this approach to store items (like Strings) that are not numbers?
Convert the strings to numbers.
  "421"
  "abba"

What's a Hash Table?
  an array used to store a collection of items
  the items are stored at the index given by a hash function

What's a Hash Function?
  a function that maps an item into a number
  the number is in the range of a valid array index
  the function usually contains a mod by table size (x % tableSize)

What's a Collision?
  when two or more items hash to the same number
  resolve with chaining, linear probing, or quadratic probing

Chaining

How do you resolve collisions using Chaining?
  the hash table is an array of linked lists
  each list contains the items that hash to the same index

Show the result of inserting 2, 7, 3, 8 into a hash table of size 5.

How do you find an item in a hash table using Chaining?
  use the hash function to select a list in the table
  sequentially search the list for the item

Show the result of finding 8 and 12 in the hash table.

Classwork
You may work with a partner.
Show the result of inserting 7, 6, 1, 2, 9 into a hash table of size 4.

Performance Analysis of Chaining

What's the Load Factor of a hash table?
  the number of items in the table divided by the table size
  LoadFactor = n/m
    n is the number of items in the table
    m is the size of the array used for the table

How does the Load Factor affect the performance of a hash table?

How many compares are needed (on average) to find an item in a chaining hash table?
  Compares = 1 + LoadFactor/2 (for a successful search) (Knuth, vol 3)
  (the average number of items in each bucket equals the LoadFactor)
  1.25 compares when Load Factor is 0.5
  1.50 compares when Load Factor is 1.0
  2.00 compares when Load Factor is 2.0

What's a good Load Factor to use with chaining?
  1.0
  lower doesn't significantly improve performance
  higher can save space

What's the Big-Oh bound of add/remove/find for a chaining hash table?
  average-case is O(1 + n/m) (items distributed evenly over the buckets)
    this is O(1) when the Load Factor is held constant
  worst-case is O(n) (all items hash to the same bucket)

Which is better, Hashing or AVL trees?
  hashing is O(1) (average case)
  trees are O(log n) (worst case)
  hashing is simpler
  trees keep data sorted

Hash Functions

What are the characteristics of a good Hash Function?
  1. gives a number between 0 and tableSize-1
  2. easy and fast to compute
  3. distributes items evenly throughout the hash table

Would it make sense to use random numbers in a hash function to help ensure an even distribution of items?

What's a typical hash function for integers?
  value % tableSize

A hash function for Strings must convert a String to a number.
What's wrong with this code for converting a String to a number?
How large can hashCode be for a string of 10 characters?
What's true about hashCode for two strings like "bat" and "tab"?

unsigned hashCode( string item, unsigned tableSize ) {
  unsigned hashCode = 0;
  // add up the character codes of the string
  for (unsigned i = 0; i < item.length(); i++)
    hashCode = hashCode + item.at(i);
  // map the sum into a valid table index
  return hashCode % tableSize;
}

How is this hash function better than the previous one?
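
unsigned hashCode( string item, unsigned tableSize ) {
  unsigned hashCode = 0;
  // multiply the running value by 31 before adding each character code
  for (unsigned i = 0; i < item.length(); i++)
    hashCode = hashCode * 31 + item.at(i);
  // map the result into a valid table index
  return hashCode % tableSize;
}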
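
To tie the pieces together, here is a minimal sketch of a chaining hash table for strings. It assumes the multiply-by-31 hash function above and stores each bucket as a std::list. The class name ChainingHashSet and its member names are hypothetical, chosen only for illustration; they are not from the notes or the textbook.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Hypothetical illustration class: a set of strings stored with chaining.
class ChainingHashSet {
public:
    explicit ChainingHashSet(unsigned tableSize)
        : table(tableSize), count(0) {
        assert(tableSize > 0);  // the hash function takes a mod by table size
    }

    // Adds item to its bucket; returns false if it was already present.
    bool add(const std::string& item) {
        std::list<std::string>& bucket = table[hashCode(item)];
        if (std::find(bucket.begin(), bucket.end(), item) != bucket.end())
            return false;
        bucket.push_back(item);
        ++count;
        return true;
    }

    // Removes item from its bucket; returns false if it was not found.
    bool remove(const std::string& item) {
        std::list<std::string>& bucket = table[hashCode(item)];
        auto it = std::find(bucket.begin(), bucket.end(), item);
        if (it == bucket.end())
            return false;
        bucket.erase(it);
        --count;
        return true;
    }

    // Selects a list with the hash function, then searches it sequentially.
    bool find(const std::string& item) const {
        const std::list<std::string>& bucket = table[hashCode(item)];
        return std::find(bucket.begin(), bucket.end(), item) != bucket.end();
    }

    // LoadFactor = n/m
    double loadFactor() const {
        return static_cast<double>(count) / table.size();
    }

    // Same idea as the multiply-by-31 function in the notes; the cast to
    // unsigned char is an added precaution for characters outside ASCII.
    unsigned hashCode(const std::string& item) const {
        unsigned h = 0;
        for (char c : item)
            h = h * 31 + static_cast<unsigned char>(c);
        return h % static_cast<unsigned>(table.size());
    }

private:
    std::vector<std::list<std::string>> table;  // m buckets
    std::size_t count;                          // n items
};
```

For example, after constructing ChainingHashSet s(4) and calling s.add("bat") and s.add("tab"), s.loadFactor() is 0.5 and s.find("abba") is false. The sketch never resizes the table, so the Load Factor grows as items are added; a fuller implementation would likely rehash into a larger table once the Load Factor passes a chosen limit.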