Hash Table
- Solves two problems when storing in Binary Search Trees
- Cannot store data that is not comparable
- Insert and search are O(logn)
- Utilizes the constant time access of Arrays
- Process
- Data -[Hash Function]→ Hash Code -[Reduce]→ Index
- Reduce(Hash): Hash % Buckets
- Insert data at index
- If there is other data at that index, then use a collision resolution mechanism
- Hash Maps are a Hash Table that is not thread-safe
Hash Functions
- Converts a data object to a hash code
- Properties
- Necessary
- Input: Object x
- Output: An integer representation of x
- If x = y, H(x) = H(y)
- Best if
- If x != y, H(x) != H(y)
- Evenly distributed
- Easy to compute
- Collisions = when two unique inputs have the same hash
- Possible whenever the Pigeonhole Principle applies
- That would always apply because giving every input a unique output would lead to overflows
- Ex: Strings
- H(x): (31 ^ char idx) * ascii value
- Prime numbers prevent collisions because they prevent grouping when a modulo reduce function is applied
Terminology
- Buckets
- Total slots in the hash table structure
- Load factor
- elements / buckets
- Elements is the number of elements, not rows (multiple elements in one row still increases the factor)
- When surpasses a certain value, then move to a larger table using rehashed values
- Hash stays the same, but the reduce function will adapt to the larger table
- Happens when that is hit, not on the next insert
- Is normally around 0.75
Collision Resolution Strategies
Separate Chaining (Closed Addressing)
- Buckets store linked list, collisions are appended to the list
- This can be optimized by using a binary search tree instead of a linked list
- When searching, find the bucket that contains the item and then search the linked list for the element
- Allows for a load factor > 1, though that is not preferable
- Less commonly followed because the data is no longer stored contiguously, which slows down access times
Open Addressing (Closed Hashing)
- All entries are directly in a bucket
- Collisions are added to empty spot dependent on probing type
- Linear Probing: next empty spot
- Quadratic Probing: try hash + n2 each time one is already filled… (prevents clustering)
- When searching, follow the probing type until you find the element
- Experimentally faster because contiguous memory access is faster
- Index is not directly determined by the hash code (ie. index is open)
- Finding after deletion is complicated
- Because an element could have been shifted and then the first occupant was deleted, making it look like the element is not in the table
- Solution: Statuses
- Each bucket gets a status of never used, deleted, occupied
- If you check a bucket at an index for an element and it was never used, the element is not in the table
- If the status was the other two, you should probe until you either see a never used or find the element
- We count each failed attempt as another collision (so if the probing function moves twice it counts as 2 collisions)
- Search/Index/Delete
- Best/Average case: O(1)
- Worst case: O(n)
- Everything is mapping to the same index (bad hash function)