Hash Table

  • Solves two problems when storing in Binary Search Trees
    • Cannot store data that is not comparable
    • Insert and search are
  • Utilizes the constant time access of Arrays
  • Process
    • Data -[Hash Function] Hash Code -[Reduce] Index
      • Reduce(Hash): Hash % Buckets
    • Insert data at index
    • If there is other data at that index, then use a collision resolution mechanism
  • Hash Maps are a Hash Table that is not thread-safe

Hash Functions

  • Converts a data object to a hash code
  • Properties
    • Necessary
      • Input: Object x
      • Output: An integer representation of x
      • If x = y, H(x) = H(y)
    • Best if
      • If x != y, H(x) != H(y)
      • Evenly distributed
      • Easy to compute
  • Collisions = when two unique inputs have the same hash
    • Possible whenever the Pigeonhole Principle applies
    • That would always apply because giving every input a unique output would lead to overflows
  • Ex: Strings
    • H(x): (31 ^ char idx) * ascii value
    • Prime numbers prevent collisions because they prevent grouping when a modulo reduce function is applied

Terminology

  • Buckets
    • Total slots in the hash table structure
  • Load factor
    • elements / buckets
    • Elements is the number of elements, not rows (multiple elements in one row still increases the factor)
    • When surpasses a certain value, then move to a larger table using rehashed values
      • Hash stays the same, but the reduce function will adapt to the larger table
      • Happens when that is hit, not on the next insert
      • Is normally around 0.75

Collision Resolution Strategies

Separate Chaining (Closed Addressing)

  • Buckets store linked list, collisions are appended to the list
    • This can be optimized by using a binary search tree instead of a linked list
  • When searching, find the bucket that contains the item and then search the linked list for the element
  • Allows for a load factor > 1, though that is not preferable
  • Less commonly followed because the data is no longer stored contiguously, which slows down access times

Open Addressing (Closed Hashing)

  • All entries are directly in a bucket
  • Collisions are added to empty spot dependent on probing type
    • Linear Probing: next empty spot
    • Quadratic Probing: try hash + each time one is already filled… (prevents clustering)
  • When searching, follow the probing type until you find the element
  • Experimentally faster because contiguous memory access is faster
  • Index is not directly determined by the hash code (ie. index is open)
  • Finding after deletion is complicated
    • Because an element could have been shifted and then the first occupant was deleted, making it look like the element is not in the table
    • Solution: Statuses
      • Each bucket gets a status of never used, deleted, occupied
      • If you check a bucket at an index for an element and it was never used, the element is not in the table
      • If the status was the other two, you should probe until you either see a never used or find the element
  • We count each failed attempt as another collision (so if the probing function moves twice it counts as 2 collisions)

Performance

  • Search/Index/Delete
    • Best/Average case:
    • Worst case:
      • Everything is mapping to the same index (bad hash function)