Huffman Trees

  • Solves the optimization problem to compress a file into the minimum number of bits
  • A form of lossless compression
  • Does not compress when every character has the same weight
  • Greedy algorithm because the most frequently used characters get the shortest encoding

Structure

  • A full Binary Tree where each leaf corresponds to a character
  • We collect the frequency of different characters and build a tree whose paths represents characters
    • Characters with larger frequencies have shorter distances from the root
    • Branching left or right from a given node represents 0 or 1 respectively
    • The path taken to arrive at a character’s leaf is the Huffman encoding for that character

Algorithm

  1. Create a table with symbols and their frequencies
  2. Construct a set of trees with root nodes that contain each of the individual symbols and their weight (frequency)
  3. Place the set of trees into (min) priority queue
  4. while the priority queue has more than one item
    1. Remove the two trees with the smallest weights
    2. Combine them into a new binary tree in which the weight of the tree root is the sum of the weights of its children
    3. Insert the newly created tree back into the priority queue
  5. Traverse the resulting tree to obtain binary codes for characters (append 0 to code when towards left and 1 when towards right)
Procedure Huffman(C):
	n = C.size
	Q = priority_queue()
 
	for i = 1 to n  
		x = node(C[i])
		Q.push(x)
	end for
 
	while Q.size() is greater than 1
		Z = new node()
		Z.left = x = Q.pop
		Z.right = y = Q.pop
		Z.freq = x.freq + y.freq Q.push(Z)
	end while
	
	Return Q.top()