Data structures and Algorithms: Hashing

Keywords

Collision, Bucket, hash function, key

Hash tables can be used to implement the insert and find operations in constant average time

Its an array with searching of time o(1)

Collision – two keys mapped to same slot

Complexity :

insertions, deletions, and finds in constant average time

Applications :

Implement set in constant time

implement dictionaries, online spelling checkers. If misspelling detection (as opposed to correction) is important, an entire dictionary can be prehashed and words can be checked in constant time. Hash tables are well suited for this purpose because the words do not have to be alphabetized.

Compilers use hash tables to keep track of declared variables in source code. The data structure is called a symbol table.

Application	Action	Key	Value
phone book	look up phone number	person's name	phone number
dictionary	look up word	word	Definition
Internet DNS	Look up website by IP address	website	IP address
Reverse DNS	Look up IP address by web site	IP address	Website
genomics	amino acid dictionary	codon	amino acid
Java compiler	Find properties of variable	Variable name	Value and type
stock quote	Look up price of stock	stock symbol	Price
file share	find song to download	song name	Machine
file system	find file on hard drive	file name	location on hard drive

Hash functions – to map keys into index of buckets in table
Good hash function satisfies uniform hashing – each key equally hash(distributes) to slots
For example, a key that is a character string can be interpreted as an integer expressed in suitable radix notation. Thus, the identifier pt might be interpreted as the pair of decimal integers (112,116), since p = 112 and t = 116 in the ASCII character set; then, expressed as a radix-128 integer, pt becomes (112 128) + 116 = 14452. It is usually straightforward in any given application to devise some such simple method for interpreting each key as a (possibly large) natural number.
(1)    Division method :
      h(k) = k mod m .
Good values for m are primes not too close to exact powers of 2. For example, suppose we wish to allocate a hash table, with collisions resolved by chaining, to hold roughly n = 2000 character strings, where a character has 8 bits. We don't mind examining an average of 3 elements in an unsuccessful search, so we allocate a hash table of size m = 701. The number 701 is chosen because it is a prime near = 2000/3 but not near any power of 2. Treating each key k as an integer, our hash function would be h(k) = k mod 701
(2)    Multiplication method
(3)    Universal hashing

Collision by chaining –
Separate chaining hashing is a space efficient alternative to quadratic probing in which an array of linked lists is maintained

•    In chaining, we put all the elements that hash to the same slot in a linked list,
•    The worst-case behavior of hashing with chaining is terrible: all n keys hash to the same slot, creating a list of length n(like linked list). The worst-case time for searching is thus (n) plus the time to compute the hash function--no better than if we used one linked list for all the elements.

Collisions by open addressing -
•    In open addressing, all elements are stored in the hash table itself. That is, each table entry contains either an element of the dynamic set or NIL. When searching for an element, we systematically examine table slots until the desired element is found or it is clear that the element is not in the table. There are no lists and no elements stored outside the table, as there are in chaining
•    To perform insertion using open addressing, we successively examine, or probe, the hash table until we find an empty slot in which to put the key.
•    Three techniques are commonly used to compute the probe sequences required for open addressing: linear probing, quadratic probing, and double hashing.

Linear Probing

In linearprobing, collisions are resolved by sequentially scanning table (with wraparound) until an empty cell is found. For example, if there is only one free cell left in the table, we may have to search the entire table to find it

The find algorithm merely follows the same path as the insert algorithm. If it reaches an empty slot, the item we are searching for is not found; otherwise, it finds the match eventually.

Standard deletion cannot be performed because, as with a binary search tree, an item in the hash table not only represents itself, but it also connects other items by serving as a placeholder during collision resolution. Thus, if we removed 89 from the hash table, virtually all the remaining find opera- tions would fail. Consequently, we implement lazy deletion, or marking items as deleted rather than physically removing them from the table. This information is recorded in an extra data member. Each item is either active or deleted.

Quadratic Probing :

Quadratic probing examines cells 1,4,9, and so on, away from the original probe point.
if the hash function evaluates to H and a search in cell H is inconclusive, we try cells H + 1 *, H + 22. H + 3*, . . . , H + i2 (employing wraparound) in sequence

Double hashing is one of the best methods available for open addressing
h(k, i) = (h1(k) + ih2(k)) mod m,

Perfect hashing –
• Two level hashing scheme with universal hashing at each level
• First level same as hashing with chaining , instead of making a list of keys hashing to slot J, use a secondary hash table with associated hash function.