Introduction to Hashing: What You Need to Know

Introduction to Hashing: What You Need to Know

Hashing 101: What It Is and Why It Matters

Ever head of associative arrays? also known as map or dictionaries. One of the most popular data structures to implement associative arrays as hash tables.

Wait but what is hashing?

Hashing is a technique used to store and retrieve data as quickly as possible. It is used to perform optimal searches and is widely used in Symbol Tables. Symbol Tables are typically used in compilers and stores items such as variable names, constant, procedures, function names etc..

Hashing is commonly used in applications where fast data retrieval is crucial, with an average time complexity of O(1). Although the worst-case time complexity can be O(n), the key advantage of hashing lies in its average-case efficiency, which remains constant and enables quick lookups.

Hash Table ADT

The hash table structure is unordered collection of associations between a key and value. The keys are all unique in Hash Table so there is a one-to-one relationship between a key and value. The operations are given below :-

  • HashTable :- Creates a new hash table

  • Get :- Searches in hash table with key and return the value if it finds the element with given key.

  • Put :- Inserts a new key-value pair into hash table.

  • Delete :- Deletes a key-value pair from hash table

  • DeleteHashTable :- Deletes the hash table

So, it's a key-value storage where keys are strings and values are data? Well, not exactly. Keys can also be integers. But if we use numbers as keys, won’t it grow infinitely, causing the same issue as arrays?

This is where a hash function becomes essential. Hash functions provide a way to quickly compute the memory location of a key, significantly reducing the time to find or insert data.

The process of mapping the keys to available memory locations is called hashing, and it allows for average O(1) time complexity in lookups, making it much more efficient than arrays in scenarios with complex keys or large datasets. This brings us to next topic components of hashing.

Components of Hashing

It mainly has 4 key components :-

  • Hash Table

  • Hash Function

  • Collisions

  • Collision Resolution Techniques

Hash Table

With an array we can store the elements whose key K is at a position K of they array. That means given a key K, we find the element whose key is k by just looking at Kth position of array. This is known as direct addressing.

Direct Addressing is applicable where we can afford to allocate an array with one position for every possible key. But we wont have enough space every time. In these scenarios its best to use hash tables. Hash table or hash map is a data structure that stores the keys and their associated values.

The general convention is that we use a hash table when the number of keys actually stored is small relative to the number of possible keys.

Hash table is collection of items which are stored in such a way as to make it easy to find them later. Each position of the hash table often referred as a slot (or bucket) can hold an item and is named an integer value starting at 0.

Hash Function

A hash function is a mathematical function used to transform a key (such as a number, string, or object) into a slot index or bucket index within a hash table. The goal of a hash function is to distribute keys as evenly as possible across the available slots to minimize collisions, where two or more keys map to the same index.

An ideal hash function would map each item to a unique slot, known as a perfect hash function. Unfortunately, perfect hash functions are rare in practice, especially when the number of possible keys is large or infinite. As a result, collisions are unavoidable in most real-world applications.

Good hash functions aim to minimize collisions by distributing keys uniformly across the available slots, ensuring that different keys are unlikely to end up in the same bucket. To achieve this, hash functions typically take into account the characteristics of the keys, such as their numerical or string properties, and apply transformations that scatter the keys over the hash table's index range.

Key properties of a good hash function include:

  • Determinism: The same key should always hash to the same index.

  • Uniform Distribution: The keys should be spread uniformly across the hash table, minimizing clustering.

  • Efficiency: The hash function should be computationally efficient, allowing for quick insertion and lookup operations.

  • Low Collisions: A good hash function should minimize the number of collisions to reduce the need for collision resolution strategies like separate chaining or probing.

Load Factor

The load factor of non empty hash table is the number of items stored in the table divided by the size of the table. This is the decision parameter used when we want to rehash or expand the existing hash table entries. It essentially tells whether the hash function is distributing they keys uniformly or not.

$$\text{Load factor} = \frac{\text{Number of elements in hash table}}{\text{Size of hash table}}$$

Collisions

Hash functions are used to map each key to different address space, but practically its not possible to create such a hash function and the problem is called collision. It is the condition where 2 keys are hashed to same slot.

Collision Resolution Techniques

Fortunately, there are effective techniques to resolve conflicts created by collisions. The process of finding an alternate location for a key in the case of a collision is called collision resolution. Some of the techniques are :-

  • Direct Chaining (or closed addressing) :- An array of linked list application

    • Separate chaining
  • Open Addressing :- Array based implementation. In this all keys are stored in the hash table itself and is known as closed hashing.

    • Linear probing (linear search)

    • Quadratic probing (non linear search)

    • Double hashing (use multiple hash functions)

Separate Chaining

In separate chaining, each slot in the hash table contains a pointer to a linked list (or chain) of elements. When a collision occurs, the new element is simply added to the list.

Example: Let's assume the hash function is hash(key) = key % tablesize .

Hash table:

IndexList (Chaining)
010 → 15
1
220
3
430 → 25

When the key 15 is inserted and hashes to index 0, it is added to the chain at that index, as 10 is already there.


Linear Probing

In linear probing, when a collision occurs, the algorithm looks for the next available slot by moving one position forward (wrapping around the table if necessary). If the slot is occupied, it checks the next one, continuing this until an empty slot is found.

Example: Hash function: hash(key) = (n + 1 ) % tablesize .

Hash table:

IndexValue
010
115
220
325
430

Initially, key 15 hashes to index 0, but index 0 is occupied by 10, so we move to the next available slot, which is index 1.


Quadratic Probing (Mid-Squaure Method)

Quadratic probing solves collisions by checking indices that are quadratically spaced from the original collision location. Instead of moving linearly, it jumps by larger intervals determined by a quadratic function.

Example: Hash function: hash(key) = (n + k^2 ) % tablesize .

Hash Table:

Initially, the hash table is empty, and we insert the keys one by one.

KeyHash Value (key % 5)Inserted Index
1010 % 5 = 00
2020 % 5 = 02 (after quadratic probing)
3030 % 5 = 04 (after quadratic probing)
2525 % 5 = 03 (after quadratic probing)
1515 % 5 = 01 (after quadratic probing)
  1. Insert Key 10:

    • hash(10) = 10 % 5 = 0

    • Slot 0 is empty, so 10 is inserted at index 0.

  2. Insert Key 20:

    • hash(20) = 20 % 5 = 0

    • Slot 0 is occupied by 10, so we use quadratic probing.

    • First probe: new_index = (0 + 1^2) % 5 = (0 + 1) % 5 = 1

    • Slot 1 is empty, so 20 is inserted at index 1.

  3. Insert Key 30:

    • hash(30) = 30 % 5 = 0

    • Slot 0 is occupied by 10, so we probe using quadratic probing.

    • First probe: new_index = (0 + 1^2) % 5 = (0 + 1) % 5 = 1

    • Slot 1 is occupied by 20, so we probe again.

    • Second probe: new_index = (0 + 2^2) % 5 = (0 + 4) % 5 = 4

    • Slot 4 is empty, so 30 is inserted at index 4.


Double Hashing

In double hashing, two hash functions are used to resolve collisions. If a collision occurs, a second hash function is applied to find the next slot. This method reduces clustering by making the jumps between slots more unpredictable.

Example: Hash functions:

  1. h1(key) = key % 5

  2. h2(key) = 1 + (key % 4) (the second hash function)

Hash table:

IndexValue
010
115
220
325
430

For key 15, h1(15) gives index 0, which is occupied. We apply the second hash function, h2(15) = 1 + (15 % 4) = 4, so the key is inserted at index 1.


Comparison of Methods

  • Separate Chaining: Simple, effective with dynamic data, but requires extra memory for linked lists.

  • Linear Probing: Simple but suffers from clustering, where many elements are stored close to each other.

  • Quadratic Probing: Avoids clustering to some extent, but requires careful handling to ensure all slots are reachable.

  • Double Hashing: More complex but generally provides good distribution and avoids clustering better than linear probing.

Each technique has its trade-offs and is suited for different use cases depending on the hash table size, expected number of collisions, and performance needs.

Hashing Techniques

There are two types of hashing techniques: static hashing and dynamic hashing.

  1. Static Hashing: The hash function and table size are fixed. It works well when the number of entries is stable, but suffers from overflow and wasted memory when entries exceed or fall below the table's capacity.

  2. Dynamic Hashing: The table size adjusts dynamically based on the load factor. When the load factor exceeds a threshold, the table is resized, and elements are rehashed, maintaining efficiency as entries grow or shrink.

Problems Where Hash Tables are Not Suitable

Despite their general efficiency, hash tables are not suitable for every scenario. Some cases where hash tables may not perform well or may not be the best choice include:

  1. Ordered Data Access: Hash tables don’t maintain order, so for sorted data retrieval (e.g., range queries), balanced trees like AVL or Red-Black trees are more appropriate.

  2. Memory Constraints: Hash tables can consume excessive memory to prevent collisions, making simpler structures like arrays or linked lists more space-efficient in memory-limited environments.

  3. Poorly Distributed Keys: If a hash function poorly distributes keys, performance can degrade to 𝑂 ( 𝑛 ) O(n), especially with patterned data like sequential integers.

  4. Real-Time Systems: Hash tables may not be suitable due to unpredictable insertion/search times, especially during rehashing, making arrays or heaps more reliable.

  5. Too Many Collisions: Excessive collisions reduce performance. In such cases, other data structures without collisions should be considered.

Conclusion

Hash tables offer fast O(1) average-case performance for insertions, deletions, and lookups. However, they require well-designed hash functions and collision resolution methods. For ordered data, memory constraints, or real-time systems, alternative structures like balanced trees or heaps may be more suitable. Understanding these trade-offs ensures efficient application design.