Organized Chaos — Sets and the Hash Table Factor
In Python, a set is a built-in data type that represents an unordered collection of unique elements. Sets are used to store multiple items in a single variable and do not allow duplicate elements.
The very nature of Sets is that they are “unordered”. When I first heard of this term, it didn’t seem like that big of a deal, but it is. To be unordered in data structures and programming refers to a lack of a specific or predictable sequence or arrangement. Meaning that the elements within a set do not follow any particular order, and there is no guarantee about the sequence in which the elements will appear when you iterate over or access them. Put succinctly, they don’t have indices. Unlike other data structures in Python like tuples or lists that do pertain to ‘order’, sets are one of a kind.
But why should the structure of a set be unique?
Sets are unordered by design simply because they don’t have to be ordered. The core idea of a set is to allow only unique values. This rudimentary idea requires only operations that can quickly access a value but don’t require methods to maintain a specific sequence. Indices store elements by arranging them in a line: 1, 2, 3,… but when the arrangement doesn’t matter, why have it at all? It’s as futile as going to the gym on your rest day. Let’s illustrate this concept with an example:
Imagine scrolling through TV channels without any particular order — it would be incredibly frustrating to find the channel you want. That’s why TV channels, much like tuples and lists, are carefully ordered, making it easy to navigate through them sequentially one by one. The time complexity in this case is O(n). Now consider Instagram followers. You don’t need them listed alphabetically; you can instantly access a follower’s profile just by searching for their username. That’s almost exactly how sets work — you can’t have two profiles with the same username, and a simple search of time O(1) gets you the profile you want. The search is where hash maps come into play.
The notion of position in a set is abstracted away from physical memory addresses and is instead managed through the hash table’s interval logic.
As hash tables prioritize fast access and storage efficiency over preserving sequence, they are the best fit for a set like a match made in computational heaven! Different data structures require different methods to handle them, and a hash table is a set’s ideal underlying structure. While sets are inherently unordered and may seem chaotic, hash tables are exceptional at organizing them.
The Hash Table Factor
Hash tables ensure every element of a set finds its place and can be found from its place without fuss. To see this in action, a hash map works sort of like a dictionary—it maps keys to index values. The only difference being, a hash table implements something called a hash function. Let’s take another example, this time an actual set of names:
set_names = {"John", "Kate", "Ariana", "Bob"}
Now that we have a names set, let’s introduce a hash table that has 8 slots (indices ranging from 0 to 7). The names in our set: "John," "Kate," "Ariana," and “Bob” are used as the keys in our hash structure. We can call them k1, k2, k3, and k4 for our reference. There are 3 steps to how hash tables work:
- The key is passed as an input to the hash function.
- The hash function converts the input key to a hash code—typically an integer.
- This hash code is then converted into the index number of the hash table.
This index is then used to determine where the key is stored in the hash table. To illustrate this in our case:
- ‘John’ (k1) is passed as an input to the hash function
- The hash function converts the string ‘John’ into a hash code say, 12345.
- To fit this hash code within the available slots in our table, the function applies a formula ;
12345 % 8
resulting in an index of 2.
‘John’ is stored at index 2 in the hash table. Whenever “John” is searched for, the hash function will again produce the hash code 12345, which will again map to index 2.
Why can’t we directly convert a string to an index without a hash code?
- keys in hash tables can be of complex data types such as strings, tuples or custom objects whose orientation is not of a natural integer representation that can be used directly as an index. Converting all types into an integer first seems the cleaner and more sensible way to go about it. Diverse key types just cannot be uniformly or directly mapped to a specific index in an array without some kind of revamp.
- Even within a single type, keys can vary widely in length and structure. Strings of different lengths or numeric values with vastly different ranges would need to be converted into a consistent and finite range of indices, which is not possible without hashing.
Note that I mentioned a finite range of indices here, but there is no such limitation to our input. Generally, we can have an infinite range of inputs, but there are limited slots available in a hash table. So how does python deal with this? Imagine trying to accommodate 10 people in a 5-bedroom flat for a day. Originally, we would want each person to have their own bedroom. But the number of people are greater than that of the number of bedrooms. So we do the next best thing — put 2 people in one bedroom. This way, we fit everyone. Hash Tables work with keys the same way when storing the values of a set. If your set has 200 keys, and the table has 50 slots, some keys share the same space; they have the same index number. That’s what we call a hash collision.
In my next article, I’ll explore how these collisions occur!