How beautiful is `{}`? It lets you store values by key, and retrieve them in a very cost-efficient manner (`O(1)`, more on this later).
In this post I want to implement a very basic hash table, and have a look at its inner workings to explain one of the most ingenious ideas in computer science.
The problem
Imagine you’re building a new programming language: you start by having pretty simple types (strings, integers, floats, …) and then proceed to implement very basic data structures — first you come up with the array (`[]`), then comes the hash table (otherwise known as dictionary, associative array, hashmap, map and… the list goes on).

Ever wondered how they work? How they’re so damn fast?

Well, let’s say that JavaScript did not have `{}` or `new Map()`, and let’s implement our very own `DumbMap`!
A note on complexity
Before we get the ball rolling, we need to understand how the complexity of a function works: Wikipedia has a good refresher on computational complexity, but I’ll add a brief explanation for the lazy ones.

Complexity measures how many steps are required by our function — the fewer the steps, the faster the execution (also known as “running time”).

Let’s take a look at the following snippet:
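```js
function fn(n, m) {
  return n * m
}
```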
The computational complexity (from now on, simply “complexity”) of `fn` is `O(1)`, meaning that it’s constant (you can read `O(1)` as “the cost is one”): no matter what arguments you pass, the platform that runs this code only has to do one operation (multiply `n` by `m`). Again, since it’s one operation, the cost is referred to as `O(1)`.
Complexity is measured by assuming arguments of your function could have very large values. Let’s look at this example:
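```js
function fn(n, m) {
  let total = 0 // operation 1

  total += n    // operation 2
  total += m    // operation 3

  return total
}
```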
You would think its complexity is `O(3)`, right? Again, since complexity is measured in the context of very large arguments, we tend to “drop” constants and consider `O(3)` the same as `O(1)`. So, even in this case, we would say that the complexity of `fn` is `O(1)`. No matter what the values of `n` and `m` are, you always end up doing 3 operations — which, again, is a constant cost (therefore `O(1)`).
Now this example is a little bit different:
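```js
function fn(n) {
  let s = []

  // one operation for each of the n iterations
  for (let i = 0; i < n; i++) {
    s.push(i)
  }

  return s
}
```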
As you see, we’re looping as many times as the value of `n`, which could be in the millions. In this case we define the complexity of this function as `O(n)`, as you will need to do as many operations as the value of one of your arguments.
Other examples?
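```js
function fn(n) {
  let s = []

  // the loop body runs 2 * n times
  for (let i = 0; i < 2 * n; i++) {
    s.push(i)
  }

  return s
}
```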
This example loops `2 * n` times, meaning the complexity should be `O(2n)`. Since we mentioned that constants are “ignored” when calculating the complexity of a function, this example is also classified as `O(n)`.
One more?
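```js
function fn(n) {
  let s = []

  for (let i = 0; i < n; i++) {
    // for each of the n iterations, loop n more times
    for (let m = 0; m < n; m++) {
      s.push(m)
    }
  }

  return s
}
```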
Here we are looping over `n` and looping again inside the main loop, meaning the complexity is “squared” (`n * n`): if `n` is 2, we will run `s.push(m)` 4 times; if 3, we will run it 9 times, and so on. In this case, the complexity of the function is referred to as `O(n²)`.
One last example?
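```js
function fn(n, m) {
  let s = []

  // n operations...
  for (let i = 0; i < n; i++) {
    s.push(i)
  }

  // ...followed by m operations
  for (let i = 0; i < m; i++) {
    s.push(i)
  }

  return s
}
```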
In this case we don’t have nested loops, but we loop twice over two different arguments: the complexity is defined as `O(n+m)`. Crystal clear.
Now that you’ve just got a brief introduction (or refresher) on complexity, it’s very easy to understand that a function with complexity `O(1)` is going to perform much better than one with `O(n)`.

Hash tables have an `O(1)` complexity¹: in layman’s terms, they’re superfast. Let’s move on.
Let’s build a (dumb) hash table
Our hash table has 2 simple methods — `set(x, y)` and `get(x)`. Let’s start writing some code:
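```js
class DumbMap {
  constructor() {
    this.list = []
  }

  set(x, y) {
    // to be implemented
  }

  get(x) {
    // to be implemented
  }
}
```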
and let’s implement a very simple, inefficient way to store these key-value pairs and retrieve them later on. We first start by storing them in an internal array (remember, we can’t use `{}` since we are implementing `{}` — mindblown!):
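```js
set(x, y) {
  // simply append the key-value pair to the internal list
  this.list.push({
    key: x,
    value: y,
  })
}
```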
then it’s simply a matter of getting the right element from the list:
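```js
get(x) {
  // scan the whole list until we find a matching key
  let found = this.list.find(pair => pair.key === x)

  return found ? found.value : undefined
}
```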
Our full example:
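```js
class DumbMap {
  constructor() {
    this.list = []
  }

  set(x, y) {
    this.list.push({
      key: x,
      value: y,
    })
  }

  get(x) {
    let found = this.list.find(pair => pair.key === x)

    return found ? found.value : undefined
  }
}

let map = new DumbMap()
map.set('dumb', 'dumbest')

console.log(map.get('dumb'))      // 'dumbest'
console.log(map.get('not-there')) // undefined
```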
Our `DumbMap` is amazing! It works right out of the box, but how will it perform when we add a large amount of key-value pairs?
Let’s try a simple benchmark — we will first try to find a non-existing element in a hash table with very few elements, and then try the same in one with a large quantity of elements.
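Something along these lines should do the trick (the `populate` helper is just a convenience made up for this sketch):

```js
// populate a DumbMap with n key-value pairs
function populate(n) {
  let map = new DumbMap()

  for (let i = 0; i < n; i++) {
    map.set(`key-${i}`, i)
  }

  return map
}

let smallMap = populate(3)
let largeMap = populate(1000000)

console.time('with 3 records')
smallMap.get('non-existing-key')
console.timeEnd('with 3 records')

console.time('with 1,000,000 records')
largeMap.get('non-existing-key')
console.timeEnd('with 1,000,000 records')
```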
The results? Not so encouraging: finding the non-existing key in the map with a million records takes far, far longer than in the one with just 3. In our implementation, we need to loop through all the elements inside `this.list` in order to find the one with the matching key. The cost is `O(n)`, and it’s quite terrible.
Make it fast(er)
We need to find a way to avoid looping through our list: time to put the hash back into the hash table.
Ever wondered why this data structure is called a hash table? That’s because a hashing function is used on the keys that you set and get: we will use this function to turn our key into an integer `i`, and store our value at index `i` of our internal list. Since accessing an element of a list by its index has a constant cost (`O(1)`), the hash table will also have a cost of `O(1)`.
Let’s try this out:
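```js
const hash = require('string-hash')

class DumbMap {
  constructor() {
    this.list = []
  }

  set(x, y) {
    // store the value at the index produced by hashing the key
    this.list[hash(x)] = y
  }

  get(x) {
    return this.list[hash(x)]
  }
}
```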
Here we are using the `string-hash` module, which simply converts a string to a numeric hash, and we use it to store and fetch elements at index `hash(key)` of our list. The results? Lookups now take roughly the same, tiny amount of time whether the map holds 3 records or a million.

W – O – W. This is what I’m talking about! We don’t have to loop through all the elements in the list, and retrieving elements from `DumbMap` is fast as hell!
Let me put this as plainly as possible: hashing is what makes hash tables extremely efficient. No magic. Nothing more. Nada. Just a simple, clever, ingenious idea.
The cost of picking the right hashing function
Of course, picking a fast hashing function is very important: if our `hash(key)` runs in a few seconds, our function will be quite slow regardless of its complexity.
At the same time, it’s very important to make sure that our hashing function doesn’t produce a lot of collisions, as they would be detrimental to the complexity of our hash table.
Confused? Let’s take a closer look at collisions.
Collisions
You might think “Ah, a good hashing function never generates collisions!”: well, come back to the real world and think again. Google was able to produce collisions for the SHA-1 hashing algorithm, and it’s just a matter of time, or computational power, before a hashing function is cracked and returns the same hash for 2 different inputs. Always assume your hashing function generates collisions and implement the right defense against such cases.
Case in point, let’s try to use a `hash()` function that generates a lot of collisions.
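For instance, one that can only ever return 10 different indexes (the exact function is a sketch, but the idea is the same):

```js
// a terrible hashing function: no matter the key,
// it can only return 10 different indexes (0 to 9)
const hash = (key) => {
  let h = 0

  for (let char of String(key)) {
    h += char.charCodeAt(0)
  }

  return h % 10
}
```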
This function uses an array of 10 elements to store values, meaning that elements are likely to be replaced — a nasty bug in our `DumbMap`.
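For example, with the hash above the keys `'a'` (char code 97) and `'k'` (char code 107) both land on index 7:

```js
let map = new DumbMap()

// 'a' and 'k' collide: both hash to index 7,
// so the second set overwrites the first value
map.set('a', 'the first value')
map.set('k', 'the second value')

console.log(map.get('a')) // 'the second value' -- ouch!
```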
In order to resolve the issue, we can simply store multiple key-value pairs at the same index — let’s amend our hash table:
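```js
const hash = require('string-hash')

class DumbMap {
  constructor() {
    this.list = []
  }

  set(x, y) {
    let index = hash(x)

    // store a list of key-value pairs at each index,
    // so that collisions don't overwrite anything
    if (!this.list[index]) {
      this.list[index] = []
    }

    this.list[index].push({
      key: x,
      value: y,
    })
  }

  get(x) {
    let index = hash(x)

    if (!this.list[index]) {
      return undefined
    }

    // loop through the pairs stored at this index
    // until we find the one with the matching key
    let found = this.list[index].find(pair => pair.key === x)

    return found ? found.value : undefined
  }
}
```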
As you might notice, here we fall back to our original implementation: store a list of key-value pairs and loop through each of them, which is going to be quite slow when there are a lot of collisions for a particular index of the list.
Let’s benchmark this using our own `hash()` function, the one that only generates indexes from 0 to 9: with a million records crammed into just 10 indexes, the lookup becomes almost as slow as our original, loop-through-everything implementation.
And by using the hash function from `string-hash`, which spreads keys across a huge range of indexes, lookups stay blazingly fast even with a million records in the map.
Whoa! There’s the cost of picking the right hashing function — fast enough that it doesn’t slow our execution down on its own, and good enough that it doesn’t produce a lot of collisions.
Generally O(1)
Remember my words?
> Hash tables have an `O(1)` complexity
Well, I lied: the complexity of a hash table depends on the hashing function you pick. The more collisions you generate, the more the complexity tends toward `O(n)`.
A hashing function such as:
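```js
function hash(key) {
  return 1 // every key ends up at the very same index
}
```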
would mean that our hash table has a complexity of `O(n)`, since every lookup degenerates into a scan of a single list.

This is why, in general, computational complexity has 3 measures: best, average and worst-case scenarios. Hash tables have an `O(1)` complexity in the best and average case scenarios, but fall to `O(n)` in their worst-case scenario.
Remember: a good hashing function is the key to an efficient hash table — nothing more, nothing less.
More on collisions…
The technique we used to fix `DumbMap` in case of collisions is called separate chaining: we store all the key-value pairs that generate collisions in a list, and loop through them.
Another popular technique is open addressing, sketched in code right after this list:

- at each index of our list we store one and only one key-value pair
- when trying to store a pair at index `x`, if there’s already a key-value pair there, try to store our new pair at `x + 1`
- if `x + 1` is taken, try `x + 2` and so on…
- when retrieving an element, hash the key and see if the element at that position (`x`) matches our key
- if not, try to access the element at position `x + 1`
- rinse and repeat until you get to the end of the list, or until you find an empty index — that means our element is not in the hash table

Smart, simple, elegant and usually very efficient!
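Here is a minimal sketch of that idea: a toy `OpenAddressingMap` (the name is made up for this example), leaning on `string-hash` again. A production implementation would also need to handle resizing and deletions carefully.

```js
const hash = require('string-hash')

class OpenAddressingMap {
  constructor() {
    this.list = []
  }

  set(x, y) {
    let index = hash(x)

    // walk forward until we find an empty slot,
    // or the slot that already holds our key
    while (this.list[index] !== undefined && this.list[index].key !== x) {
      index += 1
    }

    this.list[index] = { key: x, value: y }
  }

  get(x) {
    let index = hash(x)

    // probe forward until we hit an empty slot:
    // if we do, the key is not in the map
    while (this.list[index] !== undefined) {
      if (this.list[index].key === x) {
        return this.list[index].value
      }

      index += 1
    }

    return undefined
  }
}
```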
FAQs (or TL;DR)
Does a hash table hash the values we’re storing?
No, keys are hashed so that they can be turned into an integer `i`, and both keys and values are stored at position `i` in a list.
Do the hashing functions used by hash tables generate collisions?
Absolutely — so hash tables are implemented with defense strategies to avoid nasty bugs.
Do hash tables use a list or a linked list internally?
It depends; both can work.
In our examples, we use the JavaScript array (`[]`), which can be dynamically resized:
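```js
let list = []
console.log(list.length) // 0

list[3] = 'whatever'
console.log(list.length) // 4
```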
Why did you pick JavaScript for the examples? JS arrays ARE hash tables!
For example:
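```js
let array = []
array['whoops'] = 'I am not really an array index'

console.log(array.length)    // 0
console.log(array['whoops']) // 'I am not really an array index'
```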
I know, damn JavaScript.
JavaScript is “universal” and probably the easiest language to understand when looking at some sample code. I agree JS might not be the best language, but I hope these examples are clear enough.
Is your example a really good implementation of a hash table? Is it really THAT simple?
No, not at all.
Have a look at “implementing a hash table in JavaScript” by Matt Zeunert, as it will give you a bit more context. There’s a lot more to learn, so I would suggest you also have a look at:
- Paul Kube’s course on hash tables
- Implementing our Own Hash Table with Separate Chaining in Java
- Algorithms, 4th Edition – Hash tables
- Designing a fast hash table
In the end…
Hash tables are a very clever idea we use on a regular basis: no matter whether you create a dictionary in Python, an associative array in PHP or a Map in JavaScript — they all share the same concepts, and work beautifully to let us store and retrieve elements by an identifier, at a (most likely) constant cost.
Hope you enjoyed this article, and feel free to share your feedback with me.
A special thanks goes to Joe who helped me by reviewing this article.
Adios!
1. I know, I know, I’m kinda lying here. Read until the end of the post ;-) ↩