# This is how a (dumb) hash table works How beautiful is `{}`?

It lets you store values by key, and retrieve them in a very cost-efficient manner (`O(1)`, more on this later).

In this post I want to implement a very basic hash table, and have a look at its inner workings to explain one of the most ingenious ideas in computer science.

## The problem

Imagine you’re building a new programming language: you start by having pretty simple types (strings, integers, floats, …) and then proceed to implement very basic data structures — first you come up with the array (`[]`), then comes the hash table (otherwise known as dictionary, associative array, hashmap, map and…the list goes on).

Ever wondered how they work? How they’re so damn fast?

Well, let’s say that JavaScript did not have have `{}` or `new Map()`, and let’s implement our very own `DumbMap`!

## A note on complexity

Before we get the ball rolling, we need to understand how complexity of a function works: Wikipedia has a good refresher on computational complexity, but I’ll add a brief explanation for the lazy ones.

Complexity measures how many steps are required by our function — the fewer steps, the faster the execution (also known as “running time”).

Let’s a look at the following snippet:

The computational complexity (from now simply “complexity”) of `fn` is `O(1)`, meaning that it’s constant (you can read `O(1)` as “the cost is one”): no matter what arguments you pass, the platform that runs this code only has to do one operation (multiply `n` into `m`). Again, since it’s one operation, the cost is referred as `O(1)`.

Complexity is measured by assuming arguments of your function could have very large values. Let’s look at this example:

You would think its complexity is `O(3)`, right?

Again, since complexity is measured in the context of very large arguments, we tend to “drop” constants and consider `O(3)` the same as `O(1)`. So, even in this case, we would say that the complexity of `fn` is `O(1)`. No matter what the value of `n` and `m` are, you always end up doing 3 operations — which, again, is a constant cost (therefore `O(1)`).

Now this example is a little bit different:

As you see, we’re looping as many times as the value of `n`, which could be in the millions. In this case we define the complexity of this function as `O(n)`, as you will need to do as many operations as the value of one of your arguments.

Other examples?

This examples loops `2 * n` times, meaning the complexity should be `O(2n)`. Since we mentioned that constants are “ignored” when calculating the complexity of a function, this example is also classified as `O(n)`.

One more?

Here we are looping over `n` and looping again inside the main loop, meaning the complexity is “squared” (`n * n`): if `n` is 2, we will run `s.push(m)` 4 times, if 3 we will run it 9 times, and so on.

In this case, the complexity of the function is referred as `O(n²)`.

One last example?

In this case we don’t have nested loops, but we loop twice over two different arguments: the complexity is defined as `O(n+m)`. Crystal clear.

Now that you’ve just got a brief introduction (or refresher) on complexity, it’s very easy to understand that a function with complexity `O(1)` is going to perform much better than one with `O(n)`.

Hash tables have a `O(1)` complexity1: in layman’s terms, they’re superfast. Let’s move on.

## Let’s build a (dumb) hash table

Our hash table has 2 simple methods — `set(x, y)` and `get(x)`. Let’s start writing some code:

ans let’s implement a very simple, inefficient way to store these key-value pairs and retrieve them later on. We first start by storing them in an internal array (remember, we can’t use `{}` since we are implementing `{}` — mindblown!):

then it’s simply a matter of getting the right element from the list:

Our full example:

Our `DumbMap is amazing`! It works right out of the box, but how will it perform when we add a large amount of key-value pairs?

Let’s try a simple benchmark — we will first try to find a non-existing element in an hash table with very few elements, and then try the same in one with a large quantity of elements:

The results? Not so encouraging:

In our implementation, we need to loop through all the elements inside `this.list` in order to find one with the matching key. The cost is `O(n)`, and it’s quite terrible.

## Make it fast(er)

We need to find a way to avoid looping through our list: time to put hash back into the hash table.

Ever wondered why this data structure is called hash table? That’s because a hashing function is used on the keys that you set and get: we will use this function to turn our key into an integer `i`, and store our value at index `i` of our internal list. Since accessing an element, by its index, from a list has a constant cost (`O(1)`), then the hash table will also have a cost of `O(1)`.

Let’s try this out:

Here we are using the string-hash module which simply converts a string to a numeric hash, and use it to store and fetch elements at index `hash(key)` of our list. The results?

W – O – W. This is what I’m talking about!

We don’t have to loop through all elements in the list and retrieving elements from `DumbMap` is fast as hell!

Let me put this as straightforward as possible: hashing is what makes hash tables extremely efficient. No magic. Nothing more. Nada. Just a simple, clever, ingenious idea.

## The cost of picking the right hashing function

Of course, picking a fast hashing function is very important: if our `hash(key)` runs in a few seconds, our function will be quite slow regardless of its complexity.

At the same time, it’s very important to make sure that our hashing function doesn’t produce a lot of collisions, as they would be detrimental to the complexity of our hash table.

Confused? Let’s take a closer look at collisions.

## Collisions

You might think “Ah, a good hashing function never generates collisions!”: well, come back to the real world and think again. Google was able to produce collisions for the SHA-1 hashing algorithm, and it’s just a matter of time, or computational power, before a hashing function cracks and returns the same hash for 2 different inputs. Always assume your hashing function generates collisions and implement the right defense against such cases.

Case in point, let’s try to use a `hash()` function that generates a lot of collisions:

This function uses an array of 10 elements to store values, meaning that elements are likely to be replaced — a nasty bug in our `DumbMap`:

In order to resolve the issue, we can simply store multiple key-value pairs at the same index — let’s amend our hash table:

As you might notice, here we fall back to our original implementation: store a list of key-value pairs and loop through each of them, which is going to be quite slow when there are a lot of collisions for a particular index of the list.

Let’s benchmark this using our own `hash()` function that generates indexes from 1 to 10:

and by using the hash function from `string-hash`, which generates random indexes:

Whoa! There’s the cost of picking the right hashing function — fast enough that it doesn’t slow our execution down on its own, and good enough that it doesn’t produce a lot of collisions.

## Generally O(1)

Remember my words?

Hash tables have a `O(1)` complexity

Well, I lied: the complexity of an hash table depends on the hashing function you pick. The more collisions you generate, the more the complexity tends toward `O(n)`.

A hashing function such as:

would mean that our hash table has a complexity of `O(n)`.

This is why, in general, computational complexity has 3 measures: best, average and worst-case scenarios. Hash tables have a `O(1)` complexity in best and average case scenarios, but fall to `O(n)` in their worst-case scenario.

Remember: a good hashing function is the key to an efficient hash table — nothing more, nothing less.

## More on collisions…

The technique we used to fix `DumbMap` in case of collisions is called separate chaining: we store all the key-pairs that generate collisions in a list and loop through them.

Another popular technique is open addressing:

• at each index of our list we store one and one only key-value pair
• when trying to store a pair at index `x`, if there’s already a key-value pair, try to store our new pair at `x + 1`
• if `x + 1` is taken, try `x + 2` and so on…
• when retrieving an element, hash the key and see if the element at that position (`x`) matches our key
• if not, try to access the element at position `x + 1`
• rinse and repeat until you get to the end of the list, or when you find an empty index — that means our element is not in the hash table

Smart, simple, elegant and usually very efficient!

## FAQs (or TL;DR)

#### Does a hash table hash the values we’re storing?

No, keys are hashed so that they can be turned into an integer `i`, and both keys and values are stored at position `i` in a list.

#### Do the hashing functions used by hash tables generate collisions?

Absolutely — so hash tables are implemented with defense strategies to avoid nasty bugs.

#### Do hash tables use a list or a linked list internally?

It depends, both can work. In our examples, we use the JavaScript array (`[]`) that can be dynamically resized:

#### Why did you pick JavaScript for the examples? JS arrays ARE hash tables!

For example:

I know, damn JavaScript.

JavaScript is “universal” and probably the easiest language to understand when looking at some sample code. I agree JS might not be the best language, but I hope these examples are clear enough.

#### Is your example a really good implementation of an hash table? Is it really THAT simple?

No, not at all.

Have a look at “implementing a hash table in JavaScript” by Matt Zeunert, as it will give you a bit more context. There’s a lot more to learn, so I would also suggest you to also have a look at:

## In the end…

Hash tables are a very clever idea we use on a regular basis: no matter whether you create a dictionary in Python, an associative array in PHP or a Map in JavaScript — they all share the same concepts and beautifully work to let us store and retrieve element by an identifier, at a (most likely) constant cost.

A special thanks goes to Joe who helped me by reviewing this article.

Notes
1. I know, I know, I’m kinda lying here. Read until the end of the post ;-) Hi there! I recently wrote an ebook on web application security, currently sold on leanpub, the Amazon Kindle store and gumroad.

It contains 160+ pages of content dedicated to securing web applications and improving your security awareness when building web apps, with chapters ranging from explaining how to secure HTTP cookies with the right flags to understanding why it is important to consider joining a bug bounty program.

Feel free to skim through some of the free chapters published on this blog and, if the content seems interesting enough to you, grab a copy on leanpub, the Amazon Kindle store, gumroad or simply checkout right down below!

### In the mood for some more reading?

...or check the archives.