Python dictionary, the implementation

written by Shipeng Feng on 2017-03-29

Dictionary is a really useful data type built into Python, basically it is a number of objects that are indexed by keys, the key here must be hashable. Here is one simple dictionary usage:

>>> d = {'fengsp': 10, 'amy': 12}
>>> d['fengsp']
>>> del d['fengsp']
>>> d.keys()

We retrieve the value a lot of times, the retrieval of one object by the key must be a very fast operation. For the CPython itself, several language features are supported with the help of dictionaries, for example, class instances use a dictionary to store attributes, the performance of the dictionary is essential.

The PyDictObject

Inside the CPython, dictionary is a C structure, PyDictObject:

struct PyDictObject {
    Py_ssize_t ma_fill;  /* # Active + # Dummy */
    Py_ssize_t ma_used;  /* # Active */
    Py_ssize_t ma_mask;

    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

The fields in the data structure are:


Number of active and dummy entries.  If you delete a key, the entry will become
a dummy entry and ma_fill remains the same, if you add a new key and the new
key doesn't occupy a dummy entry, this is increased by 1.


Number of active entries.  If you add a new key, it is increased by 1, if you
delete a key, it is decreased by 1.


Bitmask of the hash table, the table contains ma_mask + 1 slots.  We store the
mask instead of the size because when we are looking up the entry for a key,
`slot = key_hash & mask` is used to get the slot index.


An array of `PyDictEntry` structures, the `PyDictEntry` contains the key object,
the value object, and the key's hash, the key's hash is stored as a cache, for 
example, when we are searching for a key, we can use the cached hash to perform
a fast comparison.


A pointer to the function used to look up keys.  Initially this is set to
`lookdict_string`, `lookdict_string` assumes that the dictionary keys are all
`PyStringObject`, this is an optimization so that looking up a key can be much
faster for these `StringDictObject`.  If a key is not a `PyStringObject`, the
`ma_lookup` is changed to a general lookup function which is slower.


An eight slot hash table.  Small dictionaries can be stored here and no `malloc()`
would happen.


Two keys could hash to the same slot, that's called a collision. When a collision happens, Python use open addressing to solve it: if the slot doesn't contain the key, find another slot, for example, here is one simple approach, if slot i doesn't contain the key, try i+1, i+2, and so on. For every hash, we now have a defined list of slots that could contain it, if we delete one of the keys from it, our list would be broken, that's why we need a dummy entry here. The simple open addressing linear algorithm could end up with linear pile-up, that causes poor performance because we need to scan all these slots everytime we look up a key, in real life, CPython use the following algorithm:

DUMMY = 'dummy'

# I am not clever enough to understand how the algorithm is came up with,
# this eventually covers every integer between 0 and ma_mask.
def open_addressing_in_cpython(table, key, hash):
    free_slot = None
    perturb = hash
    i = slot_index = hash & ma_mask
    while table[slot_index] is not None and table[slot_index].key != key:
        if table[slot_index].key is DUMMY and free_slot is None:
            free_slot = slot_index
        i = (5 * i + perturb + 1)
        slot_index = i & ma_mask
        perturb >>= 5
    if table[slot_index] is None and free_slot is not None:
        return free_slot
    return slot_index

Hash Table Size

If we keep adding keys to the dictionary, there won't be enough space to hold all the keys, now we need to resize the hash table. The CPython would check for the table size everytime we add a key, if the table is two-thirds full (ma_fill), CPython would resize the hash table. If a dictionary has 50000 keys or fewer, the new size is ma_used * 4, otherwise, it is ma_used * 2. The table won't be resized if we delete a lot of keys from the dictionary, this means your hash table size is not going to be smaller, this isn't a big issue because mostly we use the dictionary for a while and then we discard the whole dictionary. If you build a really large dictionary and then delete many keys from it, you should build a new one with the remaining keys.

Free List

Many dictionary instances are created and destroyed frequently, in order to reduce the number of creation and destruction, a free_dicts array is used to hold dictionary objects that are not in use anymore, it is simply one cache. If we need a PyDictObject, it would be taken from the free list if available.

Key-Sharing And Ordered Dict

The dictionary uses more memory than is necessary when used as an object attributes container as the keys are the same and they are replicated for each instance. Since Python 3.6, attribute dictionaries share keys with other attribute dictionaries of instances of the same class, for example if you have a class like this:

class User(object):
    def __init__(self, username, email):
        self.username = username = email

The attribute dictionary would be stored like this:

# this is shared between attribute dictionaries
# this is also ordered
keys = [
    (5317300778844242624, 'username'),
    (268341141884068675, 'email'),
# this hash table just stores the index to the key entries
# it is compact
index_table = [0, None, None, 1, None, None, None, None]

# this is values
values = ['user01', '']

My Thoughts

The CPython Dictionary implementation is very straightforward and easy to understand. The code is beautiful, and it performs well, all variables are chosen by experiments, for example, the size of free_dicts and the size of ma_smalltable. I love it.