Dictionary is a really useful data type built into Python, basically it is a number of objects that are indexed by keys, the key here must be hashable. Here is one simple dictionary usage:
>>> d = {'fengsp': 10, 'amy': 12}
>>> d['fengsp']
10
>>> del d['fengsp']
>>> d.keys()
['amy']
We retrieve the value a lot of times, the retrieval of one object by the key must be a very fast operation. For the CPython itself, several language features are supported with the help of dictionaries, for example, class instances use a dictionary to store attributes, the performance of the dictionary is essential.
Inside the CPython, dictionary is a C structure, PyDictObject
:
struct PyDictObject {
PyObject_HEAD
Py_ssize_t ma_fill; /* # Active + # Dummy */
Py_ssize_t ma_used; /* # Active */
Py_ssize_t ma_mask;
PyDictEntry *ma_table;
PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
PyDictEntry ma_smalltable[PyDict_MINSIZE];
};
typedef struct {
Py_ssize_t me_hash;
PyObject *me_key;
PyObject *me_value;
} PyDictEntry;
The fields in the data structure are:
ma_fill
Number of active and dummy entries. If you delete a key, the entry will become
a dummy entry and ma_fill remains the same, if you add a new key and the new
key doesn't occupy a dummy entry, this is increased by 1.
ma_used
Number of active entries. If you add a new key, it is increased by 1, if you
delete a key, it is decreased by 1.
ma_mask
Bitmask of the hash table, the table contains ma_mask + 1 slots. We store the
mask instead of the size because when we are looking up the entry for a key,
`slot = key_hash & mask` is used to get the slot index.
ma_table
An array of `PyDictEntry` structures, the `PyDictEntry` contains the key object,
the value object, and the key's hash, the key's hash is stored as a cache, for
example, when we are searching for a key, we can use the cached hash to perform
a fast comparison.
ma_lookup
A pointer to the function used to look up keys. Initially this is set to
`lookdict_string`, `lookdict_string` assumes that the dictionary keys are all
`PyStringObject`, this is an optimization so that looking up a key can be much
faster for these `StringDictObject`. If a key is not a `PyStringObject`, the
`ma_lookup` is changed to a general lookup function which is slower.
ma_smalltable
An eight slot hash table. Small dictionaries can be stored here and no `malloc()`
would happen.
Two keys could hash to the same slot, that's called a collision. When a collision happens, Python use open addressing to solve it: if the slot doesn't contain the key, find another slot, for example, here is one simple approach, if slot i doesn't contain the key, try i+1, i+2, and so on. For every hash, we now have a defined list of slots that could contain it, if we delete one of the keys from it, our list would be broken, that's why we need a dummy entry here. The simple open addressing linear algorithm could end up with linear pile-up, that causes poor performance because we need to scan all these slots everytime we look up a key, in real life, CPython use the following algorithm:
DUMMY = 'dummy'
# I am not clever enough to understand how the algorithm is came up with,
# this eventually covers every integer between 0 and ma_mask.
def open_addressing_in_cpython(table, key, hash):
free_slot = None
perturb = hash
i = slot_index = hash & ma_mask
while table[slot_index] is not None and table[slot_index].key != key:
if table[slot_index].key is DUMMY and free_slot is None:
free_slot = slot_index
i = (5 * i + perturb + 1)
slot_index = i & ma_mask
perturb >>= 5
if table[slot_index] is None and free_slot is not None:
return free_slot
return slot_index
If we keep adding keys to the dictionary, there won't be enough space to hold all
the keys, now we need to resize the hash table. The CPython would check for the table
size everytime we add a key, if the table is two-thirds full (ma_fill
), CPython would
resize the hash table. If a dictionary has 50000 keys or fewer, the new size is
ma_used * 4
, otherwise, it is ma_used * 2
. The table won't be resized if we delete
a lot of keys from the dictionary, this means your hash table size is not going to be
smaller, this isn't a big issue because mostly we use the dictionary for a while and then
we discard the whole dictionary. If you build a really large dictionary and then delete
many keys from it, you should build a new one with the remaining keys.
Many dictionary instances are created and destroyed frequently, in order to reduce
the number of creation and destruction, a free_dicts
array is used to hold dictionary
objects that are not in use anymore, it is simply one cache. If we need a PyDictObject
,
it would be taken from the free list if available.
The dictionary uses more memory than is necessary when used as an object attributes container as the keys are the same and they are replicated for each instance. Since Python 3.6, attribute dictionaries share keys with other attribute dictionaries of instances of the same class, for example if you have a class like this:
class User(object):
def __init__(self, username, email):
self.username = username
self.email = email
The attribute dictionary would be stored like this:
# this is shared between attribute dictionaries
# this is also ordered
keys = [
(5317300778844242624, 'username'),
(268341141884068675, 'email'),
]
# this hash table just stores the index to the key entries
# it is compact
index_table = [0, None, None, 1, None, None, None, None]
# this is values
values = ['user01', 'user01@example.com']
The CPython Dictionary implementation is very straightforward and easy to understand.
The code is beautiful, and it performs well, all variables are chosen by experiments,
for example, the size of free_dicts
and the size of ma_smalltable
. I love it.