Sunday, 13 March 2011

PersistentTreeMap in Python

Abstract

Hereinafter I will discuss an implementation of a PersistentTreeMap in Python. PersistentTreeMaps are a language feature of the Clojure programming language and the implementation thus presented ports this feature to Python. It is implemented in pure Python code and is thus not feasible for very large sets of data (it is doubtful that persistent datastructures are feasible at all for very large sets of data).

A word about persistence

So what's all the fuss about persistence about, anyway? A persistent datastructure is an immutable datastructure that (the clue is in the following clause) allows for the user to create many copies; the key to doing this is to make the copies cheap by sharing most of the data between the original and the copy.

Persistent data-structures are of enormous value in multi-threaded programming because data-structures can be assumed not to change their value, this assuring that no other thread may interfere with one's calculation.

Persistence in Python

The only thing that may actually change is the state of the key or value objects stored in the mapping (in fact, the key object should not be mutable as it must implement __hash__ and therefore should not be mutable).

As Python objects are essentially references, the data-structure has no power to prevent data stored in it from being changed (which it need not, because the cheap-copy of the datastructure actually points to the same object and thus is not changed). Thus, looking up a member always returns the same object, but that object's value may have changed with time; this is a restriction inherent to Python.

Concept of persistent HashTreeMaps

A tree structure consists of nodes which have at most as many children as the arity of the tree is; a node without any children is referred to as a leaf of the tree.

A binary tree (i.e. a tree in which each node that is not a leafnode has two children) will serve as an example to convey the concept of a HashTreeMap; a binary tree has several advantages for explanation purposes, most importantly the conciseness of its graphical illustration. Moreover, hashes of objects are assumed to be 3 bits long in this example, which is a quite unrealistic length in the real-world. An implementation with better performance characteristica and premisses that stand in the real-world will be explored afterwards.

There are three types of leaf-nodes: The AssocNode which stores a key to value association, a HashCollisionNode which contains a list of AssocNodes with colliding hashes (hashes which are equal while they key is not) and the NULLNODE which is a dummy node for dead-end leaves. Because there is no state associated to it, only one instance of NULLNODE is required and only this instance may be used for performance reasons.

Before continuing in our example, I need to elaborate the interface each node must export:

  • assoc(hsh, shift, node)
  • without(hsh, shift, key)
  • get(hsh, shift, key)

assoc returns a new tree with the mapping contained in the AssocNode node added (please ignore the additional arguments for the time being, their merits will be discussed later). without returns a new tree with the mapping with the key removed, it will throw a KeyError if the key is not present. get will return the value associated to key or, if no such association exists, raise a KeyError.

When a HashTreeMap is created, the tree only contains one node, being the NULLNODE. Let the key of the association we want to create be "hello", its hash 0b010 and the value "world". The hash of the key is passed to assoc as hsh, an AssocNode containing the association as node and 0 as shift. shift refers to the level of the tree the node whose assoc (or without or get for that matter). The new tree returned now contains the association of "hello" to "world", the old tree is left ontouched.

The insertion of the second association is a bit more tricky. Suppose you want to add an assocation from the key "spam" with the hash 0b001 to the value "eggs". We now need to create a branching point at the 0th level of the tree. The 0th branch points to the AssocNode containing the association from "hello" to "world", whereas the other one points to the one associating "spam" with "eggs". To understand this, consider the lowest bit of the hash the index of the branch to take at the first branching point.

Two AssocNodes.

Now let us consider one more association. Let the key be "foo", its hash 0b011 and the value "bar". The AssocNode that is the leaf of the 1-branch of the branching point on the 0th level now needs to be replaced by another branching point that contains the AssocNode for "hello" on the 0-branch and the AssocNode for "foo" on the 1-branch.

Three AssocNodes.

Lookup is really straightforward. The tree is walked until a leaf is reached, taking the branch represented by the value of the respective bit at every branching point; this means, the branch to be taken on the 0th level is decided by the 0th bit, the branch to be taken on the first level by the next bit, and so forth, until a leaf is eventually reached. If the leaf is a AssocNode, it is checked whether its key equals the one requested, if it is so, the value is returned, otherwise a KeyError is thrown. If the leaf node is a HashCollisionMode, the key of every of the colliding AssocNodes is checked against the requested key, and, if one of them matches, its value returned; if the leaf is a NullNode (which is only possible for empty trees with a binary tree), a KeyError is raised.

Lookup.

Removing an item (done by calling the without method) is done by walking the tree to the item that needs to be removed, and then replacing that with a NullNode (additionally removing any branching points rendered redundant by removing the item).

In the next post, I will elaborate the implementation of a tree maps that is feasible for real-life scenarios by being implemented as a 32-ary tree.

3 comments:

Anonymous said...

Looks like a good overview. Any actual code?

Florian Mayer said...

The actual implementation will be covered in a follow-up; if you wish to understand the source code on your own until then: https://github.com/segfaulthunter/sandbox/blob/master/pdict.py

Anonymous said...

Florian: Great, thanks!