Monday, 11 April 2011

burrahobbit

burrahobbit, a project offering persistent data-structures seen in Clojure in Python, has just had its first release! See http://segfaulthunter.github.com/burrahobbit/ for details. It's based on the pdict.py script in my sandbox but now differs considerably from it by having a comprehensive test-suite and a lot of new features (including but not limited to sets and the binary operators). I highly appreciate any feedback, be it in the form a comment under this post, a bug report or feature request in the GitHub issue tracker or an email sent to my address which can be found in the box to the right.

Sunday, 13 March 2011

PersistentTreeMap in Python

Abstract

Hereinafter I will discuss an implementation of a PersistentTreeMap in Python. PersistentTreeMaps are a language feature of the Clojure programming language and the implementation thus presented ports this feature to Python. It is implemented in pure Python code and is thus not feasible for very large sets of data (it is doubtful that persistent datastructures are feasible at all for very large sets of data).

A word about persistence

So what's all the fuss about persistence about, anyway? A persistent datastructure is an immutable datastructure that (the clue is in the following clause) allows for the user to create many copies; the key to doing this is to make the copies cheap by sharing most of the data between the original and the copy.

Persistent data-structures are of enormous value in multi-threaded programming because data-structures can be assumed not to change their value, this assuring that no other thread may interfere with one's calculation.

Persistence in Python

The only thing that may actually change is the state of the key or value objects stored in the mapping (in fact, the key object should not be mutable as it must implement __hash__ and therefore should not be mutable).

As Python objects are essentially references, the data-structure has no power to prevent data stored in it from being changed (which it need not, because the cheap-copy of the datastructure actually points to the same object and thus is not changed). Thus, looking up a member always returns the same object, but that object's value may have changed with time; this is a restriction inherent to Python.

Concept of persistent HashTreeMaps

A tree structure consists of nodes which have at most as many children as the arity of the tree is; a node without any children is referred to as a leaf of the tree.

A binary tree (i.e. a tree in which each node that is not a leafnode has two children) will serve as an example to convey the concept of a HashTreeMap; a binary tree has several advantages for explanation purposes, most importantly the conciseness of its graphical illustration. Moreover, hashes of objects are assumed to be 3 bits long in this example, which is a quite unrealistic length in the real-world. An implementation with better performance characteristica and premisses that stand in the real-world will be explored afterwards.

There are three types of leaf-nodes: The AssocNode which stores a key to value association, a HashCollisionNode which contains a list of AssocNodes with colliding hashes (hashes which are equal while they key is not) and the NULLNODE which is a dummy node for dead-end leaves. Because there is no state associated to it, only one instance of NULLNODE is required and only this instance may be used for performance reasons.

Before continuing in our example, I need to elaborate the interface each node must export:

  • assoc(hsh, shift, node)
  • without(hsh, shift, key)
  • get(hsh, shift, key)

assoc returns a new tree with the mapping contained in the AssocNode node added (please ignore the additional arguments for the time being, their merits will be discussed later). without returns a new tree with the mapping with the key removed, it will throw a KeyError if the key is not present. get will return the value associated to key or, if no such association exists, raise a KeyError.

When a HashTreeMap is created, the tree only contains one node, being the NULLNODE. Let the key of the association we want to create be "hello", its hash 0b010 and the value "world". The hash of the key is passed to assoc as hsh, an AssocNode containing the association as node and 0 as shift. shift refers to the level of the tree the node whose assoc (or without or get for that matter). The new tree returned now contains the association of "hello" to "world", the old tree is left ontouched.

The insertion of the second association is a bit more tricky. Suppose you want to add an assocation from the key "spam" with the hash 0b001 to the value "eggs". We now need to create a branching point at the 0th level of the tree. The 0th branch points to the AssocNode containing the association from "hello" to "world", whereas the other one points to the one associating "spam" with "eggs". To understand this, consider the lowest bit of the hash the index of the branch to take at the first branching point.

Two AssocNodes.

Now let us consider one more association. Let the key be "foo", its hash 0b011 and the value "bar". The AssocNode that is the leaf of the 1-branch of the branching point on the 0th level now needs to be replaced by another branching point that contains the AssocNode for "hello" on the 0-branch and the AssocNode for "foo" on the 1-branch.

Three AssocNodes.

Lookup is really straightforward. The tree is walked until a leaf is reached, taking the branch represented by the value of the respective bit at every branching point; this means, the branch to be taken on the 0th level is decided by the 0th bit, the branch to be taken on the first level by the next bit, and so forth, until a leaf is eventually reached. If the leaf is a AssocNode, it is checked whether its key equals the one requested, if it is so, the value is returned, otherwise a KeyError is thrown. If the leaf node is a HashCollisionMode, the key of every of the colliding AssocNodes is checked against the requested key, and, if one of them matches, its value returned; if the leaf is a NullNode (which is only possible for empty trees with a binary tree), a KeyError is raised.

Lookup.

Removing an item (done by calling the without method) is done by walking the tree to the item that needs to be removed, and then replacing that with a NullNode (additionally removing any branching points rendered redundant by removing the item).

In the next post, I will elaborate the implementation of a tree maps that is feasible for real-life scenarios by being implemented as a 32-ary tree.

Saturday, 15 January 2011

info shred

The best way to remove something irretrievably is to destroy the media it's on with acid, melt it down, or the like. For cheap removable media like floppy disks, this is the preferred method. However, hard drives are expensive and hard to melt, so the `shred' utility tries to achieve a similar effect non-destructively.

Emphasis mine. The GNU folks have a strange kind of humour.

Saturday, 1 January 2011

Mozilla data disclosure 2

I have received a response with a link to a blog post concerning the incident, which says.
The database included 44,000 inactive accounts using older, md5-based password hashes. We erased all the md5-passwords, rendering the accounts disabled. All current addons.mozilla.org accounts use a more secure SHA-512 password hash with per-user salts. SHA-512 and per user salts has been the standard storage method of password hashes for all active users since April 9th, 2009.

It is important to note that current addons.mozilla.org users and accounts are not at risk. Additionally, this incident did not impact any of Mozilla’s infrastructure. This information was also sent to impacted users by email on December 27th

Mozilla data disclosure

Hello and happy New Year everyone!

I have just scanned my mailbox for new emails, and one of them particularly caught my attention: Its title was “Important notice about your addons.mozilla.org account”. To be honest, I did not remember I even had an account there, but because they claimed it were important, I read it. Reading it proved to be worthwhile and here are the most important passages of the text:
“Subject: Important notice about your addons.mozilla.org account
The purpose of this email is to notify you about a possible disclosure of your information which occurred on December 17th. […] On this date, we were informed by a 3rd party who discovered a file with individual user records on a public portion of one of our servers. […] This file was placed on this server by mistake and was a partial representation of the users database from addons.mozilla.org. The file included email addresses, first and last names, and an md5 hash representation of your password.
So effectively they are telling you that they hashed your full name, your email address and the MD5 hash of your password because someone put a file into public that did not belong there. What really puzzled me was not the fact that this awkward incident happened, but rather the wording of the emphasised sentence: It said nothing about whether salting was used to additionally protect the password; is it possible that a company like Mozilla does not consider the user passwords to be sensitive enough to add the additional security layer of salting?

But it is not all bad, they disabled the accounts where data was possibly leaked, investigated on the issue and promise not to let it happen again. The last sentence of the sentence also promises that they may add salting, if they do not already do it.
“The reason we are disclosing this event is because we have removed your existing password from the addons site and are asking you to reset it by going back to the addons site and clicking forgot password. We are also asking you to change your password on other sites in which you use the same password. Since we have effectively erased your password, you don't need to do anything if you do not want to use your account. It is disabled until you perform the password recovery. We have identified the process which allowed this file to be posted publicly and have taken steps to prevent this in the future. We are also evaluating other processes to ensure your information is safe and secure.
So, curious on whether salting is used at Mozilla's, I decided to contact the address they supplied for questions with the following email.
Subject: Protection of user data
Dear sir or madam!

The email your corporation has sent on 28.12.2010 concerning possible disclosure of user data on addons.mozilla.org has left me with the question of whether the user passwords stored in your database are salted or not. I consider salting of user passwords to be of utmost importance, as it significantly reduces the risk of disclosing
passwords by preventing the attacker to use a rainbow attack. The wording that elaborated which data has been leaked in the abovementioned email (“The file included email addresses, first and last names, and an md5 hash representation of your password.”) has left me with the impression that you do not salt user passwords and thus I would really appreciate if you could give me definite
information about the matter.

Best regards,
Florian Mayer
PS: I will make available publicly the information contained in the response to this email unless explicitly stated otherwise, as I firmly believe people have a right to know how the data they give away is secured and think the principles of openness repeatedly expressed by your corporation require the same.
To be continued …

Monday, 20 December 2010

man dbus-daemon

Wonderful. This shows only the most motivated and competent people work on dbus:
“Complex globs like "foo.bar.*" aren't allowed for now because they'd be work to implement and maybe encourage sloppy security anyway.”
— man dbus-daemon

Emphasis mine.

Wednesday, 24 November 2010

Fedora

Apparently the Fedora Live-CD does not like my Intel Fake RAID (I need to use one because I need Windows-compatibility) whereas the 3.3 GB full-install DVD does. This inconvenience has caused me to question my decision of replacing my Ubuntu system with Fedora, as it does not seem as if it were the sophisticated system I have been looking for either.