In both the .git/annex directory and the git-annex branch, two levels of hash directories are used, to avoid issues with too many files in one directory.

Two separate hash methods are used.

  • hashdirmixed is only used for non-bare git repositories. (We'd like to stop using this, but it'd be too annoying to change all the git-annex symlinks!)

  • hashdirlower is used for bare git repositories, the git-annex branch, and on special remotes as well.

Note that git annex find and git annex examinekey can be used with the --format option to find the hash directories. The explanation below is only for completeness.

new hash format

This uses two directories, each with a three-letter name, such as "f87/4d5".

The directory names come from the md5sum of the key.

For example:

echo -n "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" | md5sum
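The same computation can be sketched in Python. This is only an illustration: it assumes that the two three-letter names are the first six hexadecimal digits of the md5sum, which the text above does not state explicitly. The function name hashdirlower is borrowed from the --format variable and is not a git-annex API.

```python
import hashlib


def hashdirlower(key: str) -> str:
    """Sketch of the new (lowercase) hash directory for a key."""
    # md5sum of the key name itself, not of the file content.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # Assumption: the directory names are hex digits 1-3 and 4-6.
    return f"{digest[:3]}/{digest[3:6]}"


key = "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(hashdirlower(key))
```

The printed value can be compared with the output of git annex examinekey --format='${hashdirlower}' for the same key.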

old hash format

This uses two directories, each with a two-letter name, such as "pX/1J".

It takes the md5sum of the key, but rather than a string, represents it as four 32-bit words. Only the first word is used. It is converted into a string by the same mechanism that would be used to encode a normal md5sum value into a string, but where that would normally encode the bits using the 16 characters 0-9a-f, this instead uses the 32 characters "0123456789zqjxkmvwgpfZQJXKMVWGPF", taking 5 bits per character and skipping 1 bit between each group of 5 bits. The first 2 letters of the resulting string are the first directory, and the second 2 are the second directory.
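A minimal Python sketch of this encoding follows. It rests on several assumptions taken from a reading of display_32bits_as_dir in git-annex's Locations.hs rather than from the description above: the first word is read as little-endian, 5-bit groups are taken at 6-bit strides (so 1 bit is skipped between groups), and the two characters of each pair are swapped. Treat it as an approximation to check against the real tool, not as the reference implementation.

```python
import hashlib
import struct

# The 32 characters named in the text above.
ALPHABET = "0123456789zqjxkmvwgpfZQJXKMVWGPF"


def hashdirmixed(key: str) -> str:
    """Sketch of the old (mixed-case) hash directory for a key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Assumption: only the first 4 bytes are used, as a little-endian word.
    (first_word,) = struct.unpack("<I", digest[:4])
    # Assumption: 5 bits per character, read at 6-bit strides.
    letters = [ALPHABET[(first_word >> (6 * i)) & 31] for i in range(4)]
    # Assumption: the characters within each pair are swapped.
    return f"{letters[1]}{letters[0]}/{letters[3]}{letters[2]}"


key = "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(hashdirmixed(key))
```

The result is best verified against git annex examinekey --format='${hashdirmixed}' for the same key.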

chunk keys

The same hash directory is used for a chunk key as would be used for the key that it's a chunk of.
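One way to picture this is to recover the parent key from a chunk key before hashing. The sketch below assumes the git-annex key format, in which a chunk key carries extra -S (chunk size) and -C (chunk number) fields before the "--" separator. The helper name non_chunk_key and the example key are made up for illustration.

```python
import re


def non_chunk_key(key: str) -> str:
    """Return the key that a chunk key is a chunk of (sketch)."""
    # Everything before the first "--" is metadata fields; the rest is the name.
    fields, sep, name = key.partition("--")
    if not sep:
        return key  # not in the expected format; leave untouched
    # Assumption: chunk fields look like -S<digits> and -C<digits>.
    fields = re.sub(r"-[SC]\d+", "", fields)
    return f"{fields}--{name}"


# Hypothetical chunk key: chunk 1 of a 100-byte object, chunk size 10.
print(non_chunk_key("SHA256E-s100-S10-C1--abc"))  # SHA256E-s100--abc
```

The hash directory for the chunk would then be computed from the returned key rather than from the chunk key itself.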

The correct old hash value for the empty file SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 is pX/ZJ.

The text describes the old hash value computation incorrectly, because it doesn't mention that 1 bit is skipped between each group of 5 bits. See the sample implementation in display_32bits_as_dir in https://github.com/joeyh/git-annex/blob/master/Locations.hs

1c to support Péter's statement:

$> git annex examinekey --format='${hashdirmixed}' "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
pX/ZJ/%  
Are the characters "0123456789zqjxkmvwgpfZQJXKMVWGPF" chosen randomly for the base32 encoding, or was there a reason to choose exactly these?
Comment by josch Sat Jan 31 17:13:57 2015

The only reason for the letter choice is that it avoids making random words with possibly unintentional meanings.

Comment by joey Wed Feb 4 17:14:24 2015

why the extra processing to generate the hashing directories?

We already have a hash here: for example, SHA256E-s8242375--5f82490990812ad3feabb02355750710a9d94283ab256d1c691c3bf8d7d9fbe3.ogg has a long 5f82490990812ad3feabb02355750710a9d94283ab256d1c691c3bf8d7d9fbe3 hash. Why not use the first characters of that? This will not change for a given file, and has a higher chance of generating collisions (which is a good thing here, because we can reuse directories).

In other words, why aren't the hashes of SHA256E-s8242375--5f82490990812ad3feabb02355750710a9d94283ab256d1c691c3bf8d7d9fbe3.ogg simply 5f8/249? --anarcat

Comment by https://id.koumbit.net/anarcat Fri Feb 13 15:59:46 2015
Not all types of keys contain hashes.
Comment by joey Tue Feb 17 21:51:59 2015
I wrote a Python implementation of the two hashing functions for a project of mine. Here it is, hoping it can be helpful for someone.
Comment by giomasce Sun Mar 22 22:38:54 2015