I wrote this article when I worked a lot on Gitly back in 2012 but somehow never got around to publishing it. I thought it was too technical. Maybe someone will find it useful though so I’m posting it now for fun…
Just the other day I was chugging along nicely with the development of Gitly (my own Git client) when suddenly I stumbled across a peculiar issue. Gitly claimed that a file in the working directory was modified even though I knew it wasn’t. No other Git client reported it as modified either so clearly it must be a bug in my code, right?
Obviously that was my initial thought as well so I started bug hunting. It must certainly be a problem with my code for calculating the SHA-1 hash for the file, I thought. Well, it wasn’t. After debugging it for a while I got even more puzzled. It seemed that my code was right but that the values inside the Git repository were wrong! How could that be possible?!
- My program thinks that a file has been modified.
- I know that the file was not modified.
Seems like a bug that should be easy to fix, right? Well, it took me deep down the rabbit hole which is why I wanted to write about it…
(Bear with me if this is a little too technical…)
First, let’s take a step back and look at the whole picture here…
To find out what has changed in the working directory I compare the files of the latest commit in the repository with the files in the working directory. Each file has a SHA-1 hash that identifies it, so to find out if a file is modified all I have to do is compare the hashes. The file is a C# file called AppBootstrapper.cs that I knew for a fact hadn’t changed. In this particular case Gitly concludes that the files have different hashes and thus the file must have changed. My code that calculates the hash for this file in the working directory found it to be 2bdb764a6d2a8c7d92dc3f194f8a612c1f524795. But the file AppBootstrapper.cs is stored in the Git repository under a different hash, which is eb72ad7dc91b71eba1fe6bf06b7186ed4c94a65b. Clearly something must be wrong here. Probably I am calculating something in the wrong way, right?
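For reference, the hash Git assigns to a file is not the SHA-1 of the raw bytes alone: Git first prepends a small header (the word blob, a space, the byte length and a NUL byte). Here is a minimal Python sketch of that calculation. The function name git_blob_hash is made up for illustration, and this is not Gitly's actual code.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with these bytes."""
    # Git hashes a header ("blob <size>\0") followed by the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The empty blob has the same well-known hash in every repository.
    print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the content bytes go straight into the hash, even a single differing byte (such as an extra carriage return) produces a completely different hash.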
Let’s set the stage here. We have two hash codes for the same file. One of them must be wrong! Since hash codes are pretty long and scary let’s refer to them as hash code A and B like this:
- Hash A: 2bdb764a6d2a8c7d92dc3f194f8a612c1f524795
- Hash B: eb72ad7dc91b71eba1fe6bf06b7186ed4c94a65b
The theory I have now is that my code calculates hash A for the file but it must be doing it wrong. The correct hash should be hash B. Let’s examine the repository to see if I am right.
To dig into the repository I use the “real” Git command line client for Windows (msysgit). What does it say about this file?
$ git status AppBootstrapper.cs
# On branch master
nothing to commit (working directory clean)
Alright, this seems fine. The file is not modified according to msysgit. Let’s see if we can find out more. To calculate the hash for a file you can use the command git hash-object <filename>. Here is the result:
$ git hash-object AppBootstrapper.cs
What?! This is not what I expected. This hash is exactly the same as the one my code produces, i.e. hash A. Something is fishy here.
Next step, dear Git – what is it that you have stored in the repository? It sure does not seem to be the same file that we have on disk. Let’s dig even deeper…
We can dump the contents of a Git object using git cat-file -p <hash> like this:
$ git cat-file -p 2bdb764a6d2a8c7d92dc3f194f8a612c1f524795
error: unable to find 2bdb764a6d2a8c7d92dc3f194f8a612c1f524795
fatal: Not a valid object name 2bdb764a6d2a8c7d92dc3f194f8a612c1f524795
Uh oh, we have nothing stored with that hash code? There is no object in the Git repository with that hash code. What about the other hash then?
$ git cat-file -p eb72ad7dc91b71eba1fe6bf06b7186ed4c94a65b
Okay, this is clearly the file we are looking for. However, there is some garbage at the start of the file. Could it possibly be the UTF-8 byte order mark? Maybe that is what is causing problems? Let’s dig a little deeper…
Git stores everything quite logically so looking up an object on disk is straightforward (as long as the repository is not in a packed format). I found the file stored under .git/objects/eb/72ad7dc91b71eba1fe6bf06b7186ed4c94a65b.
Okay what can we do? I found that git has another command to dump the contents of a blob that escapes special characters. You simply write git show <hash> to use it. Let’s see what it gives us:
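The mapping from a hash to a loose object file is simple enough to sketch in a few lines of Python: the first two hex digits become a directory name and the remaining 38 become the file name. The helper name loose_object_path is hypothetical, and the sketch only covers loose objects, not packed ones.

```python
from pathlib import PurePosixPath


def loose_object_path(git_dir: str, object_hash: str) -> str:
    """Build the path of a loose object from its 40-character hash."""
    if len(object_hash) != 40:
        raise ValueError("expected a full 40-character SHA-1 hash")
    # The first two hex digits name the directory, the other 38 the file.
    return str(PurePosixPath(git_dir, "objects", object_hash[:2], object_hash[2:]))


print(loose_object_path(".git", "eb72ad7dc91b71eba1fe6bf06b7186ed4c94a65b"))
# .git/objects/eb/72ad7dc91b71eba1fe6bf06b7186ed4c94a65b
```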
$ git show eb72ad7dc91b71eba1fe6bf06b7186ed4c94a65b
Nifty indeed! It escaped the characters for us and now we clearly see that it indeed is the UTF-8 BOM which consists of the byte sequence 0xEF, 0xBB, 0xBF.
The problem is that the file on disk also has this exact same byte order mark. No luck there…
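Checking for the byte order mark does not require eyeballing escaped output. A small Python sketch can test both the file on disk and the blob contents for the same three leading bytes. The function name and the sample content are made up for illustration.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark."""
    # codecs.BOM_UTF8 is the byte sequence 0xEF, 0xBB, 0xBF.
    return data.startswith(codecs.BOM_UTF8)


sample = b"\xef\xbb\xbfusing System;\r\n"  # made-up file content
print(has_utf8_bom(sample))            # True
print(has_utf8_bom(b"using System;"))  # False
```

If both copies report the same result, as they did here, the BOM cannot explain the hash mismatch.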
With a little help of some Ruby we can unpack the object from the Git repository and take a look inside:
>ruby -rzlib -e 'print Zlib::Inflate.new.inflate(STDIN.read)' < ./eb/72ad7dc91b71eba1fe6bf06b7186ed4c94a65b
-e:1:in `inflate': invalid distance too far back (Zlib::DataError)
from -e:1:in `<main>'
What the heck? There seems to be an error in the compressed file?
$ git fsck
Checking object directories: 100% (256/256), done.
dangling blob 4e81d9100e10ececbb12d8375710047d0a8a7b25
dangling blob 4b7a370cf24b0a9aec69950ffbcb51c5920437e0
No problem with the repository there.
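Since git fsck is happy, one thing worth ruling out is that the compressed bytes were altered on their way into the decompressor, for example by a text-mode stream. That is only an assumption, not a confirmed diagnosis of the error above. Here is a hedged Python sketch that reads a loose object strictly in binary mode and splits off the header. The function name read_loose_object is hypothetical.

```python
import zlib
from typing import Tuple


def read_loose_object(path: str) -> Tuple[str, bytes]:
    """Return (object type, content) for a loose Git object file."""
    # Open in binary mode so no newline translation touches the bytes.
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    # The inflated data is "<type> <size>\0" followed by the content.
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), content
```

Calling read_loose_object(".git/objects/eb/72ad7dc91b71eba1fe6bf06b7186ed4c94a65b") from the repository root should then return the object type together with the stored bytes, line endings included.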
Ok this calls for bigger guns! Let’s see what libgit2 does with this file…
I wrote a little program that calls libgit2 to get the hash code for the file, like this:
GIT_OBJ_BLOB) != GIT_SUCCESS)
std::cout << "Hash: " << s << std::endl;
Now guess what! This little bugger gives me hash code B! Very interesting! Here is a screenshot of my debugging session showing the hash code I got:
Digging into the git_odb_hashfile method a bit more I learn that the actual hashing simply hashes the raw data of the file without considering the core.autocrlf flag. Aha! It’s starting to make some sense now. It’s a bug in libgit2!
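The difference between the two approaches can be sketched in Python. One function hashes the bytes exactly as they are on disk and the other converts CRLF to LF first, which is roughly what Git does on the way in when autocrlf is enabled. This is a simplification: real Git also decides whether a file counts as text, which the sketch ignores. The function names and the sample content are made up.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """SHA-1 of the Git blob header followed by the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def hash_raw(content: bytes) -> str:
    """Hash the bytes exactly as they are on disk."""
    return git_blob_hash(content)


def hash_normalized(content: bytes) -> str:
    """Hash after converting CRLF to LF (simplified autocrlf behavior)."""
    return git_blob_hash(content.replace(b"\r\n", b"\n"))


working_copy = b"class AppBootstrapper\r\n{\r\n}\r\n"  # made-up content
print(hash_raw(working_copy) == hash_normalized(working_copy))  # False
```

For a file with Windows line endings the two functions disagree, which is exactly the hash A versus hash B situation.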
In this memory dump we clearly see a lot of similar sequences that I have highlighted, which are nothing but Carriage Return and Line Feed characters, i.e. standard Windows line endings. In other words, the file was committed to the repository without stripping the CR characters, probably because it was committed before I happened to turn on the autocrlf flag.
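The same check can be done without a memory dump. Here is a small Python sketch that counts Windows line endings and bare LF line endings in a chunk of raw bytes, for example the contents of a blob. The function name and the sample bytes are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF and bare LF line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF also contains an LF, so subtract to get the bare ones.
    lf_only = data.count(b"\n") - crlf
    return {"crlf": crlf, "lf": lf_only}


print(count_line_endings(b"one\r\ntwo\r\nthree\n"))  # {'crlf': 2, 'lf': 1}
```

A blob that reports any CRLF endings was stored without line ending normalization.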
Of course it’s pretty obvious once you have all the facts but these things can really make you go crazy before you solve them.
Line endings are the bane of git and I don’t know why this mess exists. In my opinion end users should not have to worry about this at all. It should all just work. Unfortunately I don’t have any solution for this.
So there you go, mystery solved! It was the bane of git – the line endings…
If you want to learn more about this problem you should check out this blog post: http://timclem.wordpress.com/2012/03/01/mind-the-end-of-your-line/
In this case I’m guessing that msysgit handles this by doing an actual diff on the contents. The diff will strip out line ending differences so the file will appear unchanged even if the hash codes mismatch. There you go, some git internals revealed!
Fixing the repository…
First I added a .gitattributes file with the following line:
# Ensure text files are normalized
This ensures that everyone working with the repository uses the same setting which is a good thing to ensure we don’t run into more problems in the future.
$ rm .git/index
$ git reset
Unstaged changes after reset:
Nifty! Git has now realized that this file needs to be updated. I just committed this changed file and that was it…