Thursday, 21 February 2019

Bazely thinking and the tale of the content-addressable cache...

Bazel is awesome.  It does a lot of work to ensure hermetic, reproducible builds.  But thinking in those terms means that the Bazel engineers approach certain problems from a different point of view than the rest of us.  I discovered this while trying to debug a weird problem where my machine was working and a colleague's was failing, both on a clean checkout.

It all came down to the content-addressable cache.

So... what the heck am I talking about? Let me set the stage.


Misleading cache hits


I was working on a Bazel conversion experiment (we're looking at migrating to Bazel, but needed to try it out in a limited scope).  The project uses Kotlin, and the Kotlin rules require downloading the Kotlin compiler, which is hosted on GitHub.  The default in the Bazel Kotlin rules is 1.2.70 (at the time of this writing), but I wanted to pull in a different version.  All well and good.  You set the version, you put in the sha256 of the file, and ... go.  On my machine it worked flawlessly.  On my colleague's machine, it dutifully downloaded the binary and then threw a fit over a bad checksum.  I repeat... we had exactly the same checkout.  Same environment... or so we thought.
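
For flavor, here's roughly what pinning a download looks like in a WORKSPACE file.  This sketch uses Bazel's generic http_archive rule rather than the Kotlin rules' actual setup API (which differs), and the sha256 here is a placeholder:

    # A sketch with the generic http_archive rule; the real rules_kotlin
    # setup API is different, but the pinning idea is the same.
    load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

    http_archive(
        name = "kotlin_compiler",  # hypothetical repository name
        urls = ["https://github.com/JetBrains/kotlin/releases/download/v1.2.71/kotlin-compiler-1.2.71.zip"],
        # Placeholder -- put in the sha256 of the file you actually expect.
        sha256 = "0000000000000000000000000000000000000000000000000000000000000000",
    )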

I dug around and finally found out that (a) I had put in the wrong sha256, the one for the default 1.2.70 version; (b) Bazel was satisfying my request not from the network but from a machine-wide "content-addressable" cache, whereas on my colleague's machine Bazel had never downloaded any files before, so it went to the network and choked when it fingerprinted the file; and (c) it was supplying 1.2.70 when I asked for 1.2.71 because the content-addressable cache indexes by the hash alone, and the URL plays no part in the caching.  It literally only cared about the name when it wrote the file contents from the cache into my build's working directory.
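
If you want to see this on disk: the repository cache lives under Bazel's output user root, and (at least on my Linux machine) its entries are keyed purely by hash, roughly like so:

    ~/.cache/bazel/_bazel_<user>/cache/repos/v1/
      content_addressable/
        sha256/
          <sha256-of-the-content>/
            file    <-- the cached bytes; no URL appears anywhere in the key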

Wait what?  The URL played no part in the cache index?

Must be a bug


I originally went to work up a repro case, thinking that this was a clear bug.  I asked for http://github.com/blah/blah/blah/blah-1.2.71.zip, and it gave me 1.2.70, because I had put in the wrong sha256 hash.  It should have caught my error!  Bad Bazel.  How dare it assume I knew what I was doing.  That's not safe infrastructure.  And then it hit me.  Bazel and I were thinking of the world differently.  The key was in the name "content addressable" (which I will call CA from now on; sorry, Californians and Canadians).

So Bazel was putting this in the CA cache because it was literally saying: the content is the key.  Whatever the file name is, if the content hashes to <some number>, then any request for something with that same number must want the same content.
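
In pseudocode, the behavior looks something like the sketch below.  This is my mental model, not Bazel's actual implementation, and download and sha256 are hypothetical helpers:

    # Sketch of content-addressable fetching -- not Bazel's real code.
    # "download" and "sha256" are hypothetical helpers for illustration.
    def fetch(url, expected_sha256, cache):
        if expected_sha256 in cache:
            # Cache hit: the URL is never even consulted.
            return cache[expected_sha256]
        content = download(url)
        if sha256(content) != expected_sha256:
            # The checksum is only verified on a cache miss.
            fail("checksum mismatch for " + url)
        cache[expected_sha256] = content
        return content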

I, on the other hand, was thinking in URL-centric terms.  I wanted the file found at that location (I thought), and I supplied the hash to verify it.  These aren't incompatible world-views for the most part.  Usually I really do want the content; I just assumed it was going to be downloaded from that location.  It's only around this error-handling question that the two views diverge, and Bazel's view makes sense once I consider its goal of fast, hermetic builds.

Bazel simply assumes that you mean it when you put in the expected sha256.  Bazel assumes you're not just naively cutting and pasting.  It'll verify the hash the first time it downloads the file, but in this case I had supplied a perfectly valid sha256 for which it already had a valid file.  And it served it... from the cache that is addressed/keyed by the content of the file (or at least its hash).


Is Bazel right here?


Yes and no.  This is a tricky thing.  Bazel uses sha256 hashes to make sure builds are repeatable and immutable (the same inputs lead to the same outputs), in service of both security and performance.  Bazel thinks "same inputs, same output" and, in part, doesn't give a crap where the content came from.  Most downloading rules let you give multiple URLs, and Bazel will take whichever one it can use, as long as the content's hash matches.  Even there, it's not treating a canonical "location" as the key, but the content itself.  It was largely only my prejudices that led me to assume otherwise.
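
The multiple-URL behavior looks like this in a WORKSPACE file (the name and URLs here are made up).  Bazel tries the mirrors in order and accepts whichever response hashes to the expected value:

    # Hypothetical example: any of these mirrors will do, because the
    # sha256 -- not the location -- decides whether the content is right.
    load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

    http_archive(
        name = "some_dep",  # hypothetical
        urls = [
            "https://mirror.example.com/some-dep-1.0.zip",
            "https://github.com/example/some-dep/archive/v1.0.zip",
        ],
        sha256 = "0000000000000000000000000000000000000000000000000000000000000000",  # placeholder
    )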

To what benefit?  Well, assuming I don't make an error on my side, there are quite a few.  For one, download poisoning is harder, in a variety of ways out of scope here.  Additionally, anything addressable by sha256 can be downloaded once and never fetched from the wire again, even on a clean build (since the cache entry isn't "dirtied" when the hashes are the same).  This is a big win on continuous-integration machines, where many projects downloading the same files over and over can share a single cache.  It also provides solid ground for building distributed caches.
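
As a concrete example of that CI win: if I read the flags right, you can point every build on a machine at one shared cache directory with --repository_cache, e.g. in a bazelrc (the path is made up):

    # In a machine-wide bazelrc on a CI host (the path is made up):
    build --repository_cache=/var/cache/bazel-repo-cache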

I went to file the bug, but have decided that this, surprisingly, is a feature, not a bug.


Thinking more bazely


Now that I have adjusted my expectations, I can think about the content (or its hash) as the unit of account, and things like URLs as merely the fetch mechanism for satisfying the content if it isn't already on hand.  This should both help me avoid making this particular mistake again and help me understand a lot about the design decisions of the tooling.

I realize that, in retrospect, this might seem obvious.  It seems that way to me, too, now.  But coming from the web, it's easy to privilege addresses over content.  It's hard to quantify or express in words, but a lot of subtle things that bugged me about Bazel's (and Starlark's) design and idiosyncrasies have smoothed out in my brain because of this simple perspective adjustment.  Hopefully it helps others reason more effectively about such things as well.