Wednesday 16 September 2020

Jackson, YAML, and Kotlin data classes, and SnakeYaml formatting

There have been a few articles on how to use YAML parsing with Kotlin data classes, covering early frustrations with SnakeYaml and how to do it with Jackson's wrapping of SnakeYaml. Following them, you can generally get a functional setup going.

Cool as far as it goes, but Jackson doesn't really expose much of SnakeYaml's DumperOptions (where all the fun config is). SnakeYaml has quite a few formatting options available to it which are essential to writing readable YAML, but these aren't exposed in the Jackson-supplied APIs. So, how to get this configuration exposed?

Unfortunately the answer is, while not hard, messy. Thankfully, Jackson's code is pretty open to extension. This is necessary, because to get your DumperOptions configuration in, you have to intercept where this option class is created, and you have to do that in an override of YAMLGenerator. But... YAMLGenerator is constructed by YAMLFactory, so you also need to intercept THAT in a subclass which overrides the creation method, in order to plumb through your custom YAMLGenerator. Messy, but doable. 

Here's what mine looked like:

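import com.fasterxml.jackson.annotation.JsonInclude
import com.fasterxml.jackson.core.ObjectCodec
import com.fasterxml.jackson.core.io.IOContext
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory
import com.fasterxml.jackson.dataformat.yaml.YAMLGenerator
import com.fasterxml.jackson.dataformat.yaml.YAMLMapper
import com.fasterxml.jackson.module.kotlin.KotlinModule
import java.io.IOException
import java.io.Writer
import org.yaml.snakeyaml.DumperOptions
import org.yaml.snakeyaml.DumperOptions.FlowStyle
import org.yaml.snakeyaml.DumperOptions.NonPrintableStyle
import org.yaml.snakeyaml.DumperOptions.ScalarStyle

// The mapper is built on the custom factory so that every generator it creates is ours.
val mapper: YAMLMapper = YAMLMapper(MyYAMLFactory()).apply {
    registerModule(KotlinModule())
    setSerializationInclusion(JsonInclude.Include.NON_EMPTY)
}

// Overrides the one hook where Jackson builds SnakeYaml's DumperOptions.
class MyYAMLGenerator(
    ctx: IOContext,
    jsonFeatures: Int,
    yamlFeatures: Int,
    codec: ObjectCodec,
    out: Writer,
    version: DumperOptions.Version?
) : YAMLGenerator(ctx, jsonFeatures, yamlFeatures, codec, out, version) {
    override fun buildDumperOptions(
        jsonFeatures: Int,
        yamlFeatures: Int,
        version: DumperOptions.Version?
    ): DumperOptions {
        // Start from Jackson's defaults, then layer the formatting tweaks on top.
        return super.buildDumperOptions(jsonFeatures, yamlFeatures, version).apply {
            defaultScalarStyle = ScalarStyle.LITERAL
            defaultFlowStyle = FlowStyle.BLOCK
            indicatorIndent = 2
            nonPrintableStyle = NonPrintableStyle.ESCAPE
            indent = 4
            isPrettyFlow = true
            width = 100
            this.version = version
        }
    }
}

// Plumbs the custom generator through the factory's (internal) creation method.
class MyYAMLFactory : YAMLFactory() {
    @Throws(IOException::class)
    override fun _createGenerator(out: Writer, ctxt: IOContext): YAMLGenerator {
        val feats = _yamlGeneratorFeatures
        return MyYAMLGenerator(ctxt, _generatorFeatures, feats, _objectCodec, out, _version)
    }
}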

This is pretty gross. It's overriding some internal methods in YAMLFactory and YAMLGenerator, but thankfully, it's only two classes, and not terribly deep into the mess.  As a result, I managed to make use of the FlowStyle.BLOCK option, which fixes a known problem in Jackson's YAML handling, where instead of this:

employees:
 - name: John
   age: 26
 - name: Sally
   age: 31

you get this:

employees:
 -
  name: John
  age: 26
 -
  name: Sally
  age: 31

There are other formatting niceties that you can do with SnakeYaml that aren't exposed by Jackson - no longer!
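For what it's worth, here's a minimal usage sketch, assuming jackson-dataformat-yaml and jackson-module-kotlin are on the classpath. The Employee and Company classes (and the plainMapper name) are made up for illustration. It uses a stock YAMLMapper so that it stands alone; pass the custom factory to the constructor, as in the mapper above, to pick up the DumperOptions tweaks.

```kotlin
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.dataformat.yaml.YAMLMapper
import com.fasterxml.jackson.module.kotlin.registerKotlinModule

// Hypothetical data classes matching the employees example.
data class Employee(val name: String, val age: Int)
data class Company(val employees: List<Employee>)

// Stock mapper so the sketch stands alone; swap in YAMLMapper(MyYAMLFactory())
// to get the custom formatting.
val plainMapper: ObjectMapper = YAMLMapper().registerKotlinModule()

fun main() {
    val company = Company(listOf(Employee("John", 26), Employee("Sally", 31)))
    val yaml = plainMapper.writeValueAsString(company)
    println(yaml)
    // Round-trip back into the data classes to confirm nothing was lost.
    check(plainMapper.readValue(yaml, Company::class.java) == company)
}
```

Printing the output with each mapper side by side is the quickest way to see what a given DumperOptions setting actually changes.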

Anyway - this can always be factored into something improved.  I'm honestly not sure how easy the Jackson project is to contribute to, but this could be exposed in public APIs without too much irritation. Regardless, I'm licensing the above code under the MIT license (as permissive as I know how to make it), and it's also available at this github gist so you can just use it if you feel like it.

P.S. I really wish Moshi did YAML.

Thursday 21 February 2019

Bazely thinking and the tale of the content-addressable cache...

Bazel is awesome.  It does a lot of things to ensure hermetic builds, etc.  But part of thinking in these terms means that the Bazel engineers think about certain problems from a different point of view.  I discovered this in trying to debug a weird problem where my machine was working and a colleague's was failing, both on a clean checkout.

It all came down to the content addressable cache.

So... what the heck am I talking about? Let me set the stage.


Misleading cache hits


I was working on a bazel conversion experiment (we're looking at migrating to bazel, but needed to try it out in a limited scope).  It uses kotlin and the kotlin rules require downloading the kotlin compiler, located on github.   The default in the bazel kotlin rules is 1.2.70 (at time of this writing), but I wanted to pull in a different version.  All well and good.  You set the version, you put in the sha256 of the file, and ... go.  On my machine it worked flawlessly.  On my colleague's machine, it dutifully downloaded the binary, and then threw a fit over a bad checksum.   I repeat... we had exactly the same checkout. Same environment... we thought.

I dug around and finally found out that (a) I had put in the wrong sha256 - the one from the default 1.2.70 version, and (b) it was satisfying my request, not from the network, but from a machine-wide "content-addressable" cache.  On my colleague's machine, it had never downloaded any files before for bazel, so it tried to satisfy it from the network and choked when it fingerprinted the file.  And (c), after more digging, I realized that it was supplying 1.2.70 when I asked for 1.2.71 because the content-addressable cache indexes by the hash, and the URL plays no part in the caching.  It literally only cared about the name when it wrote the file contents from the cache into my build working directory.

Wait what?  The URL played no part in the cache index?

Must be a bug


I originally went to work up a repro-case, thinking that this is a clear bug.  I asked for http://github.com/blah/blah/blah/blah-1.2.71.zip, and it gave me 1.2.70, because I had put the wrong sha256 hash in.  It should have caught my error!  Bad bazel.  How dare it assume I knew what I was doing.  That's not safe infrastructure.   And then it hit me.  Bazel and I were thinking of the world differently.  The key was in the name "content addressable" (which I will call CA from now on, sorry Californians and Canadians). 

So, bazel was putting this in the CA cache because it was literally saying - the content is the key.  Whatever the file name is, if the content hashes to <some number>, then any request for something with the same number must want the same content.

I, on the other hand, was thinking in URL-centric terms. I wanted the file found at that location (I thought), and I supplied the hash to verify.  These aren't incompatible world-views for the most part. Usually I actually do want the content, I just assumed it's going to be downloaded from that location.  They diverge only around this error-handling question, and only when I consider Bazel's perspective of wanting to create fast, hermetic builds.

Bazel simply assumes that you mean it when you put in the expected sha256.  Bazel assumes you're not just naively cutting and pasting.  It'll check the first time you go to download it, but in this case, I used a perfectly valid sha256 hash for which it had a valid file.  And it served it. From the cache that is... addressed/keyed by the content of the file (or at least its hash).
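To make that concrete, here's a toy sketch of the behavior as I understand it. This is not Bazel's code: ContentAddressableCache, Fetcher, and the example.invalid URLs are all invented for illustration, and the "download" is faked with an in-memory map.

```kotlin
import java.security.MessageDigest

// Hex-encoded SHA-256 of some bytes.
fun sha256Hex(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256").digest(bytes)
        .joinToString("") { "%02x".format(it) }

// A toy content-addressable cache: the only key is the content hash.
class ContentAddressableCache {
    private val blobs = mutableMapOf<String, ByteArray>()
    fun put(content: ByteArray): String = sha256Hex(content).also { blobs[it] = content }
    fun get(sha256: String): ByteArray? = blobs[sha256]
}

// Hypothetical fetcher: consult the cache first, download only on a miss,
// and verify the hash of anything that comes off the wire.
class Fetcher(
    private val cache: ContentAddressableCache,
    private val download: (url: String) -> ByteArray
) {
    fun fetch(url: String, expectedSha256: String): ByteArray {
        cache.get(expectedSha256)?.let { return it } // the URL is never consulted on a hit
        val bytes = download(url)
        check(sha256Hex(bytes) == expectedSha256) { "bad checksum for $url" }
        cache.put(bytes)
        return bytes
    }
}

fun main() {
    val url70 = "https://example.invalid/compiler-1.2.70.zip"
    val url71 = "https://example.invalid/compiler-1.2.71.zip"
    val remote = mapOf(
        url70 to "compiler 1.2.70".toByteArray(),
        url71 to "compiler 1.2.71".toByteArray()
    )
    val hashOf70 = sha256Hex(remote.getValue(url70))

    // My machine: 1.2.70 was fetched earlier, so it is already cached.
    val warm = Fetcher(ContentAddressableCache()) { remote.getValue(it) }
    warm.fetch(url70, hashOf70)
    // Asking for 1.2.71 with the 1.2.70 hash quietly yields the 1.2.70 bytes.
    println(warm.fetch(url71, hashOf70).decodeToString()) // prints "compiler 1.2.70"

    // Colleague's machine: empty cache, so the download happens and the check fails.
    val cold = Fetcher(ContentAddressableCache()) { remote.getValue(it) }
    val result = runCatching { cold.fetch(url71, hashOf70) }
    println(result.exceptionOrNull()?.message) // prints "bad checksum for <the 1.2.71 URL>"
}
```

Same request, same (wrong) hash, two different outcomes, and the only variable is what happened to be in the cache.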


Is bazel right here?


Yes and no.  This is a tricky thing - Bazel is using sha256 hashes to make sure builds are repeatable and immutable (same inputs lead to the same outputs) and this is in service of both security and performance. Bazel thinks "same inputs, same output", and partly doesn't give a crap where the content came from.  For most downloading rules you end up being able to give it multiple URLs, and it'll take whatever one it can use, as long as the content's hash matches.  Even there, it's not seeing a canonical "location" as the key, but the content itself.  It's largely only my prejudices that led me to assume otherwise.
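The multiple-URL case falls out of the same idea. Here's a rough sketch, with the caveat that fetchFromAny, the mirror URLs, and the fake download function are all hypothetical and only illustrate "any source will do, as long as the hash matches".

```kotlin
import java.security.MessageDigest

// Hex-encoded SHA-256 of some bytes.
fun sha256Hex(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256").digest(bytes)
        .joinToString("") { "%02x".format(it) }

// Try each URL in order and accept the first response whose hash matches.
// `download` returns null when a source is unreachable.
fun fetchFromAny(
    urls: List<String>,
    expectedSha256: String,
    download: (String) -> ByteArray?
): ByteArray? =
    urls.asSequence()
        .mapNotNull { url -> download(url) }
        .firstOrNull { bytes -> sha256Hex(bytes) == expectedSha256 }

fun main() {
    val wanted = "release archive".toByteArray()
    val mirrors: Map<String, ByteArray?> = mapOf(
        "https://mirror-a.invalid/tool.zip" to null,                     // mirror is down
        "https://mirror-b.invalid/tool.zip" to "tampered".toByteArray(), // wrong content
        "https://mirror-c.invalid/tool.zip" to wanted                    // good copy
    )
    val bytes = fetchFromAny(mirrors.keys.toList(), sha256Hex(wanted)) { mirrors[it] }
    println(bytes?.decodeToString()) // prints "release archive"
}
```

Nothing in the result records which mirror answered, because from this point of view it doesn't matter.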

To what benefit?  Well, assuming I don't make an error on my side, there are quite a few.  For one, download poisoning is harder in a variety of ways out of scope here. Additionally, anything addressable by sha256 can now be downloaded once, and never downloaded again from the wire, even on a clean build (since it isn't "dirtied" because the hashes are the same).  This can lead to a lot of benefits on continuous-integration machines, where many projects downloading the same files over and over can avoid doing so. It also provides some solid ground for building distributed caches.

I went to file the bug, but have decided that this, surprisingly, is a feature, not a bug.


Thinking more bazely


Now that I have adjusted my expectations, I can think about the content (or the hash) as the unit of account, and things like URLs as the fetch mechanism for satisfying the content if it isn't already supplied.  This should both help me not make this particular mistake again and help me understand a lot about the design decisions of the tooling.

I realize that, in retrospect, this might seem obvious.  It seems that way to me, too, now.  When you're used to thinking about the web, privileging addresses over content is pretty easy.  It's hard to quantify or express in words, but a lot of subtle things that bugged me about Bazel's (and Starlark's) design and idiosyncrasies have smoothed out in my brain because of this simple perspective adjustment.  Hopefully it helps others reason more effectively about such things as well.

Wednesday 10 October 2018

Suiting up

I named this blog (and my tech-focused twitter account) GeekInASuit because I had spent a good chunk of my career in the financial industry doing technical architecture, and other technical consulting, and was the person who could talk to the nerds and the suits.  I wore a suit, and so signaled in a way that finance folks would talk to me, but also had earrings, long-hair, and otherwise "signaled" that I wasn't just a suit.   It was a great ride, and it was a lovely role, being the cultural translator, often gleaning important insights that helped clarify project details that could have been miscommunicated.  I relished that part of my career.

Then I took a job at Google, and everything changed. My role was decidedly technical, but all my customers were also technical.  I lost the primary purpose of geekinasuit, and even more, Googler culture really venerates the geek, not so much the suit.  To be honest, I kind of got shamed out of the suit. Lots of social pressure was applied, as well as a nearly endless supply of t-shirts - I swear, 20% of my compensation, by weight, was t-shirt.  So I relented, and spent most of a decade wearing jeans and t-shirts, usually with nerdy slogans or fan-service.  While in one sense, it didn't matter, I had come to like dressing up a bit.  I enjoyed taking a bit of time for self-care and grooming beyond simple standard hygiene.  So I was sadder than I realized, when I finally accepted it, and stopped upgrading my wardrobe as jackets and shoes and pants succumbed to wear and tear (and got a little tight, I admit).  It was with sadness a couple of months ago that I realized that my last suit was actually no longer going to fit me. The lining was worn out, and the wedding I was preparing to fly to would require actually buying a new suit.  I had let things go that far.

That wedding coincided with my departure from Google/YouTube. I left for a variety of reasons, moral, financial, emotional - I left with some sadness and wistfulness, but also with a sense of maybe getting back to myself.  I had gone down a deep hole in Google, lost a lot of professional contacts, reduced to only Google's technology stack. I mitigated it with doing a lot of my work in open-source, but it certainly wasn't a life I had led before, connecting with colleagues at conferences, serving customers more directly, and working with the technologies most of the industry uses.  But also... I stopped appreciating having my identity swallowed by the behemoth that Google had become.  Don't get me wrong - there are lots of good people and ideas and challenges at Google. I did work there that I'm very proud of (the Dagger and Truth frameworks, for example) but it also took me over, in many ways.  So I left, to join Square, and help in their mission of economic empowerment (by helping them scale up their development).

And I suited up.  I decided to restore at least that, even if just as a symbol to myself, to be better, to push myself, to care for myself, as superficial as clothing and appearance are.  So far, I've been suited-up 95% of the days I've been in the office, and it feels really good.  It's a state-change in my brain, segmenting a work mode from other modes, and it oddly helps me stay focused (pretty necessary in the open-office wasteland that characterizes basically every tech company for some reason).

I am, once again, a geek in a suit.  And I love it.


Thursday 17 November 2016

WTF does "vend" mean? A terminological journey with no clear ending.


So there I was, in the middle of a code review, adding a method which (in the language of dependency-injection) "provided" a value into a managed object graph with a different key, in advance of a migration.  Doesn't matter why.  But the word "provided" is so frequent that I went with a different word that (in my brain) seemed to mean the same thing: vend.

Now, the root of this little language odyssey is simply that I hate repeating words unnecessarily.  If you use dependency-injection, the word Provider is vastly overused thanks to Guice's Provider<T> interface, JSR-330 which standardized it, and its baby brother Dagger and other frameworks which adopted the standard terminology (Spring, J2EE, Tapestry, etc.).

Since the API involved was a method annotated with @Provides and the method was called provideBlah() (real method name changed to protect the innocent), I just wanted some variety in my life. So I described the change this way:
Vends a [redacted] into the dagger graph, in advance of an API change where [redacted] will consume that in place of [redacted].  Part of a migration to make [redacted] require fewer assumptions (and fewer build deps) of its consumers.
Could have been "supplies" but I didn't want to imply the Supplier<T> interface, which is a thing.  I went with "to vend".

In that context, I got a drive-by comment.

someuser
3:24 PM, Nov 16
What does "Vend" mean in the context of this change?

I was doing a cleanup using some of our awesome google-made bulk refactoring tools (notably Rosie), so this was one of those "dammit, why can't you just approve my change and let me get on with my life" moments.

At first, I just went ahead and answered:

cgruber
3:31 PM, Nov 16
> What does "Vend" mean in the context of this change?
Provide into the graph.

Not to be dissuaded, "someuser" pressed on:

someuser
3:56 PM, Nov 16
Normally, "vend" means to sell ...

Ok... gonna try again to avoid the digression and nerd-sniping...

cgruber
3:59 PM, Nov 16
Vend also implies supplying, and I was trying not to overload the term "provide" because in this context, "provide and bind" are both apt terms. Regardless, I've updated the description.
And vend only means sell in a societal context of capitalist voluntary exchange. I can't imagine it would mean sell in the context of an API. :)

This last paragraph was a total indulgence on my part, and the result of my having mainlined (like, literally injected into my arm) econ textbooks and treatises for the last few years.  And obviously my big mistake in the effort to avoid being nerd-sniped.

Not to be so easily dismissed... "someuser" decided to call me out.

someuser
6:17 PM, Nov 16
<nit-picking-mode>
I agree that "vend" implies supplying something, but I have only seen it in the context of a sale. With all due respect, can you point to a definition of vend that means to supply or provide, that is not in a "societal context of capitalist voluntary exchange"? (I'm actually really curious, as in, I quite often read etymologies of words. :-) )
> I can't imagine it would mean sell in the context of an API.
That's why I was confused :-)
</nit-picking-mode>
Thanks for changing the description though :-)

Oh, it's on like the break of dawn, now.

I started looking. I couldn't find definitional resources (but I knew this was a particular inflection of use in the context of computer software and API design).

I started with a web-search of the terms: "api which vends a type" just for starters. I was not disappointed. Nine relevant examples in the first three pages.

Then I got philosophical, noticing that the code examples in which APIs were described to "vend" things seemed to always be in Objective-C or Java sources. I started to think back, way into the early days of my career, steeped as they were in NeXTSTEP, and wondered whether there was a connection.

Here is what I replied.


cgruber
9:29 AM
No, but I can find examples of its usage in tech, from which I apparently have picked it up over a couple of decades:
Some of these you have to ctrl-f/cmd-f and search for "vend" as they're not in the description but in comments. Also, in some cases it is synonymous with "supplies" (as in via the return type) and in other cases with "offers" in the sense of exposing an API):
https://github.com/attic-labs/noms/issues/2589
https://github.com/realm/realm-cocoa/issues/3981
http://stackoverflow.com/questions/37128296/rest-api-oauth2-type-authentication-using-aws-cognito/37141020
http://nlp.stanford.edu/nlp/javadoc/javanlp-3.5.0/edu/stanford/nlp/objectbank/DelimitRegExIterator.html
https://framework.zend.com/apidoc/1.12/packages/Zend_Pdf.Fonts.html
https://docs.oracle.com/javaee/7/api/javax/faces/render/package-summary.html
https://vaadin.com/api/7.5.7/com/google/web/bindery/requestfactory/shared/InstanceRequest.html
https://jeremywsherman.com/blog/2016/07/23/why-im-meh-about-json-api/
http://liftweb.net/api/25/api/net/liftweb/http/LiftRules.html
(sourced from the first few pages of the google query "api which vends a type")
I have a hypothesis: This language originated in the NeXTSTEP community (of which I was a part), and entered into the MacOS/iOS community lexicon from that source, and also into the Java community by way of a lot of NeXTSTEP folks joining Sun and related Java-oriented enterprises (at one point Javasoft was 25% populated by former Lighthouse Design people, of which I was one). So I suspect I picked it up early, but it is a very uncommon (as I find out in researching it) usage... but not purely in my head. :)
Small addendum... a straw poll of my team which includes iOS developers as well as android developers suggests that it is vastly less common than I would have imagined from my own biases. That doesn't discount the above links but provides a bit of a ratio, a denominator for the numerator of anecdotes I cited above. Seems like a fringe usage, and sadly, provides no insight into from whence this minority usage actually derives.

How common is this in use behind the walls of corporate secrecy? Doing an internal code-search I see a handful of examples with a cursory scan of initial results - all  in API docs with this usage of "supply".  It at least seems that I'm not entirely out of my mind, or at least others share my heterodox usage.

So now I'm damned curious. How did I pick this up?  I see these examples of the usage - is there a common source?  Did we all pick it up from one place, or did we independently start using it the same way?

If any of the three people left following this blog have a clue about this, I'd love to hear more insights.

Wednesday 1 June 2016

The API apocalypse (APIocalypse?) is deferred, as the jury found that "Google's use of the APIs structure, sequence, and organization fell under fair use."

Google had this to say about the matter.

From here on in, I don't want to talk about the particulars of the case - I'm a Google employee, but my opinion is rooted in both copyright critique and my views as a software engineer.  In fact, I used to work for Oracle, and probably would have resigned over this, to be honest, if I still worked there.

I'm not happy about this.   Don't get me wrong - it is a partial victory for sanity in software.  But it leaves the travesty of applying copyright to APIs unanswered (though I'm not sure, since IANAL, that this case could have resolved anything about what I'm concerned about, since it's a lower-court ruling about specifics).

The issue for me is this: a higher court found that APIs (application programming interfaces - or in layman's terms, the specification of how to talk to a software library) are subject to copyright.  Fair use only means that it's a legitimate exception to what is otherwise copyright-able.  So while I'm glad that this was a legit exemption, the whole underlying theory is problematic... and that has yet to be fixed.

Having APIs (and their "structure and organization") be copyright-able in theory at all is insane.  It's like saying "English grammar is subject to copyright, but if you want to talk to an English speaker, it's cool to use that grammar and lexicon - it's fair use."

No... it's not merely fair use - it's the entire point of language.  If you have to use different names, that's like saying "you can speak English, but you need to use different words when you speak".  That means you're using a different language - specifically defeating the whole purpose of using...well... language.

An API exists for the express purpose of providing a means by which one piece of code can speak to another piece of code.  A public API (which one must rely on to write software in a language like Java, or using someone's supplied libraries) is a protocol - a language (or a specific subset of language, anyway - a local slang, if you will).   Letting us copyright an API strains credulity to the breaking point.

So I'm not convinced it's all over and our industry can breathe again, but at least we can catch a short breath (until appeals happen, and until this fair-use exemption is used as precedent in other cases).

Wednesday 9 March 2016

Keep maven builds safe from "M.A.D. Gadget" vulnerability

Coming out of blogging retirement to point at a rather big issue, and to contribute to solving it a bit.

Per this blog article from Nov 2015 there is a rather large security vulnerability observed within the apache commons-collections library versions 3.0, 3.1, 3.2, 3.2.1, and 4.0. In the spirit of the fact that vulnerable classes are called "gadgets", a colleague of mine referred to this as the M.A.D. Gadgets bug. In essence, classes which reference Apache commons' vulnerable versions and perform serialization can effectively turn the entire JVM into a remotely exploitable exec() function (metaphorically speaking).

While people are busy swapping out vulnerable versions for newer ones, the way dependency graphs work in automated dependency management systems like maven, ivy, gradle, etc, is that a project might be obtaining vulnerable versions from a transitive dependency. That's bad bad bad. So, apart from updating deps, it's important to guard against a recurrence, and you can do that, at least in maven, via the maven-enforcer-plugin.
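To make the transitive problem concrete, here's a small sketch of the kind of check involved: walk the whole dependency graph and report every path that ends at a banned artifact. To be clear, this is not how maven-enforcer-plugin is implemented; Artifact, findBanned, and the example graph are all invented for illustration.

```kotlin
// Illustrative model of a dependency coordinate.
data class Artifact(val group: String, val name: String, val version: String)

// Depth-first walk that returns the full path to every banned artifact,
// so you can see which upstream dependency dragged it in.
fun findBanned(
    root: Artifact,
    dependenciesOf: (Artifact) -> List<Artifact>,
    isBanned: (Artifact) -> Boolean
): List<List<Artifact>> {
    val hits = mutableListOf<List<Artifact>>()
    val seen = mutableSetOf<Artifact>()
    fun visit(node: Artifact, path: List<Artifact>) {
        val here = path + node
        if (isBanned(node)) hits.add(here)
        if (!seen.add(node)) return // already expanded; also guards against cycles
        dependenciesOf(node).forEach { visit(it, here) }
    }
    visit(root, emptyList())
    return hits
}

fun main() {
    val app = Artifact("com.example", "app", "1.0")
    val lib = Artifact("com.example", "some-lib", "2.3")
    val cc = Artifact("commons-collections", "commons-collections", "3.2.1")
    // The app never mentions commons-collections; it arrives via some-lib.
    val graph = mapOf(app to listOf(lib), lib to listOf(cc))
    val bannedVersions = setOf("3.0", "3.1", "3.2", "3.2.1", "4.0")

    val hits = findBanned(app, { graph[it].orEmpty() }) {
        it.name == "commons-collections" && it.version in bannedVersions
    }
    hits.forEach { path -> println(path.joinToString(" -> ") { "${it.name}:${it.version}" }) }
    // prints "app:1.0 -> some-lib:2.3 -> commons-collections:3.2.1"
}
```

The point is just that the offending artifact shows up two hops away from anything declared in your own build file.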

I threw together this github gist with an example configuration that can ban this dep from your deps graph, including (most importantly) inclusions via transitive dependencies that you didn't even know you had. Here is the gist's content (I'll try to keep both updated if I change them).



<project>
  <!-- ... -->  
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-enforcer-plugin</artifactId>
        <executions>
          <execution>
            <goals><goal>enforce</goal></goals>
            <configuration>
              <rules>
                <bannedDependencies>
                  <excludes>
                    <exclude>commons-collections:commons-collections:[3.0,3.2.1]</exclude>
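                    <exclude>commons-collections:commons-collections:4.0</exclude>
                    <!-- the 4.x line is published under org.apache.commons:commons-collections4 -->
                    <exclude>org.apache.commons:commons-collections4:4.0</exclude>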
                  </excludes>
                </bannedDependencies>
              </rules>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

So... fix your projects... but also make them better. Throw this into your project's parent pom, and then it will give you a build-breaking knowledge of whether you're vulnerable, and you can update your deps or use dependency exclusions to prune out any found occurrences you get from your upstream.

Edited to include all affected versions, and re-published from my original article in March, because blogger.com's url management is frustrating.

Tuesday 20 August 2013

Wow, I suck at blogging.

I just discovered, because of a bug filed on the Guice project, that my blog DNS settings were pointing at a domain parking page. Whoops! When I made the transition from Canadia to Amurika this year I totally neglected to fix up some domain settings. My bad.

But it sort of highlights that I don't have a good discipline around blogging. Examining this, I see about ten draft posts in my blogger account, that I never got around to posting, and which are now sort of obsolete and irrelevant. I'm not so narcissistic as to think that everyone cares what I have to say, but I provide zero value by never writing at all. :( So, sorry for that. I look with irony at my post of a few years ago stating that "I'm totally gonna blog again, now, promise!" Apparently not.

Life has been crazy - being at Google is a whirlwind. It's exciting, stressful, but also charming. You get very much "dug in" in certain ways, but most of those ways aren't awful - they just occupy your attention.

What IS wonderful is that I've been able to work on primarily open-sourced projects, being a part of the java core-libraries team. This has meant working on the google core libraries offering Guava, dependency-injection frameworks such as Guice, and Dagger, as well as a host of smaller projects such as Auto (let robots write it!) and contributing to my little testing/proposition framework, Truth. Seeing these things evolve, sometimes in response to each other, has been wonderful. And I get paid to do it! Why? Not because Google is purely altruistic, though Googlers seem to have a really strong bent towards contributing back. But these things really help Google develop world-class software at-scale.

I was in a little internal un-conference of core-librarians of the various supported languages, and my boss pointed out that the fact that we HAVE core libraries and tooling efforts like this is a major contributor to Google's competitiveness and capability. We fund people to figure out what patterns people use and what code developers re-write on every project, and to create hardened, efficient, common implementations/APIs/frameworks out of them, where appropriate. We don't try to re-use everything, but we dig and see where we can "maximize the productivity of labour" (to borrow from the economists) of our colleagues by reducing their coding and testing burden to focus on the things that make their application unique from others. In short, we invest in future production, in future capacity for our developers, both in quality and velocity.

Often, we aren't writing tons of code to do it, but rather examining patterns and tweaking, deprecating certain approaches in favor of others, and using the rather stunning tooling we've evolved (blogged about elsewhere) and tools we've open-sourced, to migrate our entire code-base from deprecated forms to new ones. But we also consider new approaches, libraries, and frameworks, both developed internally and externally. It's actually remarkable (to me) that a company this big can change direction so quickly, and adapt to new realities so quickly. The joke among my team is that we're starting to be less a core-libraries team, and more of a javac-enhancement team, since we are also doing a lot of building in static analysis and checks (thanks error-prone folks) into our tooling to prevent error at compile time as we are building new frameworks and tools.

While we've had a few false starts here and there, we are increasingly engaging in joint projects and accepting contributions into the codebase from external parties who benefit from the open-source work as well, which is gratifying. Nothing quite so happy as win-win exchanges.

All told, it's been a couple of years of full engagement, and not a lot of time to do tech blogging. But I'll give it another go, even if it's just to do updates like this from time to time. It's the best job I've had to date, and I am thrilled to be in contact with such high-quality professionals as I am on a daily basis.