Sunday, 12 April 2009

Re-thinking Object-Relational Mapping in a Distributed Key-Store world

JDO, JPA, Hibernate, Toplink, Cayenne, Enterprise Object Framework, etc., are workable object-to-datastore mapping mechanisms. However, you still have to optimize your approach to the underlying data system, which historically has usually been an RDBMS. Hence these systems, their habits and idioms, are all quite strongly tied to Object-Relational Mapping. Since Google App Engine released a Java app hosting service this week, with its datastore wrapped in JDO or JPA via DataNucleus, people have been playing with it, and the difficulty of shaking these very habits has become clearer. It's easy, when using JDO or JPA with BigTable, to design as if we're mapping to an RDBMS, which a distributed column-oriented DBMS like BigTable is not (I'll sketch what that trap looks like in code, just after the links). There are some good articles on the net about how to think about this sort of data store:

http://www.mibgames.co.uk/2008/04/15/google-appengine-bigtable-and-why-rdbms-mentality-is-harmful/

http://torrez.us/archives/2005/10/24/407/

http://highscalability.com/how-i-learned-stop-worrying-and-love-using-lot-disk-space-scale
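
To make the trap concrete, here is a minimal sketch of the kind of mapping that feels natural coming from an RDBMS. The BlogPost/Comment/Author names and fields are hypothetical, made up for illustration; the JDO annotations and the Key type are the standard App Engine/DataNucleus ones. Everything is normalized, each fact lives in one place, and relationships are expressed as key references, exactly as you would design foreign keys:

import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.IdentityType;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;

import com.google.appengine.api.datastore.Key;

@PersistenceCapable(identityType = IdentityType.APPLICATION)
class BlogPost {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    Key key;

    @Persistent
    String title;

    // "Foreign key" thinking: point at the Author entity and look it up later.
    @Persistent
    Key authorKey;
}

@PersistenceCapable(identityType = IdentityType.APPLICATION)
class Comment {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    Key key;

    // References we would normally JOIN across in SQL.
    @Persistent
    Key postKey;

    @Persistent
    Key authorKey;

    @Persistent
    String body;
}

This compiles and deploys happily, which is exactly the problem: the annotations look the same as they would against MySQL, but the store underneath won't let you join those references back together at query time.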

I'm struggling with this myself, being a long-time (longer than most) user of Object-Relational mapping frameworks. For example, with BigTable you cannot join across relationships to pre-fetch child object data, a common optimization against an RDBMS. Key-stores are bad at joins because they're sparse and inconsistently shaped: the contents of each "row" may not have the same "columns" as each other, so building indexes to join against is difficult. We actually need to re-think normalization, because it's not the same kind of store at all.
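
In practice the "pre-fetch via join" habit turns into one query per relationship hop, or into not needing the extra query at all because the data was duplicated at write time. Here's a sketch of the former, against the hypothetical classes above; PMF is the usual assumed PersistenceManagerFactory singleton wrapper, not anything specific to this post:

import java.util.ArrayList;
import java.util.List;

import javax.jdo.PersistenceManager;
import javax.jdo.Query;

import com.google.appengine.api.datastore.Key;

public class CommentFetcher {
    @SuppressWarnings("unchecked")
    public static List<Comment> commentsFor(Key postKey) {
        PersistenceManager pm = PMF.get().getPersistenceManager();
        try {
            // No join: a second query, filtered on the stored parent key,
            // stands in for what would have been one SQL statement.
            Query q = pm.newQuery(Comment.class);
            q.setFilter("postKey == k");
            q.declareParameters("com.google.appengine.api.datastore.Key k");
            // Copy the results before closing the PersistenceManager so the
            // caller isn't handed a lazily-loaded list backed by a closed pm.
            return new ArrayList<Comment>((List<Comment>) q.execute(postKey));
        } finally {
            pm.close();
        }
    }
}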

Interestingly, OO and RDBMS actually DIDN'T have an impedance mismatch in one key area: Entity-Relationship models bore a striking structural resemblance to Object-Composition or Class diagrams, and the two could be mapped onto each other. Both RDBMS schemata and object models were supposed to be somewhat normalized, except where they were explicitly de-normalized for performance reasons. With distributed key-stores, we're potentially better off storing duplicate objects, or possibly building massive numbers of crazy indices. At this point I don't know what the right answer is, habit-wise, and I work for the darned company that made this thing. It's a different paradigm, and optimizing for performance on a sparse key-store is a different skill and knack. And since we're mapping objects to it, we'll have to get that knack, then work out what mapping habits might be appropriate to that different store.
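
As a sketch of what "storing duplicate objects" might look like in that same hypothetical model: copy the author's display name onto each comment at write time, instead of keeping it only on an Author entity and joining it in at read time. The field names are still mine, not a recommendation:

// (same JDO imports as the earlier sketch)
@PersistenceCapable(identityType = IdentityType.APPLICATION)
class DenormalizedComment {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    Key key;

    @Persistent
    Key postKey;

    // Duplicated from the Author entity when the comment is written.
    // Reads become a single scan over one kind; the price is that renaming
    // an author means re-writing every one of their comments.
    @Persistent
    String authorName;

    @Persistent
    String body;
}

That trade - cheap, join-free reads paid for with fan-out writes - seems to be the recurring shape of the new habits.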

Eventually we will work out the paradigms, with BigTable and its open-source cousins (HBase and Hypertable) bringing more parallelism and horizontal scaling to the "cloud-y web", as I've seen it called. The forums - especially google-appengine-java and the other places where objects and key-stores have to meet - will grow the skills and habits within the community. But we have to re-think it, lest we just turn around, say "that's crap", and throw the baby (performance/scaling) out with the bathwater (no joins/set-theory).