NoSQL Distilled book review…

I read Martin Fowler’s (not a family relation) book ‘NoSQL Distilled’ with great anticipation. I found a few things that did not ring true for me though, so I decided to write them down. Be sure to read this post if you have read, or plan to read, Martin’s book…

Firstly, there is the overview of the book on his website. Here he has the following quote:-

“‘The rise of NoSQL databases marks the end of the era of relational database dominance.’ – But NoSQL databases will not become the new dominators. Relational will still be popular, and used in the majority of situations. They, however, will no longer be the automatic choice.”

I think when Martin wrote this these statements were probably accurate. The thing is though, as I’ve commented before, the many types of NoSQL database will merge together into single products. Think of MarkLogic 7 and the introduction of a triple store alongside our document store and search engine, and the fact we hold indexes as in-memory column stores, and can create SQL views (accessible via ODBC) over these range indexes.

This gives MarkLogic as the leading NoSQL vendor a very wide range of data it can store, which is not true of the current generation of RDBMS.

The other problem I have is with the claim that ‘Relational will still be… used in the majority of situations’. I think this is unlikely. Way back in early 2006 there was an often-quoted study that said over 80% of organisational data was unstructured. This must be higher now. This can only be handled at a fine-grained level (i.e. delving inside the content, not just the metadata) by using a schemaless repository, with support for content text extraction, entity enrichment, search and Enterprise-grade reliability.

You could argue that the current generation of ECM solutions haven’t displaced RDBMS, so why would NoSQL? Well, all ECM vendors use spinning disc folders for content and RDBMS for metadata – so they inherently rely on RDBMS technology. Also, they rarely ‘understand’ the structure of what they are capturing. Like RDBMS they don’t understand XML formats automatically. In effect you’re using a pre-defined schema for metadata, and a dumb BLOB for the content. You’re not inherently understanding the internal structure (if any) of a document, or providing access to all of its content at a fine-grained level.

NoSQL systems, and MarkLogic in particular thanks to its 200+ binary filters and support for XML, JSON, text and binary files, do provide this. For the first time organisations have within their grasp a database with all the Enterprise features they’re used to from the RDBMS world, but in a single product separate from RDBMS. They finally have the capability to store and understand all their content.

Now that they can do this, they will eventually all use NoSQL databases for this reason, at least somewhere in their organisations. The only significant barrier at the moment is the lack of applications built solely on NoSQL to enable this. This is changing though, with independent ECM and search platform vendors now also supporting NoSQL stores underneath their applications.

So, my question for Martin, and indeed the rest of you, is this: If customers can now manage 80%+ of their data in NoSQL, why wouldn’t these software solutions take 80%+ of the database market to become the dominant technologies by 2020?

“There are many features of relational databases, such as security, that are less useful to an application database because they can be done by the enclosing application instead”

Security done in the application. Seriously? Go into any public sector organisation and meet their security engineers. Especially in the UK. You won’t last 5 minutes in the room without being ripped apart. MarkLogic probably has the strongest and most developed security infrastructure of any NoSQL database. We built it into the product from the start. This is because of all our USA government customers.

Even though we’ve proven this time and time again I still get the Spanish inquisition from security engineers, and rightly so. It’s their job to make sure the infrastructure is secure, and secure in depth. You can’t leave this to the whim of an application developer.

I would also argue that application developers suck at this. They’re thinking about features, not security. Just look at Microsoft applications and the bugs in those that leave the entire OS vulnerable.

Leaving security, amongst other things, to application developers is a recipe for disaster.

The myth of polyglot persistence

This is probably going to sound a little bit heretical. It is a commonly held belief currently that we’re moving to a world where an application will access several data stores in order to efficiently manage all of its information for a single application. This is termed ‘polyglot persistence’ and has been popularised by Martin Fowler and others.

I would argue, however, that this is an effect of the current NoSQL software market rather than a motivation for the development of that market. Take this simple example. I’ve had a couple of chats with customers where they’ve used a relational system for data persistence, then something like MongoDB for a caching layer. Mongo would admit they don’t fully support ACID transactions, and that MongoDB can lose your data entirely. A caching layer though can be flushed if need be, so that’s not a big issue.

This sounds all perfectly reasonable and dandy. Until you consider the following question. “If you could provide a system to your customers, in a single product, that was ACID compliant, highly scalable, and could handle both these tasks, would you consider using that instead?”

You immediately start to think of all the plumbing you’ll have to create between the two systems you were originally considering, so of course the answer to the above is “Well, Yes”.
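To give a feel for that plumbing, here is a minimal Python sketch of the glue code an application ends up owning when it puts a cache in front of a system of record. Everything in it is hypothetical: PrimaryStore and CachedRepository are names made up for illustration, and plain dicts stand in for the relational database and the caching layer. It is a sketch of the general pattern (read-through with invalidation on write), not how any particular product works.

```python
class PrimaryStore:
    """Stand-in for the relational system of record (a dict, for illustration)."""

    def __init__(self):
        self._rows = {}
        self.reads = 0  # counts how often the slow store is actually hit

    def get(self, key):
        self.reads += 1
        return self._rows.get(key)

    def put(self, key, value):
        self._rows[key] = value


class CachedRepository:
    """Hypothetical glue layer: the code you maintain between the two systems."""

    def __init__(self, store):
        self.store = store
        self.cache = {}  # stand-in for the separate caching product

    def read(self, key):
        # Serve from the cache when possible, otherwise fall back to the store.
        if key in self.cache:
            return self.cache[key]
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        # Write to the system of record first, then drop the stale cache entry.
        self.store.put(key, value)
        self.cache.pop(key, None)


store = PrimaryStore()
repo = CachedRepository(store)
repo.write("customer:1", {"name": "Ada"})
first = repo.read("customer:1")   # goes to the primary store
second = repo.read("customer:1")  # served from the cache
```

Even this toy version has to decide when to invalidate and what happens if the two systems disagree, and that is before you add failure handling. That is the work a single product would take off your hands.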

If you take this argument further and apply it to different data models rather than different functional aspects of the system, then consider this question: “If you could handle aggregate/document (XML/JSON/other), triples, in memory column stores for fast analytics, and structured information, in a single product, would you consider it?”

You may be standing there not quite believing that person, because it’s such a new concept to get your head around – one database to rule them all, in fact – but of course the answer will be “Yes, I’d consider it”. (…But you’d best damn well prove it!)

Once a NoSQL database proves this over and over again (give us at MarkLogic a couple of years!) then the polyglot persistence myth will start to die… And Larry Ellison may start crying in to his Cheerios.

“The document model makes the aggregate transparent to the database allowing you to do queries and partial retrievals. However, since the document has no schema, the database cannot act much on the structure of the document to optimise the storage and retrieval”

This depends massively on the document type. In many applications there is a tonne of XML flying around. There are also many situations now with JSON or CSV, or some other document with some ‘structure’, even though the meaning of that structure (e.g. <placename>Sheffield</placename>) is not understood by the database.

MarkLogic stores documents as a compressed tree structure, and has a universal index for the structure (element namespaces and names), text (stems, words, etc.), and values. This means that even though you’ve not associated a schema with data types, the entire document is in fact indexed: its full text, which elements exist within it, and the values of those elements. So if the application understands how to structure the query, then even though the document database itself doesn’t know what a <placename> is, you can still query for documents with a particular placename value.
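The idea is easier to see with a toy. The Python sketch below is not MarkLogic’s implementation or API; ToyUniversalIndex and its method names are invented for illustration, and it ignores namespaces, stemming and compression entirely. It only shows the principle: index element names, words and element values as documents arrive, with no schema declared up front.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


class ToyUniversalIndex:
    """Illustrative schema-free index over XML documents (hypothetical)."""

    def __init__(self):
        self.by_element = defaultdict(set)  # element name -> document ids
        self.by_word = defaultdict(set)     # lowercased word -> document ids
        self.by_value = defaultdict(set)    # (element name, value) -> document ids

    def add(self, doc_id, xml_text):
        # Walk every element; nothing about the structure is declared in advance.
        root = ET.fromstring(xml_text)
        for elem in root.iter():
            self.by_element[elem.tag].add(doc_id)
            text = (elem.text or "").strip()
            if text:
                self.by_value[(elem.tag, text)].add(doc_id)
                for word in text.lower().split():
                    self.by_word[word].add(doc_id)

    def element_value_query(self, name, value):
        # The application supplies the meaning; the index only matches name and value.
        return sorted(self.by_value.get((name, value), set()))

    def word_query(self, word):
        return sorted(self.by_word.get(word.lower(), set()))


index = ToyUniversalIndex()
index.add("doc1", "<visit><placename>Sheffield</placename><note>Steel city tour</note></visit>")
index.add("doc2", "<visit><placename>Leeds</placename><note>City museum</note></visit>")

sheffield_docs = index.element_value_query("placename", "Sheffield")
city_docs = index.word_query("city")
```

Here sheffield_docs contains only doc1 and city_docs contains both documents, even though the index was never told what a placename is.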

In effect, don’t buy / use a document database if it doesn’t enable you to store ‘as-is’ and give you some help on the query side. No other NoSQL database currently does this, I believe, because a search engine is plugged on top as an afterthought, and is wholly reliant on the database telling it what structure the document has. This is a unique selling point of MarkLogic.

Conclusion

Although it may not be obvious from the above, I really, really like this book. It will certainly help a lot of our customers understand the types of NoSQL database out there and the concepts underpinning their design. Unfortunately Martin seems to have just accepted the mantra of the NoSQL movement generally. The above quotes are illustrations of that.

I always like to ask “Why not?” to any tech religious statement. (And especially of people who believe their own marketing.) Often the answers reveal the most interesting aspects of the technology. This question helps you find its boundaries, where it is and is not applicable, and starts to lead your mind into a world of possibilities.

I think the drawback to the NoSQL world generally is that there is only one commercial software house – MarkLogic. Sure there are services consultancies that will sell you an ‘Enterprise version’ with a few extra features, but it’s hardly Enterprise software sales – their profits are made through consultancy, not adding lots of great features to resell. A lot of commentators have concentrated on the status quo of the Open Source movement. There’s a lot of innovation out there in private companies on mission critical applications. MarkLogic is being used to power them.

To accept the NoSQL hype as truth is to miss a great deal of innovation, and potentially to waste a lot of your time and effort. Question everything.

3 comments

donald butler says:

May 6, 2013 at 20:44

Couldn’t agree more – ECM is an outdated approach

ayache khettar says:

May 7, 2013 at 12:12

I recently attended a NoSQL conference in Cologne (Germany) and there was not a single word about MarkLogic. I don’t blame you for writing such an article, because you’re coming from a MarkLogic employee’s perspective. Big players like MarkLogic, Oracle, IBM and Microsoft will tap into the market share of the NoSQL world, but how successful they will be at that remains to be seen. NoSQL is a movement, not a product, and it is driven by the open source community. Small and medium companies will be more likely to try these open source products to see if they fit their use cases, rather than buying a licence to try something out.

Good luck to you guys.

1. adamfowleruk says:
  
  May 9, 2013 at 16:24
  
  Hello. First thanks for your comment. Always good to hear from people. It’s not surprising you haven’t heard about us from a NoSQL conference. We actually don’t attend many of those events. We have customers in Germany that sing our praises, including Springer. You’ll also see us at the Book Fayre every year. In Europe most of our customers are in Media, so we go to Media events.
  
  I disagree that MarkLogic is only tapping in to the buzz around NoSQL (but thanks for calling us a ‘Big Player’! Certainly in revenue terms in the NoSQL space we are). Our product has been around for 12 years, twice as long as any other Open Source NoSQL product. These are currently not Enterprise ready. Experience has shown that when a system is mission critical (as opposed to a simple caching layer) then customers want the reassurance of a product with Enterprise class features, like MarkLogic.
  
  NoSQL suffers from being a really bad term for anything. You correctly identify it as a movement – toward horizontally scalable databases on commodity hardware. Just so happens there are 4 core types of NoSQL database, one for each non-relational data type: Graph, Key-Value, Columnar and Aggregate (aka Document). Although a schema-agnostic approach is common, non-aggregate stores still require data mappings to be built.
  
  Really a better term for the products, rather than the underlying architecture, would be a ‘natural model database’ – where the content is added as-is, with no up-front schema definition. It’s really a coincidence that they feature horizontal scale out. I’m sure if you were to create an RDBMS from scratch these days that this would be a feature of such a product too. (Indeed there are already projects doing just this.)
  
  Again thanks for your comment, and I hope you add comments to future posts too.
