I get asked a lot to explain practical data modelling in NoSQL. In this blog entry I compare and contrast the well-understood world of relational modelling with that of an aggregate NoSQL database like MarkLogic…
Tables vs Aggregates
The first major difference between MarkLogic and Relational systems is that MarkLogic is an aggregate database. This means rather than storing tables and defining relationships between their rows, we instead construct a document to hold a cohesive set of information. In the case of a website that accepts orders, we may store an entire order as a single document.
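As a sketch of what that single aggregate might look like (every element name, URI and value below is invented for illustration, not a MarkLogic requirement):

```xquery
xquery version "1.0-ml";

(: Insert one order as a single aggregate document. All element
   names and values here are invented for illustration. :)
xdmp:document-insert("/orders/1234.xml",
  <order id="1234">
    <account id="u42"><name>Jane Smith</name></account>
    <address><street>1 High St</street><city>London</city></address>
    <items>
      <item sku="p-88"><title>Widget</title><qty>2</qty><price>9.99</price></item>
    </items>
    <payment method="card"><last4>4242</last4></payment>
  </order>)
```

One document insert captures what the relational model below spreads across half a dozen tables.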
In the relational world we may instead store this as a table for Order, one for OrderItem, one for Products, one for Address, one for Payment, one for Account. We would then use a series of joins to fetch the relevant information for each particular operation.
Aggregates have the advantage that they are more logical to application developers, who tend to think in terms of object graphs. Typically when working on an order you would want access to many of the above ‘objects’, e.g. for an ‘order details’ screen. Many a project has thought ‘ah, let’s add another field here’ and had to update its SQL as well as its application-level code (as the SQL originally didn’t pull back ‘all’ fields). With aggregate databases you don’t have that problem.
Search vs. Query
In the relational world you use where clauses to restrict which information is brought back from one or more tables, and relationships and joins to pull back related sets of information. In XQuery you have ‘where’ clauses too… but you tend not to use them! You instead use predicates, so the information is not mapped (i.e. pulled) into memory to then be evaluated. Instead the predicates resolve against an index rather than touching the documents. (The same is true of cts:*-query, as you’ll see below.)
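A minimal illustration of the difference (the element names are invented; the predicate and cts:search forms can be resolved against indexes rather than loading each document to evaluate it):

```xquery
(: Avoid: the FLWOR where clause pulls orders into memory to test them :)
for $order in fn:doc()/order
where $order/status = "shipped"
return $order

(: Prefer: a path predicate, which MarkLogic can optimise... :)
fn:doc()/order[status = "shipped"]

(: ...or an explicit index-backed search :)
cts:search(/order, cts:element-value-query(xs:QName("status"), "shipped"))
```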
Some NoSQL databases suck at this. They only allow you to query by a single field, so you then have to process the information in code to restrict what is returned. I find this utterly amazing in the 21st century. They also don’t come with search engines for full text search, or advanced functionality like geospatial queries.
MarkLogic was built with search in mind. This means that rather than indexing your data in the database and then maintaining a separate (and duplicated) set of indexes for a search engine, you have a single set of indexes to query against. This makes combining property-value and textual searches a breeze. You can even construct a single and-query that expresses: “Give me all documents containing a synonym of ‘destroy’ that were created in the last week and that mention a place within a 10 mile radius of this lon/lat.” And you can do this within a single range index resolution, i.e. without querying then filtering the information in application code. It’s fabtastic! (In fact you can do even more sophisticated things, but the English sentence structure to go with it would be very convoluted!)
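A hedged sketch of that combined query (the element names, coordinates and the required range and geospatial indexes are all assumptions; full synonym expansion would use MarkLogic’s thsr thesaurus library to widen the word-query):

```xquery
cts:search(fn:doc(),
  cts:and-query((
    (: the word itself; a thesaurus lookup would add its synonyms :)
    cts:word-query("destroy"),
    (: created in the last week: needs a dateTime range index on <created> :)
    cts:element-range-query(xs:QName("created"), ">=",
      fn:current-dateTime() - xs:dayTimeDuration("P7D")),
    (: within 10 miles of a lat/lon: needs a geospatial element index :)
    cts:element-geospatial-query(xs:QName("location"),
      cts:circle(10, cts:point(51.5, -0.12)))
  )),
  "unfiltered")
```

The whole thing resolves against indexes in one pass; no post-filtering in application code.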
It’s fair to say that you can stick a text search engine on to any NoSQL, or indeed any relational, database. None of these have MarkLogic’s built-in capability though, and none can do it within a single installed product like ours can. Certainly none can combine the above query like MarkLogic can, or as efficiently. If you have complex query needs over a massive (potentially petabyte-scale) database, you should look at MarkLogic. (Especially if you want to do this on commodity hardware.)
Denormalisation vs. Normal Forms/Relationships
I remember being tortured with Normal Forms in my university database classes. Never store information in more than one place! Use primary and foreign keys! Gaaaaaah! To be fair, they do serve their purpose in a relational database and save you effort down the line. It does require a lot of up-front data modelling though. With MarkLogic you can just ingest XML (or indeed XHTML or JSON-as-XML) data – as-is – and then query it. If the schema changes, just update your query, not the database.
This is great for data storage – until you need to run a query over a set of tables with some of the query terms in each. This quickly gets very hairy in relational systems. So much so that it is not unusual to define a whole host of views to aggregate data together, usually to improve query performance. This of course costs disk space. No-one is particularly bothered though: if you need the speed, you eat the space.
This is also true of MarkLogic. Occasionally you need to query across document boundaries. Consider the above example: you may want to summarise all orders being sent to customers who have been registered for over 5 years. In this case you may create a meta-document just for efficient queries. In it you store the data you query on, and pointers to the source documents (their URIs or application IDs). You would push in order ID, order item ID, order item category, order user ID, order user-years-member, and any other queryable data.
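A sketch of such a meta-document (all element names, URIs and the collection name are assumptions made for illustration):

```xquery
xquery version "1.0-ml";

(: A denormalised 'Order User' meta-document: just the queryable
   fields plus URI pointers back to the source documents. All names,
   URIs and the collection are invented for illustration. :)
xdmp:document-insert("/meta/order-user/1234.xml",
  <order-user>
    <order-id>1234</order-id>
    <order-uri>/orders/1234.xml</order-uri>
    <item-id>p-88</item-id>
    <item-category>widgets</item-category>
    <user-id>u42</user-id>
    <user-uri>/users/u42.xml</user-uri>
    <user-years-member>7</user-years-member>
  </order-user>,
  xdmp:default-permissions(),
  "order-user-meta")
```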
You do this because the query response object you want is ‘Order Users’ so you can look at data across those entities. In this case creating a meta document denormalised to the ‘Order User’ level helps. This means you can do a simple predicate or cts:search over those meta documents, rather than a complex join or XQuery within XQuery within XQuery. It is the NoSQL equivalent of a relational view.
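With meta-documents of that shape, the ‘registered for over 5 years’ question collapses to a single search (this sketch assumes an int range index on a user-years-member element and invented element/collection names):

```xquery
(: One index-resolved search over the meta-documents, no joins.
   Assumes an int range index on <user-years-member>; names invented. :)
for $meta in cts:search(fn:collection("order-user-meta"),
  cts:element-range-query(xs:QName("user-years-member"), ">", 5))
return fn:doc($meta/order-user/order-uri)
```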
So you may have a user’s information stored in the database 10 times, in 10 meta-documents. This is why it’s called denormalisation. It’s a strange concept to get used to if you’re from a relational mindset, but it has several advantages.
Firstly, it makes query code very, very simple. Everything becomes an and-query rather than a load of XQuery.
Secondly, if you want to do a query like “Show each user’s most ordered item, ordered by item value” then it becomes a “query – order by – filter – order by” rather than a nightmarish set of SQL or view creation. You do a query that pulls back all User Orders ordered by username ascending, then by order quantity descending. You then take the first result for each user (i.e. filter by first-username-encountered, to give the largest-quantity item per user). You then sort the remainder by order item price, descending.
You can do this within a single index resolution using a User-Order denormalisation. You can’t do this efficiently in a relational db unless you create a separate, query-specific view (you’d need some god awful ‘IN’ query). You also can’t do this with a simple aggregation in any aggregate database without actually executing multiple queries (even if just under the hood). With MarkLogic you use the User-Order denormalisation indexes in a single index resolution, then add each document reference to a map:map() instance. This doesn’t yank the document into memory at this stage, but instead holds a reference, making the subsequent filtering and post-sorting (and perhaps also result paging) highly performant.
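The steps above can be sketched in XQuery under stated assumptions (the user-order-meta collection, the element names, and range indexes on username, quantity and item-price are all invented; cts:index-order requires those range indexes):

```xquery
(: 1. Query – order by: one index resolution over the denormalisation :)
let $results :=
  cts:search(fn:doc(),
    cts:collection-query("user-order-meta"),
    (cts:index-order(cts:element-reference(xs:QName("username")), "ascending"),
     cts:index-order(cts:element-reference(xs:QName("quantity")), "descending"),
     "unfiltered"))

(: 2. Filter: keep only the first hit per user – given the ordering,
   that is each user's most-ordered item. The map:map stores document
   URIs rather than full results. :)
let $per-user := map:map()
let $_ :=
  for $doc in $results
  let $user := fn:string($doc/order-user/username)
  where fn:empty(map:get($per-user, $user))
  return map:put($per-user, $user, fn:base-uri($doc))

(: 3. Order by again: sort the survivors by item price, descending :)
for $uri in (for $u in map:keys($per-user) return map:get($per-user, $u))
order by xs:decimal(fn:doc($uri)/order-user/item-price) descending
return $uri
```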
It takes a bit of skill to get to this point. (And indeed one could call this up-front data modelling! Shock horror. I thought we were schema agnostic? There’s a time and place for modelling in NoSQL.) The point is you’re not altering the original documents – you’re just creating an efficient meta-document full of references and indexed data that solves a variety of complex query requirements. These meta-documents are also small because they don’t try to duplicate all the data, unlike a relational view, which indexes again all the data that the query needs to return.
As a corollary I’d like to point out that I actually like SQL as a query language. It’s not SQL’s fault that the relational model forces you to bake some truly awful joins and ‘IN’ queries into your SQL. It’s a result of your model. On many occasions you may want to create a relational data table view for analysis, e.g. by a BI tool. This is a damn sight more desirable and easier than updating every BI tool for every type of aggregate database, or writing horrendous SQL code.
In the relational world you often have to create a separate data mart anyway, which is wasteful of resources. In MarkLogic we efficiently store indexed properties in an in-memory, columnar-database fashion. This makes resolution across fields very efficient, and also horizontally scalable via multiple nodes running Map-Reduce-style query processing.
What you can also now do in MarkLogic 6 is define a relational view from these indexes, and make that available via an ODBC connection. This gives you the best of both worlds. You can create relational views for your SQL / BI tool users, and aggregates for your application developers. These aggregates can be created automatically using a pre-commit DB trigger too, so the denormalisations and main documents are always in sync (“Death to eventual consistency!” cried the masses). The real benefit, though, is doing this in a single system with a single set of indexes. No more waiting 24 hours until the overnight job loads the previous day’s trading activity into the BI data mart. Instead it’s real time.
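A hedged sketch of such a pre-commit trigger module (the URIs, element names and collection are assumptions; the trigger itself would be registered against the content database using the trgr:create-trigger functions):

```xquery
xquery version "1.0-ml";

(: Pre-commit trigger module: whenever an order document is created or
   updated, rebuild its denormalised meta-document in the same
   transaction, so the two can never drift apart. Element names and
   URI conventions are invented for illustration. :)
import module namespace trgr = "http://marklogic.com/xdmp/triggers"
  at "/MarkLogic/triggers.xqy";

declare variable $trgr:uri as xs:string external;

let $order := fn:doc($trgr:uri)/order
let $user  := fn:doc(fn:concat("/users/", $order/account/@id, ".xml"))/user
return
  xdmp:document-insert(
    fn:concat("/meta/order-user/", $order/@id, ".xml"),
    <order-user>
      <order-id>{fn:string($order/@id)}</order-id>
      <order-uri>{$trgr:uri}</order-uri>
      <user-id>{fn:string($order/account/@id)}</user-id>
      <user-years-member>{fn:string($user/years-member)}</user-years-member>
    </order-user>,
    xdmp:default-permissions(),
    "order-user-meta")
```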
Hopefully the above has given you an insight into a couple of interesting data modelling use cases and solutions. They should certainly help you identify where you may apply a NoSQL solution, and in particular MarkLogic. They should also show that it is sometimes a good idea to look at your queries first when choosing how to store your data, as opposed to storing it in the most ‘efficient’ (normalised) way possible. Indeed, the above examples should show you that relational systems are not as normalised as you think, once you take views and data marts into account. Think “returned query result objects first, storage model second”.
If you have another use case or problem you’ve found in the relational world that causes a lot of application code or complex workarounds in modelling then please leave a comment. You never know, it may make a good use case for an aggregate database like MarkLogic.