Data Modelling in MarkLogic, and how my mate Norm can help you!…

Data modelling in an aggregate/document NoSQL database can be unfamiliar to most. In this post I mention a couple of techniques and how they can help…

In both relational database systems and NoSQL key-value and columnar stores you have to take a logical application ‘object’ and shred it into one or more flat structures. These are typically table based, with some NoSQL options supporting one or more values per column, or some sort of built-in hash mapping function.

Shredding poses a few problems for application developers. These mainly centre around them having to do work just to store and retrieve their information. As a lazy bunch (I used to be one, so I can get away with that comment!) they don’t like spending time doing this – they prefer to spend time on the interesting and unique stuff. This is normally an issue because organisations want these storage layers to be fast and reliable – so they force their developers (quite rightly) to spend time on this layer too.

What if you didn’t need a layer like this though? Why not just store an aggregate pretty much as is – no matter how complex the internal structure? This is where XML and JSON come in. As hierarchical data formats they can be used to easily map to and from application code. There are many techniques for just taking an object and converting to and from XML and JSON. There’s little, if any, mapping code to write, and these layers are typically open source and so have been tuned to within an inch of their life already.

This is a win-win for both developers and organisations. Developers can spend time on the ‘interesting’ stuff. Project times are cut. More profit (or less cost) is made for the organisation. Developers are happy, and organisations are happy. Also, the ‘interesting’ areas are typically your organisation’s specific business, or your application’s key differentiators – they are your secret sauce, and so it makes sense to spend more time working on them than your competition does.

This is why aggregate/document databases like MongoDB and MarkLogic are popular with developers. MarkLogic projects typically last only 3-6 months, including detailed specification and installation. The BBC’s Olympics architecture was built on MarkLogic in this way, and look how well that worked during the Olympics. The longest Phase 1 development I’ve heard of with MarkLogic is 12 months.

Load as-is

Mapping from application code to XML or JSON is one thing. What about all the other stuff you need to store and retrieve later? MarkLogic can handle storing binary files efficiently, including replicating very large (multi-GB) files across a cluster if need be. It’s logical to store all content for an application in one repository rather than several – an RDBMS for transaction info, an ECM system for office docs, a MongoDB for web app docs – spreading it out adds to the complexity, and of course the maintenance and development cost, of the final application.

We talk a lot about loading ‘as-is’ in MarkLogic. Sometimes this is misinterpreted. I wanted to spend a little time talking about the implications here.

Let’s say you’ve got a MarkLogic system and are loading in docs from SharePoint and application information in XML. They all relate to your customers and so it makes sense to store them in a system where you can say ‘Show me everything we have about customer X’. You can happily store these in MarkLogic. We even have a SharePoint connector to make the process simpler.

Store an XML representation of everything

There is a bit of a problem though. MarkLogic search indexes plain text and XML files (and JSON that is stored as XML, transparently to application code). In order to make the text and properties of binary documents searchable we use the 200+ ISYS filters built into MarkLogic to create an XHTML rendering of each document. This is purely for indexing purposes. You can maintain a link in this document back to the original – just add an XML element called originaldoc or some such in the XHTML head section.
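
To make that concrete, here’s a minimal sketch of what such an XHTML rendering might look like – the originaldoc element name and the URIs are my own illustration, not a fixed schema:-

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Quarterly Report.docx</title>
    <!-- hypothetical pointer back at the URI of the original binary document -->
    <originaldoc>/binaries/quarterly-report.docx</originaldoc>
  </head>
  <body>
    <p>The text extracted from the binary document goes here...</p>
  </body>
</html>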

The advantage of doing this may not be immediately apparent. On the face of it, it looks like you’re just using more storage for the same documents. In fact, this is highly desirable in all but the most basic applications. Most organisations, especially in the Public Sector where I work, do not allow the original document to be altered. In order to make a document searchable, though, you may need to know what content within it represents a customer code, or a date, order number, place, post code, organisation – the list goes on.

The best way to do this is to tag those words. This is easy to do in XML because you can wrap the text with an XML element in another namespace – like <ns1:placename>Scunthorpe</ns1:placename>. This gives you a consistent information tagging mechanism across all your documents – regardless of source type – that can be used for searching. Of course, doing this on the original would mean altering your binary docs, let alone having to figure out how to do the tagging differently for every type of document you store.

Much better, therefore, to create an XHTML (an XML-based form of HTML) rendering and then perform this entity extraction on that. You also then have the option of enriching this information. I could, for example, add longitude and latitude attributes to my placename element. This would mean that rather than search for a list of places I can instead search for a point and radius, or a bounding box, or even draw a polygon on a map and say ‘what do we know about here?’.
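
Here’s a hedged sketch of what an enriched fragment of that XHTML rendering might look like – the entity namespace and the lat/lon attribute names are my own invention, and the coordinates are approximate:-

<p xmlns="http://www.w3.org/1999/xhtml" xmlns:ns1="http://example.org/entities">
  Our new warehouse in
  <ns1:placename lat="53.59" lon="-0.65">Scunthorpe</ns1:placename>
  opens next month.
</p>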

This approach gives you great flexibility in enriching information as the basis of your application. It also means you can improve the enrichment or identify new entities without a rewrite of the system – you leave the original docs alone and just re-run the new extraction script against your old data. Voila! A very rich search application or data repository.

What about relational data?

MarkLogic tends to be used to aggregate related data from many source systems as an initial deployment, prior to being used as the primary data source itself. After all you have to have a working system before you can switch application code to use that system. This tends to mean we get asked about storing ‘relational data’ a lot. This could be some CSV/TSV exports, or even a direct dump of data from an RDBMS.

MarkLogic deals with aggregates – documents – and thus doesn’t need the equivalent of an SQL join. The problem with these data dumps though is that they tend to be one file for an entire table’s data, or one file per row, and are very flat in structure. Consider a relational schema with Order, OrderItem, Item, Customer and Address tables.

For a single order you may have over 10 rows of data spread across tables. Although you may want to store the originals, you more than likely also want to store a composite document that encompasses all information for an ‘Order’. You may even want to create an Order History document that holds all items and quantities a person has ever ordered. These are fundamentally tree structures that make sense to model in a single XML document.
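
As a hedged sketch, an Order History composite might look something like this – the element names and values are purely illustrative, not a prescribed schema:-

<orderhistory customer="cust1">
  <order id="order1" date="2013-05-01">
    <orderitem><code>item1</code><quantity>50</quantity></orderitem>
    <orderitem><code>item2</code><quantity>4</quantity></orderitem>
  </order>
  <order id="order2" date="2013-06-12">
    <orderitem><code>item1</code><quantity>10</quantity></orderitem>
  </order>
</orderhistory>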

Denormalisations

The question is how do you go about creating these composites? You can of course rewrite an app to use MarkLogic directly. Long term this is the least costly approach. Everyone has legacy data though, so let’s discuss how to remodel that, assuming you now have a set of 10 or so flat documents containing the information for a single order.

You effectively need a task that looks at this data and joins it together. In MarkLogic parlance this is called a Denormalisation. You are going from a normalised relational model to a denormalised document model. This may mean you store the same information multiple times – like Item details within Item, composite Order, and composite Customer Order History documents. The advantage though is faster query times. This is in effect the document database version of a relational View. You create a view when a query with horrendous joins is killing query time. No real difference here.

So how to go about creating these things? You could use a batch task – but what if the originating system is in flight and you’re getting dumps periodically? Worse still, what if an Order Item document is committed (as in ACID commit) to MarkLogic before the logical parent Order document?

In this scenario you need to use a trigger or Content Processing Framework (CPF) pipeline to initiate the job. You may create one per target denormalisation doc. When it fires, you load the information related to the doc that has just been added, and do a sanity check to ensure all the information required for your denormalisation is present. You’ll then test the hell out of it to make it performant.
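
To make that concrete, here’s a minimal sketch of a post-commit trigger module that assembles a composite Order whenever an order row document arrives. The element names (order-row, order-item-row), the order-id attribute, the collections and the URI scheme are all assumptions for illustration – adapt them to your own source data:-

xquery version "1.0-ml";
import module namespace trgr = "http://marklogic.com/xdmp/triggers" at "/MarkLogic/triggers.xqy";

(: the URI of the document whose insert fired this trigger :)
declare variable $trgr:uri as xs:string external;

let $row := fn:doc($trgr:uri)/order-row
let $order-id := fn:string($row/@id)
(: fetch every order item row that points back at this order (hypothetical markup) :)
let $items := cts:search(fn:collection("order-item-rows"),
  cts:element-attribute-value-query(xs:QName("order-item-row"), xs:QName("order-id"), $order-id))
return
  (: sanity check: only build the composite when the parent order row is present :)
  if (fn:exists($row)) then
    xdmp:document-insert(
      fn:concat("/composite/order-", $order-id, ".xml"),
      <order id="{$order-id}">
        <deliveryaddress>{$row/deliveryaddress/text()}</deliveryaddress>
        { for $item in $items/order-item-row return <orderitem>{$item/node()}</orderitem> }
      </order>,
      xdmp:default-permissions(),
      "composite-orders")
  else ()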

You will of course want to be careful. You wouldn’t want to accidentally create a trigger storm, where the creation of a denormalisation document causes another trigger to fire, then another – the MarkLogic equivalent of a fork bomb. (Remember these from your Computer Science university days, when a friend would remotely log in to your machine and execute his little special programme as you were working on an essay? Yes Matt H, I’m looking at you!)

Partial Normalisation

As well as denormalising shredded documents into a single composite, there may be scenarios where you want to go the other way. Consider our order composite document:-

<order id="order1">
  <deliveryaddress>1 Paddington Way, London</deliveryaddress>
  <orderitem>
    <code>item1</code>
    <quantity>50</quantity>
  </orderitem>
  <orderitem>
    <code>item2</code>
    <quantity>4</quantity>
  </orderitem>
</order>

MarkLogic has an ODBC server where you can set up relational views over unstructured data. This is a great way to use your existing BI tool to query not only a relational data warehouse, but also your live, operational unstructured MarkLogic database. This works by each ‘column’ referring to a range index set up on a particular element, attribute or path in XML stored in MarkLogic.

A ‘row’ in the view represents a document (more accurately, a fragment – I’ll leave that for another day) that has a value in the range index for every required ‘column’. Some columns can of course be nillable, and thus optional. This is great, but does occasionally lead to unexpected results. In the above document, for example, range indexes over order id, item code, and quantity would logically – to a human – result in these two rows in the view:-

order1 : item1 : 50
order1 : item2 : 4

Instead they correctly (by the mathematics involved) resolve to these four rows:-

order1 : item1 : 50
order1 : item1 : 4
order1 : item2 : 50
order1 : item2 : 4

If you think about it this makes sense. There are two values in each of the item code and quantity range indexes for this document. The ODBC view does not infer any containment from the parent <orderitem> element because it is simply a co-occurrence over range index values.
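
You can see the same four pairings directly with a co-occurrence lexicon call – a hedged sketch, assuming element range indexes exist on code and quantity:-

cts:element-value-co-occurrences(
  xs:QName("code"),
  xs:QName("quantity"),
  (),
  cts:element-attribute-value-query(xs:QName("order"), xs:QName("id"), "order1"))
(: pairs returned: (item1, 50), (item1, 4), (item2, 50), (item2, 4) :)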

You can get around this to some extent by adding a fragment boundary on orderitem. This tells MarkLogic about logical containment. This has implications for storage and search though. I won’t bore you with the details, but you have to be careful.
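
For completeness, here’s a sketch of setting orderitem as a fragment root via the Admin API – assuming a database called Documents, and remembering the trade-offs just mentioned:-

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $config := admin:database-add-fragment-root(
  $config,
  xdmp:database("Documents"),
  (: no namespace, local name orderitem :)
  admin:database-fragment-root("", "orderitem"))
return admin:save-configuration($config)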

What you need is a halfway house between a fully normalised set of 10 or so relational flat documents, and the one order document. You need a document per logical row in your ODBC view. You need these two documents, in effect:-

<order id="order1">
  <deliveryaddress>1 Paddington Way, London</deliveryaddress>
  <orderitem>
    <code>item1</code>
    <quantity>50</quantity>
  </orderitem>
  <madeby>Generated by Adams awesome shredding script</madeby>
</order>

and

<order id="order1">
  <deliveryaddress>1 Paddington Way, London</deliveryaddress>
  <orderitem>
    <code>item2</code>
    <quantity>4</quantity>
  </orderitem>
  <madeby>Generated by Adams awesome shredding script</madeby>
</order>

(The <madeby> is optional – I was just testing static element functionality! More on this below…)

This can be done by semi-normalising the composite denormalised document. (Confused by the terminology yet? Hang in there, help is on the way!) It’s a similar pattern to before, except you partially shred the composite Order document into Full Order Item documents.
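
As a minimal sketch (assuming the composite Order document already exists at a made-up URI), the partial shred is little more than one document insert per orderitem:-

xquery version "1.0-ml";

let $order := fn:doc("/composite/order-order1.xml")/order
for $item at $pos in $order/orderitem
return
  xdmp:document-insert(
    fn:concat("/odbc/order-", $order/@id, "-item-", $pos, ".xml"),
    <order id="{$order/@id}">
      { $order/deliveryaddress }
      { $item }
    </order>)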

Full On Shredding

Occasionally you’ll be storing a dump of data as a document and need to shred it. This is common in conversions from CSV and TSV files to XML, or even from a single Excel sheet if you think about it. You need to take a single document and shred it to hell. It’s not quite a ‘Normalisation’ pattern because you don’t care if there’s an Address1, Address2, Address3 or repeated data – you just need to handle each row as a single document.

It’s really the same mechanism as the partial normalisation, but simpler mathematically and with much bigger performance implications (think 10,000 rows in a data dump rather than 10 rows for a single transaction).
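
Here’s a hedged sketch, assuming the CSV dump has already been loaded into MarkLogic as a text document at a made-up URI, with id, code and quantity columns:-

xquery version "1.0-ml";

let $csv := fn:doc("/dumps/orders.csv")/text()
(: skip the header line, then emit one XML document per data row :)
for $line at $row in fn:tokenize($csv, "\r?\n")[fn:position() gt 1]
let $cols := fn:tokenize($line, ",")
where fn:string-length(fn:normalize-space($line)) gt 0
return
  xdmp:document-insert(
    fn:concat("/rows/order-row-", $row, ".xml"),
    <orderrow>
      <id>{$cols[1]}</id>
      <code>{$cols[2]}</code>
      <quantity>{$cols[3]}</quantity>
    </orderrow>)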

Where to start developing?

You’re probably thinking this sounds like a lot of work. You’d be right. There are many, many benefits though. Also, because you’re working with tree structures as opposed to lots of columns across tables, there’s less work than writing mapping code as Data Access Objects (DAOs). For example, a delivery address element’s entire content can be referenced using the XPath /order/deliveryaddress/* rather than as individual ‘column’ names. This means you can add data to and remove data from the delivery address and not have to alter your scripts.

It is a pain though. Lots of very similar code isn’t the most exciting thing to write. I’d start by thinking of what you want to have as ‘search results’ or ‘application composites’. A composite Order is a good example. Customer is another good one. Once this is done, check out your source data and create a denormalisation as required to feed this new composite.

After this, if the composite is not quite suited to an ODBC view (and indeed you need that functionality) then write a partial normalisation against that new denormalised version. This way you don’t have to start from shredded data – you can take advantage of the fact all the data you need is present in the denormalised document already.

It’s still creating triggers or CPF pipelines by coding XQuery rather than configuring a tool to help you out though. What if you have lots of these, or don’t have the testing time in your project?

Meet my mate Norm

Help is at hand! I hate giving people problems, I prefer giving them solutions. So this week, whilst zipping up and down England on the train, I’ve started a project to do this work for you, based on a configuration XML file per denormalisation rather than writing triggers yourself. I’ve called it Norm (because he works both ways – denormalisation and partial normalisation (shredding)). Let me introduce him to you.

It’s pretty straightforward to list the source documents, whether they are required or not, and how they relate to each other, and to detail what your target structure looks like. It’s harder to write the code to do it yourself. That is what I’ve provided in Norm. I’ve used my knowledge of MarkLogic indexes and functions to provide a tested, performant (if you remember to remove the xdmp:log statements in production) library and samples to get you started quickly.

Here’s a denormalisation configuration for creating my Order composite:-

xquery version "1.0-ml";
declare namespace n = 'http://marklogic.com/norm/main';

xdmp:document-insert("/admin/config/norm-test.xml",
  <n:denormalisation>
    <n:name>ODBC Shred</n:name>
    <n:description>Shred document for ODBC view</n:description>
    <n:uri-pattern>/prog-avail/##s1:uri##-##auto##-odbc.xml</n:uri-pattern>
    <n:collections>norm-generated,odbc-data</n:collections>
    <n:enabled>true</n:enabled>

    <n:template>
      <n:element name="order">
        <n:attribute name="id" source="s1" source-path="/order/@id/fn:data()" />
        <n:element name="deliveryaddress" source="s1" source-path="/order/deliveryaddress/text()" />
        <n:element name="orderitem" source="s1" source-path="/order/orderitem/*"/>
        <n:static>
          <madeby>Generated by Adams awesome shredding script</madeby>
        </n:static>
      </n:element>
    </n:template>

    <n:sources>
      <n:source id="s1" name="order" root-xpath="fn:collection('odbc')" mode="create-update" required="true" shred="/order/orderitem">
        <n:collection-match>odbc</n:collection-match>
        <n:primary-key><n:element>order</n:element><n:attribute>id</n:attribute></n:primary-key>
      </n:source>
    </n:sources>
  </n:denormalisation>,
  xdmp:default-permissions(), "norm-config"
)

As you can see it’s pretty straightforward – you list what the target doc (or docs) look like and where the data comes from, then use XPath to ask Norm (we like to ask, not order, Norm – he’s a sensitive fellow) to place data in particular locations. I’ve also supported common replacement patterns in the URI to make this easy. I’ve even allowed static text or elements to be included.

This particular script executes against the above data in 0.013 seconds. Most of this is spent in logging code. Without logging code it executes in 0.0064 seconds – roughly double the speed. Indexes are required to get this speed though. I have a helper function that shows you, for a particular denormalisation, which indexes you will need.

How do I enlist Norm?

Install instructions are out of the scope of this document. I’ll probably record a video soon to cover it off. For now though you can head on over to my Norm GitHub page to grab the code and read the instructions and design methodology.

I’ve only handled a few basic structures so far. If you have a complex set of relationships (e.g. grandparent rather than just parent), or other specific use cases, please log an issue on GitHub with samples if possible, even just basic mockups, and I’ll add that functionality in.

Naturally I’ll be working on full documentation and test scripts before our community manager, Eric Bloch, tells me off! 8o) (Hi Eric! … have you met Norm yet? Norm, this is Eric…)

Summary

Hopefully you’ve seen a couple of useful ideas for a future unstructured data management application you’re looking to build. If there are any questions I can always be reached at adam dot fowler at marklogic dot com.

Say Goodbye Norm!!!

