Combining Semantic and Document NoSQL worlds…

In this blog entry I introduce some ideas around how a combination of a triple store and Document database in MarkLogic 7 will bring unique benefits…

MarkLogic is currently used as a Document repository and Search Engine by a variety of organisations. These range from Publishers to Government agencies and Financial Services institutions.

These organisations use MarkLogic because it is fast, offers enterprise-class security and ACID compliance for document operations, and provides the search functionality you’d expect in a modern search engine. That includes real-time alerting and advanced geospatial search, such as polygon-polygon intersection.

Many of these organisations also store or publish Linked Data and RDF. The BBC do this to represent relationships between football players, clubs and divisions on the BBC Sport website. This supports their journalists, who no longer have to consider every related term and criterion that a story should be published under or tagged with. Merely tagging a new story with a player name ensures that the story also appears on that club’s page and the football league’s page. They call this Dynamic Semantic Publishing.

This is a good example of combining document (the stories) with semantic (the wider world of relationships between entities) storage and querying capabilities. Using just a triple store or graph store, or just a document database, is not sufficient to provide a compelling solution. Combining a document database with a triple store is.
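To make the tagging idea concrete, here is a minimal sketch in plain Python. It is not MarkLogic's API or the BBC's implementation; the entity names, the predicates (`playsFor`, `competesIn`) and the helper functions are all hypothetical, chosen only to show how one explicit tag plus a few triples can yield the full set of pages a story belongs on.

```python
# Illustrative only: plain Python, not MarkLogic's API. All names are hypothetical.

# Triples as (subject, predicate, object) tuples.
TRIPLES = {
    ("player:alice_example", "playsFor", "club:example_united"),
    ("club:example_united", "competesIn", "league:example_league"),
}

# Predicates we follow outward from a tagged entity.
FOLLOW = {"playsFor", "competesIn"}


def related_entities(entity, triples=TRIPLES):
    """Return the entity plus everything reachable via FOLLOW predicates."""
    seen = {entity}
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p in FOLLOW and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


def pages_for_story(story):
    """Pages a story should appear on, given only its explicit tags."""
    pages = set()
    for tag in story["tags"]:
        pages |= related_entities(tag)
    return pages


# The journalist tags the story with the player only.
story = {"id": "story-1", "headline": "Late winner", "tags": ["player:alice_example"]}
print(sorted(pages_for_story(story)))
```

The story document carries a single tag; the club and league pages come from the relationships, which are maintained separately from the story itself.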

There are further, more common and generic examples too. I came across one last week while talking to a partner company. They hold project documents. A project is a concept that exists outside of the document world. Projects have members, who in turn have one or more roles on a project. Members also have particular roles within documents written for those projects, e.g. originator or reviewer. Projects have project numbers and titles, while documents have particular classes, creation and update dates, and a record of who created or updated them.

Some of this information makes sense living in the content world – like creation/update times for the documents. Project information and which people belong to projects, though, should exist and be managed in its own right. Topics mentioned in documents and covered by projects blur the lines between content that lives in the document (E.g. folksonomic tagging – ‘metal’ research documents) and project level information – metallurgy research projects.

This, too, is a good candidate for semantic modelling with content search. You can quickly discover a community of practice and find who has worked with whom before. You can ask who has previously been in charge of projects covering geological surveys where documents included the phrase ‘tectonic plate formation’ or ‘atlantic trench’. Again, this is a query that is not easy to answer with just a triple store or just a document database.
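The geological survey question above can be sketched as a query that crosses both worlds. This is a rough illustration in plain Python, not MarkLogic's API; the data, the predicates (`leads`, `hasTopic`, `writtenFor`) and the function names are made up for the example.

```python
# Illustrative only: plain Python, not MarkLogic's API. All data is made up.

# Semantic side: who leads what, what each project covers, which document belongs where.
TRIPLES = [
    ("person:p1", "leads", "project:geo1"),
    ("person:p2", "leads", "project:met1"),
    ("project:geo1", "hasTopic", "topic:geological_survey"),
    ("project:met1", "hasTopic", "topic:metallurgy"),
    ("doc:d1", "writtenFor", "project:geo1"),
    ("doc:d2", "writtenFor", "project:met1"),
]

# Document side: the text content itself.
DOCUMENTS = {
    "doc:d1": "Notes on tectonic plate formation along the ridge.",
    "doc:d2": "Tectonic plate formation is out of scope for this alloy study.",
}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is a known triple."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def leaders_of_projects(topic, phrases):
    """People leading projects on `topic` that have a document containing any phrase."""
    lowered = [ph.lower() for ph in phrases]
    leaders = set()
    for project in subjects("hasTopic", topic):
        docs = subjects("writtenFor", project)
        if any(ph in DOCUMENTS.get(d, "").lower() for d in docs for ph in lowered):
            leaders |= subjects("leads", project)
    return leaders


print(leaders_of_projects(
    "topic:geological_survey",
    ["tectonic plate formation", "atlantic trench"],
))
```

Note that `doc:d2` also contains the phrase, but it belongs to a metallurgy project, so a text search alone would return a false hit. The triples alone, on the other hand, know nothing about what the documents say.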

Importantly, a triple store helps model not just the fact that many entities were mentioned in a document (like using an XML element to wrap a place name), but also, crucially, the relationships between those entities, and those entities’ relationships with the documents. This extra context is key to answering more complex questions about all the information organisations store.

I’ve been grappling with these problems and ideas myself recently. I believe best practice will emerge rapidly when MarkLogic 7 is released. I’ll be maintaining my own MarkLogic ontology-by-example based on coming up with small, quick answers to queries that need to run across the document and triple store boundaries.

The next year should prove to be very interesting. There is already a palpable buzz around MarkLogic 7 and its semantic features, new triple store, deployment management, and search enhancements. I recommend you all take a look when the release announcement arrives, hopefully in a few weeks’ time.

For more information, read about MarkLogic 7 on our website.

2 comments

  1. This is very interesting. I am just beginning to explore ML6 for a large healthcare project. Now we use XML Schema 1.1 marked up with RDF for semantic links, with all instance data referring to a specific schema. (see the website below) We haven’t explored using triples per se.

    1. Hi Timothy. Your project sounds like a great use of MarkLogic. MarkLogic 7 has built-in semantic search capabilities, so it is good for managing information as well as acting as a consolidation and publishing platform. We’ve just updated our website for the new version, so please check that out to see if it’s of interest. I’m always available on Skype for a chat too if you like. (Skype: adamfowleruk). Watch twitter for #MarkLogic, #askmarklogic and #summit over the next few days to see live tweets of new features and interesting customer stories from our summit series too.
