I wanted to summarise the last few months of talking to customers about #NoSQL, and to look ahead at what I’m looking forward to in the next few months.
It’s been really interesting working with MarkLogic 6 since last September. I’ve had some great customer use cases come across my desk, ranging from TV programme schedule search, through scientific research data storage, to publishers re-purposing their publications’ content easily for future publications.
What customers ask for
Customer engagements seem to go the same way each time. We tend to come across people with a range of issues, but they’ve all found that the existing technology they’ve deployed doesn’t quite fit their data the way they wished it would. Some of the time this is down to storing a variety of formats of data and trying to massage a relational system into doing that. Other times it is that complex queries over the text content of a column do not scale well. Still other times they have a set of Sharepoint repositories they need a consistent and secure geospatial search capability over the top of.
They are all looking for an alternative to what they’ve previously deployed. What very few of them are doing, perhaps surprisingly, is searching specifically for a ‘NoSQL database’. Occasionally we get calls from people who have tried Mongo’s open source NoSQL store (I won’t use the term ‘database’ because I believe you shouldn’t be called that if you can still lose data) and they want something more advanced, with search, that can scale well. This is still quite rare though.
NoSQL’s presence as a term
A lot of the time we find ourselves educating people about what NoSQL is, how it differs from relational databases, and how it can benefit organisations. This is especially true in the Public Sector in the UK, where you’re likely to find far more people talking about Linked Open Data and Ontologies than about NoSQL databases.
I’ve been impressed though by how quickly people really understand the value that something like MarkLogic can bring. Having a highly scalable database and a feature-rich search engine built into a single product really resonates with people. The key value it has to business people is its ability to really let you get your arms around all of your organisation’s data.
Exploration vs. Schema Design
Sometimes this starts out as a single search or information publishing architecture, with the system of record sitting somewhere behind MarkLogic. Often though this leads on to MarkLogic becoming the main store for new sources of information, to the point where back end systems that are no longer added to are turned off entirely.
The thing I find particularly interesting is the idea that a NoSQL database allows you to explore the data prior to building out an application on top of it. MarkLogic’s universal index allows you to load your content and immediately execute word searches over the top. This enables you to get a feel for the vast array of data you’ve ingested prior to adding specialised indexes, or building particular ‘views’ or ‘aggregations’ of the data by creating denormalisations.
Customers immediately see the value of getting their arms around all of the data without any up front, long and complex schema definition phase. The ability to define fields and search across different document structures has proven valuable on a number of customer engagements.
Recently I took part in the MarkLogic World conference in Las Vegas, USA. I found it a very interesting and informative conference. With conferences there is generally a lot of sales fluff. With this one, though, you get the chance to hear from a lot of customers and field engineers about real-life implementations. A lot of these are very innovative. The BBC’s Digital Semantic Publishing (DSP) architecture, for example, is a great way to help news organisations publish material across a range of topics, to the right audience, very quickly.
What the future holds
There was generally a lot of buzz at the conference around MarkLogic 7 and its autumn release. This is because, for the first time, you will have a product that can hold content as well as triples. A triple consists of a Subject, a Predicate and an Object. Triples are effectively descriptions of how entities relate to other entities, and what properties they may have.
You can use this for simple things like ‘Adam likes Cheese’ or ‘Adam is a member of MarkLogic’. Over time this web of facts builds up and allows you to perform interesting queries, like ‘Show me all people who know a person who is a member of a local council’. You’re using SPARQL to query for matching entities across an entire graph. This is functionality that underpins some Facebook features, but it is increasingly being used for commercial applications.
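To make the idea concrete, here is a minimal Python sketch of that ‘people who know a council member’ query over a toy in-memory triple set. All the names and predicates are invented for illustration; in MarkLogic 7 itself you would express this as a SPARQL graph pattern over the triple index.

```python
# A triple is just (subject, predicate, object). These facts are made up.
triples = {
    ("Adam", "likes", "Cheese"),
    ("Adam", "member_of", "MarkLogic"),
    ("Adam", "knows", "Beth"),
    ("Beth", "member_of", "Local Council"),
    ("Carol", "knows", "Dave"),
}

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a known fact."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# 'Show me all people who know a person who is a member of a local council'
council_members = {s for s, p, o in triples
                   if p == "member_of" and o == "Local Council"}
knowers = {s for s, p, o in triples if p == "knows"}
result = {person for person in knowers
          if objects(person, "knows") & council_members}
print(result)  # → {'Adam'}
```

The equivalent SPARQL would match the two-hop pattern `?person :knows ?friend . ?friend :member_of :LocalCouncil` in one declarative query, with the store doing the joining for you.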
The BBC’s DSP platform, as I’ve already mentioned, uses these relationships to know that a story about a footballer should also be shown when a user searches for the team that the footballer plays for. This expands the universe of relevant data for a particular query.
There are times when I miss the ability to do long research projects like at University. There are a few really interesting things I’ve been looking at. Happily, some have come up in prospect calls, so I’ve had the opportunity to delve deeper into them.
The first of these is provenance of facts. In MarkLogic we often load data and perform entity extraction. We basically ‘tag’ text with an XML element. So ‘Adam Fowler’ would become <person>Adam Fowler</person>. This enables you to perform searches for documents that mention different entities. We call the ability to restrict a search like this faceting. It’s common in search engines, but MarkLogic’s edge is that these tags can be present anywhere in the content, not just in extracted metadata.
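That tagging step can be sketched very simply. The snippet below is my own toy simplification, a plain substitution over a fixed name list, not MarkLogic’s entity extraction pipeline, but it shows the transformation being described:

```python
import re

# A hypothetical list of names we want to mark up; real entity extraction
# would use an NLP pipeline rather than a hand-maintained list.
PEOPLE = ["Adam Fowler"]

def tag_people(text):
    """Wrap each known person's name in a <person> XML element."""
    for name in PEOPLE:
        text = re.sub(re.escape(name), "<person>%s</person>" % name, text)
    return text

print(tag_people("Adam Fowler wrote this post."))
# → <person>Adam Fowler</person> wrote this post.
```

Once documents carry these elements, an element-value index over `person` gives you the facet counts directly.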
Finding the source of facts
What if you have a long document that mentions 20 people, 5 organisations, and 14 places? You can find those documents, but you’d have to either manually review them all to find members of a specific organisation, or else use a NEAR query to give an approximate answer. Neither is an efficient way to perform this type of query.
In MarkLogic 7 you’ll have the ability to store facts that you have derived from documents. You’ll still need some form of interface to infer facts (perhaps by entities’ presence in the same paragraph, as in a recent example of mine) and have a user modify or confirm them, but that is relatively straightforward.
Once you store the facts in a graph, though, it can be useful to link back to the document they were derived from. Think of performing the above query: ‘Find me all people who know other people who are members of a council’. Once you have this person, you can view the list of facts about them in MarkLogic 7.
What if you need the original document though? E.g. if you’re a police service needing to keep evidence. Maintaining an additional triple in each graph to say ‘these facts were derived from this document’ is a very simple yet powerful way to approach this. I’ve mentioned this before, and indeed next week’s blog entry will show a finished set of widgets and a complex query that does precisely this.
Determining where facts came from is called provenance. There are many ways to record it. I’ve kept it quite simple in my example, using just a single relationship class – derived_from. Other specifications, though, like PROV-O, provide much richer descriptions, including who performed the derivation and with what software.
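The derived_from idea reduces to one extra triple per fact graph, pointing back at the source document. Here is a minimal sketch of that lookup; the URIs, predicate name and document path are all invented for illustration, not MarkLogic’s or PROV-O’s vocabulary:

```python
# Each derived fact carries a companion 'derived_from' link to its source
# document, so evidence can always be traced back. All data is made up.
facts = [
    ("person:beth", "member_of", "org:local-council"),
]
derived_from = {
    ("person:beth", "member_of", "org:local-council"): "/docs/report-17.xml",
}

def source_of(fact):
    """Return the document URI a fact was derived from, or None."""
    return derived_from.get(fact)

print(source_of(facts[0]))  # → /docs/report-17.xml
```

In a real triple store you would model the link as a triple on the fact’s graph rather than a side table, but the query shape, fact in, source document out, is the same.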
Evolving entity extraction and classifications
We often talk about entity extraction, as described above. A potentially interesting side effect of MarkLogic including a triple store may be the use of these facts to drive entity extraction behaviour. Let’s say you’ve stored entities of type Person with a property called name.
Rather than maintaining a separate list of names of people of interest, you could instead perform a query on document ingestion to do this dynamically, based on the Person names you hold in the MarkLogic triple store. This way, as you manually tag people as being of interest, they are immediately tagged in all future documents, automatically.
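A toy Python sketch of that ingestion-time behaviour, with the tag list derived from Person facts in the triple set rather than a separately maintained name list. The subjects, predicates and flag values are my own invention:

```python
import re

# Invented facts: one Person, flagged as being of interest.
triples = {
    ("person:1", "type", "Person"),
    ("person:1", "name", "Adam Fowler"),
    ("person:1", "of_interest", "true"),
}

def names_of_interest():
    """Names of all subjects currently flagged as of interest."""
    flagged = {s for s, p, o in triples if p == "of_interest" and o == "true"}
    return {o for s, p, o in triples if p == "name" and s in flagged}

def tag_on_ingest(text):
    """Tag flagged names in an incoming document, querying the facts live."""
    for name in names_of_interest():
        text = re.sub(re.escape(name), "<person>%s</person>" % name, text)
    return text

print(tag_on_ingest("New report mentions Adam Fowler today."))
# → New report mentions <person>Adam Fowler</person> today.
```

Because the name list is computed at ingest time, adding an `of_interest` fact for a new person changes what gets tagged from the very next document onwards, with no separate list to keep in sync.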
You could even extend this so that when they’re first tagged the tagger can say what to do when a new doc comes in – perhaps alerting the user to the new document’s existence, or automatically adding it to a case file.
I’m sure this type of dynamic behaviour could have a variety of other uses. You can imagine a user changing their ‘task role’ and this automatically changing the information they see, by altering their searches under the hood, to provide exactly the information-exploring experience they need for a particular task.
Say you’re looking for identity information on people: perhaps you then search only the parts of the content that include location, membership and name details, and ignore anything else, like activity or financial data. This helps reduce the amount of data you have to trawl through and process manually. It also saves your brain from having to take in such a large array of information when you need to concentrate on one aspect of the data.
I’m continually impressed by the quality of work and variety of use cases being advanced by our partners and customers. I’m looking forward to a lot more conversations – albeit likely large brain teasers! – in the coming months. Of course I’ll keep everyone updated with interesting things that I’m allowed to share through this blog.