…tonnes! In this my 100th blog post on NoSQL I thought I’d write up a few reasons you’d want to consider MarkLogic, specifically, over other NoSQL solutions…
I’ve previously written a blog post ‘NoSQL, huh, what is it good for?’. I’ve noticed a lot of chatter lately around confusion as to when you’d use MarkLogic, or even what to use it for, so I thought I’d reprise the idea and apply it to MarkLogic. Go read that page first for a general NoSQL overview.
What is MarkLogic?
In the simplest terms, MarkLogic is a single product that combines a highly distributed NoSQL database and a search engine, with application services layered over the top.
The first thing a database must do (in my professional opinion) is not lose data. If you’re using it as the primary repository of information it MUST have ACID compliance, with HA, DR, backups and restore features. To not do so is insanity.
Sure, if you’re wanting a high speed read cache over an Oracle database and you’re using MongoDB then no transactional consistency may be fine – the data is held safe in Oracle. It’s just an inconvenience when MongoDB fouls up, as you can recover it from the underlying transactional Oracle DB store.
I dislike delegating features like availability and consistency of information to the application developer. They don’t manage that for a living – database admins and db developers do. Same for security permissions too. Call me old fashioned, but in that delegation approach, madness lies. The magical interwebs are full of stories of woe of people failing when using non-transactional NoSQL databases for primary systems.    [Great parody video here]
For unstructured and semi-structured document management you need a good way of finding the relevant information. In relational databases you know the columns and table structures up front. In a NoSQL database that is schema-agnostic, you may not have that luxury. Thus a search engine being tightly integrated to your database is a good idea.
In MarkLogic the search engine is part of the same product. Thus you don’t need to ‘bolt on’ a third party product with all the integration code and separate update schedules that implies. Also, MarkLogic uses the same underlying indexes for simple primary/secondary key fetching of documents (a la database access) as it does for use by the search engine. Thus MarkLogic is more frugal on disc requirements for indexes.
Everything in MarkLogic is stored as compressed binary trees – NOT as raw documents – not even simply as gzipped documents – so MarkLogic saves disc space over alternatives. MarkLogic storing documents with an average (say 5-15) range indexes will effectively use the same amount of disc space – for data plus indexes – as the raw document. This is part of our secret sauce, and the algorithms are patented.
MarkLogic’s range indexes are also, frankly, pretty damn cool! You can use them to perform BI style analytics in memory over your entire database in sub second response times. (aka in database Map/Reduce) They power MarkLogic’s ODBC server, allowing access over unstructured data from structured languages like SQL! They also make combining text, value, structure, bi-temporal, geospatial and semantic queries in a single hit a breeze – no more needing 5 products for an application stack.
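To give a feel for why a range index makes this kind of analytics cheap, here is a minimal Python sketch. It is a toy illustration only, not MarkLogic's implementation or API: the `ToyRangeIndex` class, the trade URIs and the values are all made up. The idea is simply that values are kept sorted in memory, so a range lookup or an aggregate never has to open the documents themselves.

```python
import bisect

class ToyRangeIndex:
    """Toy in-memory range index: values kept sorted, each paired with a document URI."""

    def __init__(self):
        self._values = []  # sorted values
        self._uris = []    # URI at the same position as its value

    def add(self, value, uri):
        # Insert at the position that keeps the value list sorted.
        position = bisect.bisect_right(self._values, value)
        self._values.insert(position, value)
        self._uris.insert(position, uri)

    def _bounds(self, low, high):
        # Binary search for the slice covering low <= value <= high.
        return (bisect.bisect_left(self._values, low),
                bisect.bisect_right(self._values, high))

    def uris_between(self, low, high):
        start, end = self._bounds(low, high)
        return self._uris[start:end]

    def total(self, low, high):
        # Aggregate straight from the index, without touching any document.
        start, end = self._bounds(low, high)
        return sum(self._values[start:end])

# Hypothetical data: trade amounts keyed by made-up document URIs.
index = ToyRangeIndex()
index.add(30, "/trades/a.xml")
index.add(10, "/trades/b.xml")
index.add(20, "/trades/c.xml")
matching = index.uris_between(15, 30)
amount = index.total(15, 30)
```

A real range index is typed, persisted and distributed across hosts, so treat this purely as a mental model.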
XQuery and XML are about as open a pair of standards as you can get. If my database or application dev team were going to learn anything, I’d prefer it to be an open standard with wide applicability over vendors’ own weird and wonderful languages.
You may not find a deluge of people with MarkLogic skills, but those with XML and XQuery experience will pick up MarkLogic easily, and have great productivity over alternatives once they do.
OK, but what the hell do I use it for?
The short answer is ‘lots!’ – but that’s not particularly revealing! Let’s look at a few small scenarios. I’ll trust the reader to be sufficiently intelligent to apply this to their organisation’s IT landscape.
The sparse data and variety problems
RDBMS work well because they have a known schema. This means they pretty much know what space on disc is required, max, to store a particular row. So they typically go ahead and reserve that space. Problem is, sometimes a row may be very sparse. You use a null value in a few places. All of a sudden you’re using disc to store null values in a lot of places. A pithy intro to the problem is available in the first couple of paragraphs of this PDF.
When will this occur? Think of a contact application. You may have spaces for cell phone, home phone, work phone. Then email. Then postal addresses. Then Skype, Facebook etc. etc. Each individual contact though may only have one or two entries of the possible 200 or so possibilities.
Yes, you can normalise this using a key-value approach, and I’m choosing an extreme example, but sparse data is an issue you should be aware of as an application developer.
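A rough Python sketch of the contact example shows the shape of the problem. The field names are hypothetical and the list is cut down to seven for readability; a real contact schema might have the 200 or so mentioned above. The same contact is held once as a fixed-width row and once as a document carrying only the values it actually has.

```python
# Hypothetical contact fields, shortened for illustration.
FIELDS = ["cell_phone", "home_phone", "work_phone", "email",
          "postal_address", "skype", "facebook"]

def as_wide_row(contact):
    """Relational style: one slot per known column, None where there is no value."""
    return tuple(contact.get(field) for field in FIELDS)

def null_count(row):
    """Count the slots in a row that carry no information."""
    return sum(1 for value in row if value is None)

# Document style: only the fields this contact actually has.
contact = {"email": "adam@example.com", "skype": "adam.example"}

row = as_wide_row(contact)
empty_slots = null_count(row)
```

Here the row has seven slots, five of them empty, while the document form holds just two entries. Scale the field list up and the gap widens accordingly.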
The issue is further complicated when you consider you may not know the future structures of contact information you have to hold. An address is several lines, a phone number may have a country code, area code, and number. What’s the NextBigThing[™] going to have as a format?
Using an RDBMS means you have to design a schema up front before you take even the first new format of information in. This may be fine in some circumstances, but in others it is not. Consider a social network monitoring (open source intelligence) tool.
You build the first version to monitor people threatening to plant bombs on Twitter and Facebook. All of a sudden, the BadGuys[™] move to using a weird and wonderful social network you’ve never heard of. Their messages, although easy to access via RESTful HTTP requests and JSON or XML, are held in a special structure.
You need to bring this online NOW in order to ensure you have all the information required. You don’t have time to design a schema. You don’t want to eat storage by adding another 200 columns to a table.
MarkLogic is schema agnostic. We are document based, but the thing about documents is that you can store rows easily (one parent element matching a table name, with one child element per column, in a flat structure). You can also of course hold key-value data (document URI, aka ID, for the key; document content for the value). Document structures, naturally, are easy to store too. Think of it as an ‘envelope format’ that you can store anything inside.
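As a minimal sketch of that ‘envelope’ idea, the Python below wraps a flat row as a small XML document and files it under a URI. Everything here is an assumption for illustration: `row_to_document` is a made-up helper, the URI and field names are invented, and the plain dict stands in for a database. It is not MarkLogic's API.

```python
import xml.etree.ElementTree as ET

def row_to_document(table_name, row):
    """Wrap a flat row as one parent element with one child element per column."""
    parent = ET.Element(table_name)
    for column, value in row.items():
        child = ET.SubElement(parent, column)
        child.text = str(value)
    return ET.tostring(parent, encoding="unicode")

# Key-value view: the document URI is the key, the document content is the value.
# A plain dict is used here as a stand-in for the database.
store = {}
store["/contacts/1.xml"] = row_to_document(
    "contact", {"name": "Adam", "email": "adam@example.com"}
)
```

The stored value is simply `<contact><name>Adam</name><email>adam@example.com</email></contact>`, so the same envelope serves as a row, a key-value entry and a document at once.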
Thanks to MarkLogic’s universal index, all of the structure, values, words, word and phrase stems, permissions and collections are indexed on the document during load. They are available as soon as the document is committed (within the transaction boundary). Thus you can bring on new data types and explore their content *before* you add any specific range indexes over individual fields or metadata to that type. No DB schema design up front required. This makes it cheaper and quicker to onboard new information types.
MarkLogic’s universal index combined with schema-agnosticism allows this ‘store and explore’ functionality. Other NoSQL DBMS can only store/retrieve as they don’t index the *content* of the information up front, and they certainly don’t combine it with in database analytical functions as is possible in MarkLogic.
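To illustrate ‘store and explore’, here is a toy Python sketch of indexing on load. It is a simplification under stated assumptions, not how MarkLogic's universal index is built: the `ToyStore` class is made up, it indexes only element names and lower-cased words, and it skips stems, permissions and collections entirely. The point is that two documents with different, previously unseen structures become searchable the moment they are inserted, with no schema declared first.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

class ToyStore:
    """Toy document store that indexes structure and words at insert time."""

    def __init__(self):
        self.documents = {}
        self.word_index = defaultdict(set)     # word -> URIs containing it
        self.element_index = defaultdict(set)  # element name -> URIs containing it

    def insert(self, uri, xml_text):
        root = ET.fromstring(xml_text)
        self.documents[uri] = xml_text
        for element in root.iter():
            self.element_index[element.tag].add(uri)
            # Only direct element text is indexed; tail text is ignored for brevity.
            for word in re.findall(r"\w+", (element.text or "").lower()):
                self.word_index[word].add(uri)

    def search_word(self, word):
        return sorted(self.word_index.get(word.lower(), set()))

    def uris_with_element(self, tag):
        return sorted(self.element_index.get(tag, set()))

# Two made-up message formats, neither described to the store in advance.
toy = ToyStore()
toy.insert("/msgs/1.xml", "<post><handle>bob</handle><body>Meet at the station</body></post>")
toy.insert("/msgs/2.xml", "<tweet><text>Station closed today</text></tweet>")
station_hits = toy.search_word("station")
handle_hits = toy.uris_with_element("handle")
```

Searching for "station" finds both documents even though one is a `post` and the other a `tweet`, and asking which documents contain a `handle` element lets you explore a format you have never seen before.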
The real time valuable information in flight problem
Let’s say you’re sat in an intelligence analysis cell in [your favourite security agency in your favourite country]. You search on a daily basis for certain key phrases, people, places, or combinations thereof.
You do this manually, but since your agency has joined the 21st century you can also use a search engine to do this quickly, across all allied intelligence data held. You can literally ask ‘Tell me everything we know about [your least favourite terrorist organisation]’.
Now let’s say a secret squirrel sitting in an oak tree with a long zoom SLR camera has just observed a member of this organisation drop off information to a particular contact. Separately, the military are boots on the ground ready to go arrest this guy half way across the world. Oopsie.
Is it just that you’re going to miss him? Is it a trap for your soldiers? I have no idea, but it’s probably a good idea for that information to be available ASAP after it is recorded to those guys before they go in. The analyst may not know of this operation so can’t pick up the phone.
This is where alerting comes in. Recall that MarkLogic is a database and search engine in a single product. We use one set of indexes that are updated at the same time as the document is committed to the system (within the transactional boundary). This means as soon as the transaction is complete, the document is available to find. Great.
Another feature of MarkLogic is to take any arbitrarily complex search, perhaps crossing text, element, value, geospatial and semantic terms, and store that as a document. Why you ask? Because we have a special index for these stored searches that is used to fire alerts.
If you save your search you can attach one or more actions to it. Thus you can alert people or systems, or perform some other action (BPM process update?) based on a new document entering the system that matches this saved query.
This is the basis of alerting in MarkLogic. It is as near instant as you can get because we have a built in search engine with sophisticated functionality that is updated as information is added, in real time. This makes our alerting real time. This means our troops would be alerted to this person’s latest position, and thus aware of the fact the mission has changed. Huzzah! Lives saved, evil terrorist plot thwarted, world peace ensues. (Because as we all know the latest new technology, like NoSQL, can make you toast in the morning and cure world peace, right?)
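The saved-search idea can be sketched in a few lines of Python. This is a toy under loud assumptions: the `ToyAlerter` class, the search name and the report text are all invented, a ‘search’ here is just a set of required words, and the sketch checks every saved search in a loop, whereas the text above describes a special index over stored searches doing that job. It shows the shape of the flow only: save a search with an action, then fire the action when a newly committed document matches.

```python
import re

class ToyAlerter:
    """Toy alerting: saved searches are checked against each newly committed document."""

    def __init__(self):
        self.saved_searches = []  # (name, required words, action to call on a match)

    def save_search(self, name, required_words, action):
        required = {word.lower() for word in required_words}
        self.saved_searches.append((name, required, action))

    def on_commit(self, uri, text):
        """Run when a document is committed; returns the names of searches that fired."""
        words = set(re.findall(r"\w+", text.lower()))
        fired = []
        for name, required, action in self.saved_searches:
            if required <= words:  # every required word appears in the document
                action(uri)
                fired.append(name)
        return fired

# Hypothetical usage: the 'action' just records the URI; it could notify a person or system.
notifications = []
alerter = ToyAlerter()
alerter.save_search("contact-sighting", ["courier", "dropoff"], notifications.append)
fired = alerter.on_commit("/reports/42.xml", "Courier observed at dropoff point")
quiet = alerter.on_commit("/reports/43.xml", "Nothing to see here")
```

The first report fires the saved search and records its URI; the second matches nothing and fires nothing.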
Use cases exist in other areas too. These could include Summary Care Records, e.g. ‘We’ve just determined this guy is allergic to penicillin’ in a hospital lab; separately, a day later, he’s in a mountaineering accident elsewhere in the country. You need to ensure that record is up to date ASAP. The FAA use MarkLogic for this feature too.
The XML is everywhere, like sticky goo, but how the hell to store it problem
Storing XML on disc directly is a sucky solution. XML is quite verbose and eats space. So let’s not do that.
So what to do? You can store it in an XML column in your favourite RDBMS. You’ll need to extract some information to add to the columns as primary/foreign keys, but that’s no hassle in the grand scheme of things.
Problem is though, you’re not taking advantage of the other content in the XML document. You’ve not got a full text index. You’re not processing the document to perform entity extraction or entity enrichment.
You also can’t add an index over geospatial co-ordinates no matter where they appear in any document in the system. You’d have to extract that from the document and store separately in two more columns. Then what if there is more than one set of co-ordinates?… More schema design work.
OK, so storing XML in an XML column sucks too. What about then bowing to the inevitable and shredding the XML document across a set of tables using the XML schema as a basis? That way all information is available for query, and you can rebuild the document to retrieve it.
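A small Python sketch shows why ‘wherever they appear’ matters. The element names (`point`, `lat`, `lon`) and the report document are hypothetical, and `extract_points` is a made-up helper rather than any product's API. It collects every co-ordinate pair in a document regardless of nesting depth or how many there are, which is exactly what a fixed pair of latitude/longitude columns cannot hold.

```python
import xml.etree.ElementTree as ET

def extract_points(xml_text):
    """Collect every (lat, lon) pair found in <point> elements at any depth."""
    root = ET.fromstring(xml_text)
    points = []
    for element in root.iter("point"):
        lat = element.findtext("lat")
        lon = element.findtext("lon")
        if lat is not None and lon is not None:
            points.append((float(lat), float(lon)))
    return points

# A made-up report with two sets of co-ordinates at different depths.
report = (
    "<report>"
    "<point><lat>51.5</lat><lon>-0.12</lon></point>"
    "<sighting><location><point><lat>48.85</lat><lon>2.35</lon></point></location></sighting>"
    "</report>"
)
points = extract_points(report)
```

Both pairs come back, in document order, without anyone having decided in advance how many a report may contain.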
Sounds good in theory, and many people who add XML to an existing relational application do this. This is best when the document encompasses several known, fixed relational sets of information. A good example is an e-commerce order document. It links to product information, quantity, price, total price, billing details, delivery address. All structures are known up front, you’re just using XML as a convenience to submit the whole set in one go – basically as a transmission format.
What if you want to do more though? In the above, what if you want to place special tags around people, places, organisations, time periods – basically enrich the document as it enters. What if you want to do a free text search in specific areas of the document ‘Find me documents that mention Adam within 10 paragraphs of Cheese’ – doing that in relational is really, really hard. Try writing SQL for ‘where column contains Adam, with the 10 nearest paragraphs (perhaps rows in the same database table) having Cheese’. How do you define ‘near’ across rows in the same table?
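For comparison, here is roughly what that proximity test looks like once the document is treated as an ordered list of paragraphs rather than unordered rows. This is a naive Python sketch, with a made-up `mentions_within` helper and invented sample text; a search engine would answer this from word position indexes rather than scanning, but the definition of ‘near’ is the same.

```python
import re

def mentions_within(paragraphs, first, second, distance):
    """True if a paragraph mentioning `first` lies within `distance` paragraphs
    of one mentioning `second` (whole-word, case-insensitive match)."""
    def positions(term):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        return [i for i, paragraph in enumerate(paragraphs) if pattern.search(paragraph)]

    first_positions = positions(first)
    second_positions = positions(second)
    return any(abs(i - j) <= distance
               for i in first_positions
               for j in second_positions)

# Made-up document: 'Adam' in paragraph 0, 'cheese' in paragraph 10.
paragraphs = ["Adam arrived."] + ["Filler."] * 9 + ["The cheese was ripe."]
near_enough = mentions_within(paragraphs, "Adam", "Cheese", 10)
too_far = mentions_within(paragraphs, "Adam", "Cheese", 9)
```

Because paragraph order is part of the document, ‘near’ is just a difference between positions. Shred the same paragraphs into table rows and you have to invent and maintain that ordering yourself.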
Also, if you have to take on new varieties of XML, or versions of existing schema – not out of the range of possibilities with software upgrades in your application stack – how do you manage two schemas (or 200, most likely) at the same time? The answer – it’s a frickin’ nightmare.
Use the most appropriate database for your data. If it’s XML, and a variety of XML at that, and you need advanced query and/or analytics, then use MarkLogic. Using MarkLogic minimises the amount of ‘plumbing code’ to get data in and out. A more appropriate question for this data set is ‘Why NOT use MarkLogic?’
MarkLogic scales well with a variety of data. It has Enterprise features around ensuring data is safe and secure. It has advanced search and analytics capabilities. If you need that functionality, you should at least evaluate MarkLogic.
Why is MarkLogic better than [my favourite alternative RDBMS or NoSQL product here]?
Because they suck… Just kidding! Stop writing the flames right now.
If you were building a CRM solution with a known data model (and many have) then why not choose an RDBMS? The structure and relationships are known. The data fits well into rows and columns. The documents that do exist have a class and metadata that can be quantified, and fixed. It’s a good fit for an RDBMS.
Many people, however, approach data problems with this kind of thought process:-
- I need to manage lots of information centrally, and not lose it. I’d better use a database.
- We have an [Insert your RDBMS product here] site license/ELA – let’s try to use that.
- Let’s model the data we know about in a schema and use that for phase 1.
- Hey look, it works pretty well! Let’s go develop phase two.
- Ah. This is really hard now. We’re having to support 4 different formats, and different versions of those formats, across 50 different sources/states/counties with their own rules, and data requirements… Let’s create a lowest common denominator schema that we can crowbar all these formats into.
- Now let’s write a bunch of code to reformat the data on the way in, and the way out, of our database schema. Hey it kinda works now.
- Oh look. Performance now sucks and it’s costing us millions to maintain. Oops.
There is a better way. I believe in the right database system for the right type of information. If you have lots of documents flying around, be they insurance documents, articles for publishing, intelligence reports, financial trades, even images and word documents – then you should at least consider a document-orientated NoSQL database like MarkLogic. And if you don’t want to lose data, then probably start with MarkLogic.
Is it really that radical to suggest evaluating a document-orientated NoSQL database like MarkLogic for document-orientated information management problems? I don’t think so.
An alternative way of thinking about this data problem is as follows:-
- We have this data we need to manage that changes a bunch of times. I’ve heard NoSQL databases are good for that, let’s try one.
- Open Source means cheaper overall, right? Let’s use [insert your favourite community edition of a NoSQL database here], because it’s mentioned a lot on the web! (It may even be ‘Web Scale’ LOL)
- We’ve created a fantastically awesome proof of concept for phase 1 in our community NoSQL database. It was really easy to do! Just a few lines of code. Data goes in, data comes out. It’s really fast too! Let’s go ahead and do this in production.
- Hey, we’re getting some weird reports of some changes not being applied to our data. Also we now need to add new sharing/privacy features. Let’s get the devs on that.
- It just took me 20 devs and 6 months, but we’re now not losing data! Go open source NoSQL!!! Sure, we’re taking a performance hit now we’ve put security checks in the database, and are doing all this weird transactional code stuff, but hey it still works! Let’s keep going!
- We need to process the information we’re storing now. We want reports, analytical charts, we want to provide advanced search. I’ve heard you can plug in [your fave search and / or analytics open source product here] to my community NoSQL database. Let’s go try that!
- Phew. Finally got that all plugged in together! Glad we have those 20 developers sitting around to write plumbing code when I need it!!!
- What do you mean our dev costs are higher than the software license and maintenance costs of an RDBMS or MarkLogic???
You see a similar flaw in the logic. Just because you can do something and it appears simple at the start, doesn’t mean the answer is a simple one.
Let’s be clear. I love the Open Source movement. It’s given us Linux and FreeBSD. We have MySQL and PostgreSQL because of it. And they are awesome products. And Firefox rocks! (LibreOffice sucks though, let’s be honest).
The one lie that is perpetuated is that software that you don’t have to pay an up front license fee for is ‘free’. It isn’t. Every project has costs. If you have to build functionality on top of a product – any product, whether closed or open source as we see in the above two examples – then that’s extra cost.
Every time you write plumbing code, be it to link a database to a search engine, database to a website, ingest information and transform it (ETL type activity), add security checks, ensure your data is safe (ACID transactional integrity) – that’s extra cost. These costs MUST be taken into account when planning your project. Also take them into account over the next three years of functionality you want, else you’ll get bitten in the proverbial posterior.
I believe it is damaging to NoSQL generally for people to ignore these issues, as it sucks people into projects with the incorrect tools for the job. When they come out of the other side, they feel burned, and disillusioned with an entire breed of technology. The ‘trough of disillusionment’ as Gartner would surely put it.
But where does MarkLogic fit?
We get involved in many of the first set of scenarios mentioned above, but increasingly the second kind too. People try to use RDBMS to solve their issues as they have licenses for it. They find, typically in version 2 or 3 of their platforms, that it has scalability or development cost issues.
Invariably they then need a schema-agnostic database, normally in environments with lots of XML flying around (amongst other things…), but with the same features they are used to from their RDBMS. They’ve just started with the wrong tool for the job.
They want transactional integrity as it’s a mission critical problem. Sometimes this means billions of dollars of trades, or even lives being potentially at risk on the front line. They cannot afford to lose information, to have information delayed, to have security issues, or for a system failure. A down day is quite literally a death day for these people.
MarkLogic is one of only around 4 NoSQL databases of over 150 with transactional integrity. It is certainly the only one with Government Grade security built in. It’s the only one that has been around for 12 years and used in mission critical systems every day. We’re also the only one to mix semantic functionality with other data types in a single platform.
What’s it really like to use on a first project though?
Click on Boeing here, top right of the page and you get two very good introductory videos, talking around the issues I’ve mentioned above. It’s really worth a watch. There’s plenty more where that came from too. Just click on any of the logos for more information.
Or even better, ask me a question about a particular problem you have in the comments below, and we’ll have an open chat about it and everyone can chip in.