Choice of NoSQL databases from Cloudera…

Cloudera and MongoDB issued an interesting press release this week, saying they aim to continue working closely together. Is this AGoodThing[™]? I discuss here.

Cloudera make their living selling services, support and premium add-ons around Hadoop and its ecosystem of products. Over the years they’ve supported many Hadoop-integrated products, and even built some of their own, either for profit or for the wider community.

As has been mentioned in various places on the web this week, a recent joint announcement by Cloudera and MongoDB is making a few waves. Is this significant? What does it mean for these two companies? What does it mean for customers? I discuss this below.

What does this mean for the two companies?

This press release was just a restatement of both companies’ strategy. Both Cloudera and MongoDB work with technology partners to give their customers integration options. This is continuing.

Nothing new in this regard. Just a restatement of policy in public. Glad handing all around. 8o)

According to a recent Wikibon analysis of the Hadoop and NoSQL markets, MongoDB have 7% of the market (NoSQL JSON document DB) and Cloudera have 10% (Hadoop related tech). I don’t think this will be affected by a press release. (Incidentally, MarkLogic is at 13% market share!)

What does this mean for customers?

This is hard to tell. For MongoDB it shows they have someone big in the Hadoop space who is able and willing to help them with their Hadoop Map/Reduce connector. No new features for the connector were announced in the release, though, and its development will still be done by MongoDB.

Perhaps Cloudera support for MongoDB and its Hadoop connector will be an option soon? That may be a good thing as it gives customers of MongoDB a choice other than MongoDB to source support from, potentially creating competition in the market.

As for Cloudera customers I’m not too sure. It may confuse people asking Cloudera about NoSQL. Below is a potential conversation that, as a sales engineer for NoSQL vendor MarkLogic, I can see easily happening:-

Customer: “I’ve got a bunch of JSON documents. Hundreds of thousands of the damned things. I need to store them, and access them like from a database. I’ve heard of NoSQL. As my trusted Hadoop vendor, do you have anything that can help me?”
Sales Droid: “Why yes madam! We are offering support on the MongoDB database. This is built to store, manage and query JSON documents, and scales horizontally. It even has a Map/Reduce connector so your existing Hadoop M/R dev team can access those docs!”
Customer: “Sounds just peachy! Oh but hey, I want to also perform detailed analytics of those documents, performing aggregate operations on entire subsets of this data, like groups of columns in my old relational world. Does MongoDB do that?”
Sales Droid: “Ah well, we also sell support for the HBase NoSQL (Bigtable clone) database. This is built right on top of Hadoop! It’s great for loading in large columns of data and analysing them. This is better for operational analytics than MongoDB.”
Customer: “Oh. Ok. Well if that works we’ll use it… Oh by the way, this will be for a cloud service. We want to make sure that only people with the right access can access the right information. It’s more than just role based access control – I want only people with ‘AnalyticsSubscriber’, ‘MarketX’ and ‘CountryY’ roles to access them – that’s AND role logic, kinda like compartments. Can HBase and MongoDB do that?”
Sales Droid: “Ah well, in that case you need Cell Level Security. We have a higher-end database alternative to HBase we support called Accumulo. This was originally developed by the NSA! Cell level security is what you need!”
Customer: “Well so long as my security model is consistent throughout, I guess that’ll be fine. Oh by the way, I want really fast SQL processing over my data. I heard you had something for that. Springbok or something?”
Sales Droid: “Ah you mean Impala! Yes, it’s a fantastically fast and awesome parallel processing engine that can query data held in HDFS and HBase!”
Customer: “So I bet it can do SQL queries over data held in MongoDB and Accumulo too, right?”
Sales Droid: *gulp* “Err well no actually. You’d have to have it in HBase.”
Customer: “What about Security though? Also, how do I manage my data holistically if I need all this bunch of kit???”
Sales Droid: *eyes scan around the conference center* “I’m sure Fred, our Sales Engineer can help… errr… where is he…”
Customer: “Look, can you sell me a single product, or suite of fully integrated products, that meet my application goals as well as my security and systems management goals, or not!?!”

Yeah that’s gonna be a painful conference. Damn his Sales Engineer for not being around to bail him out! 8o)
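As an aside, the “AND role logic” the customer asks for is easy to state but differs from a plain role check. Below is a minimal Python sketch of the difference. The function names and the example users are purely hypothetical, made up for illustration; this is not the API of MongoDB, HBase, Accumulo or MarkLogic, just the set logic behind the idea.

```python
# Hypothetical illustration only: these helpers are not part of any product's API.

def can_read_any(user_roles, doc_roles):
    """Plain role check: holding ANY one of the document's roles is enough (OR logic)."""
    return bool(set(user_roles) & set(doc_roles))


def can_read_all(user_roles, doc_roles):
    """Compartment-style check: the user must hold EVERY role on the document (AND logic)."""
    return set(doc_roles) <= set(user_roles)


# Roles the customer in the conversation wants to require together.
required = {"AnalyticsSubscriber", "MarketX", "CountryY"}

# Made-up users for the example.
alice = {"AnalyticsSubscriber", "MarketX", "CountryY", "Staff"}
bob = {"AnalyticsSubscriber", "MarketX"}  # missing CountryY

alice_and = can_read_all(alice, required)  # holds all three roles
bob_and = can_read_all(bob, required)      # lacks one role, so AND logic denies
bob_or = can_read_any(bob, required)       # OR logic would still let him in
```

With OR logic bob gets in because he holds at least one of the roles; with AND logic he is refused because he lacks CountryY. That gap is the reason the customer says it is “more than just role based access control”.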

In all seriousness though, having all the pieces is one thing – but creating a holistic Enterprise grade solution and making it easy for customers to understand – and buy – is quite another.

I’m glad Cloudera are around looking at these things. They’ve done a great job spotting gaps in the Hadoop ecosystem and plugging or improving them.

They now have the challenge of the IBMs of this world though – lots of bits of kit, but how to sell a holistic solution. It’ll take time to fix – if indeed they can as these aren’t their own products.

In the meantime, customers will have to choose the one NoSQL database that gives them the most of what they need and live with it, or write a lot of integration code (utilising expensive services – open source doesn’t necessarily mean ‘free’, especially in integration and ETL).

This means customers need educating independently about which database most suits their needs.

Obligatory advertisement warning

It’s hard for me in situations like this. The above issues for customers are very real. There’s a lot of hype and not a lot of people answering very, very complex questions for big Enterprises doing mission critical things.

This is why I’m writing a book on NoSQL. Aimed at Enterprise customers and developers, it will point out the areas where products are strong, and help you understand which approach and product to evaluate for your own particular data management needs.

Back to the above problem though… Doing all these things – JSON document storage, security, operational analytics – in real time, with Enterprise grade consistency, resilience and systems management tools, is what MarkLogic is good at. Check out the MarkLogic website for its extensive detail on MarkLogic with Hadoop.

Further Reading

Cloudera press release:

The Register article:

Giga OM analysis:

InfoWorld Analysis:

MarkLogic and Hadoop for Fun and Profit:

MarkLogic Hadoop Resource Center:
