Hybrid NoSQL: Key-Value use cases

I’ve been talking about how NoSQL database vendors are trying to support more data models from other types of NoSQL database. I call these Hybrid NoSQL databases. In this article in the Hybrid NoSQL Series, I talk about how other types of NoSQL database can handle Key-Value use cases.

What is a Key-Value store?

A K-V store is a relatively simple data model, but one that allows blindingly fast data operations. Because speed is of the essence, more advanced functionality is generally not supported.

Most things can be a key-value store. A file system on your laptop is a key-value store. The key is the file name, and the value is the file content. Your laptop doesn’t (generally) know anything about the file or handle it specially – it just allows you to open, read, modify, and save the content. No fancy operations at all.

What can K-V stores do?

Here are operations that all K-V stores can do:-

Store this binary value with this key identifier
Fetch the binary value for this key identifier
Shard my data across multiple servers, evenly distributing my K-V pairs, to ensure performance
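
As a rough illustration of those three operations, here is a minimal in-memory sketch in Python. The ShardedKVStore class and its hash-modulo placement are assumptions made purely for illustration; real K-V stores use more sophisticated placement schemes and keep their data on disk or across a network.

```python
import hashlib
from typing import Optional


class ShardedKVStore:
    """Toy key-value store that spreads keys across N in-memory shards."""

    def __init__(self, shard_count: int = 3) -> None:
        # Each shard stands in for one server; here it is just a dict.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key: str) -> int:
        # Hash the key so placement is stable and roughly even.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key: str, value: bytes) -> None:
        """Store this binary value with this key identifier."""
        self.shards[self._shard_for(key)][key] = value

    def get(self, key: str) -> Optional[bytes]:
        """Fetch the binary value for this key identifier (None if absent)."""
        return self.shards[self._shard_for(key)].get(key)


store = ShardedKVStore(shard_count=3)
for i in range(300):
    store.put(f"PRODUCT{i}", b"\x00\x01")

# How many keys ended up on each shard.
shard_sizes = [len(shard) for shard in store.shards]
```

Hashing the key means any client can work out which shard holds a key without asking a central directory, which is part of why plain store and fetch can be so fast.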

Here are some not so common functions:-

Increment/decrement this key’s value (Redis DECR, DECRBY, INCR, INCRBY, INCRBYFLOAT)
Check a key exists (Redis EXISTS)
Append this data to the List (or Set) held in the value of this key (Redis RPUSH for Lists, SADD for Sets; APPEND itself appends to a String value)
Stack operations (Redis LPUSH and LPOP)
Hash field operations (Redis commands starting with H* – there are many!)
Geospatial field operations (A Beta, in testing, function of Redis with GEOADD, GEODIST, GEOHASH, GEORADIUS, GEOPOS)
Publish / subscribe (Redis allows clients to subscribe to notifications about keys! Kinda cool!)

Why would I want to do these operations in another type of NoSQL database?

Probably because you already own one, or you also need a Document store or a Column store in another application – and are wondering if you can re-use that. Sure, it’s never going to be as blindingly fast as a pure K-V store – but perhaps you don’t need to process 400 000 transactions per second per server.

What can and can’t I do?

I’m going to use MarkLogic Server as a comparison. MarkLogic is a Hybrid NoSQL database supporting Document, Search and Triple store operations. It can also be used to store arbitrary binary data, as well as the text, JSON and XML for which it was originally built. Many source documents are binary – like PDFs. This means it has the core capabilities of storing and retrieving arbitrary data, just like a K-V store.

Store and fetch

MarkLogic can be used to store binary data by simply using its REST API. Do a PUT to /v1/documents?uri=/some/doc/uri.bin with the Content-Type header set to an appropriate MIME type for what you want. Use GET /v1/documents?uri=whatever to fetch it in its original content type.

The URI in MarkLogic is our key identifier. You can assign one yourself, as above, or ask the server to generate a random one by doing a POST to /v1/documents with an extension parameter instead. In this case, the URI comes back in the POST response (in the Location header).
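
A minimal sketch of store and fetch as HTTP requests, using only the Python standard library. It builds the requests but does not send them. Assumptions: BASE_URL is a hypothetical address for a MarkLogic REST instance, the store uses PUT (the method I understand the REST API to use for a caller-chosen URI), and authentication is left out entirely.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a MarkLogic REST instance at this hypothetical address.
BASE_URL = "http://localhost:8000"


def build_store_request(uri: str, value: bytes, content_type: str) -> Request:
    """Build (but do not send) a PUT that stores `value` under the key `uri`."""
    url = f"{BASE_URL}/v1/documents?{urlencode({'uri': uri})}"
    return Request(url, data=value, method="PUT",
                   headers={"Content-Type": content_type})


def build_fetch_request(uri: str) -> Request:
    """Build (but do not send) a GET that fetches the value for the key `uri`."""
    url = f"{BASE_URL}/v1/documents?{urlencode({'uri': uri})}"
    return Request(url, method="GET")


store_req = build_store_request("/some/doc/uri.bin", b"\x01\x02",
                                "application/octet-stream")
fetch_req = build_fetch_request("/some/doc/uri.bin")
```

To actually send either request you would pass it to urllib.request.urlopen, after adding whatever authentication your server is configured to require.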

Sharding

MarkLogic large binaries (by default, anything over 1 MB, although this can be configured) are stored as-is on the file system. Changes are journaled, and the data is replicated for HA reasons within the transaction boundary – it is ACID compliant. Database Replication and backups also work for binary data, so no worries there.

The machine that receives the new document (or value) randomly places the document in a shard (called a forest) on a server in the cluster. MarkLogic Server also has a feature called automatic rebalancing, so if you do accidentally add all your docs to one server, MarkLogic will rebalance them across the cluster for you automatically. All without any duplication or data unavailability.

Increment / Decrement

MarkLogic can be used to store XML and JSON data. These documents can be as simple or as complex as you like. They can be a single value if you like, just like a K-V store, but this is not very efficient (just like a K-V store!). Every MarkLogic document has a little overhead – around 750 bytes. So storing a 2-byte value means the overhead is costly.

Better to create a document with many values, and alter part of that document. For example, rather than having a key per product per attribute – like PRODUCT1234-QuantityInStock – you instead have a document for Product1234 that holds quantity in stock, product name, title, bar code, and all other metadata. (This is why K-V stores also support Hashes, which I cover later on in this article…)

But how to manage this data? MarkLogic’s REST API supports a PATCH operation. This can accomplish several things. In our scenario, it can be used to increment or decrement a value by 1 or any other amount. Use the replace patch mechanism with apply set to the operation ml.add. You can also multiply, divide, do regular expression search and replace, concatenate strings, or do a substring operation.
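
Here is a hedged sketch of what an increment/decrement patch body might look like, built in Python so the JSON is guaranteed to be well formed. The shape (a "patch" array holding a "replace" with "select", "apply" and "content") is my understanding of MarkLogic’s JSON patch format and should be checked against the REST API documentation; the /quantityInStock path is a hypothetical property for illustration.

```python
import json


def build_increment_patch(select_path: str, amount: int) -> str:
    """Return a JSON patch body that adds `amount` to the value at `select_path`."""
    patch = {
        "patch": [
            {"replace": {
                "select": select_path,  # path to the value to change
                "apply": "ml.add",      # the built-in operation named in the text
                "content": amount,      # a negative amount decrements
            }}
        ]
    }
    return json.dumps(patch)


# Decrement the (hypothetical) stock level by one.
body = build_increment_patch("/quantityInStock", -1)
```

The body would be sent with a PATCH to /v1/documents?uri=whatever and a Content-Type of application/json.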

Check a key exists

If your document in MarkLogic holds your entire key value, then you can simply do a HEAD to /v1/documents?uri=whatever. If you get a 404 response, then the document does not exist.
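
As a small sketch of that check in Python: build the HEAD request, then interpret the status code that comes back. The function names are hypothetical, nothing is sent here, and authentication is omitted.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_exists_request(base_url: str, uri: str) -> Request:
    """Build (but do not send) a HEAD request for the document at `uri`."""
    url = f"{base_url}/v1/documents?{urlencode({'uri': uri})}"
    return Request(url, method="HEAD")


def key_exists(status_code: int) -> bool:
    """Interpret the HEAD response: 404 means no such key, 2xx means it exists."""
    if status_code == 404:
        return False
    if 200 <= status_code < 300:
        return True
    # Anything else (401, 500, ...) tells us nothing about existence.
    raise RuntimeError(f"unexpected status {status_code}")
```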

If you want to check for a value within a single element within an XML or JSON document though, it’s a little bit trickier.

MarkLogic indexes what is in a document, and its structure, and what elements exist. People don’t normally check if an element either exists within a document or it doesn’t – they generally pull back the whole thing and check themselves. But you can check to see if a particular document has a key at all, by checking the universal index via search.

Effectively you construct a search that is limited to one document – the URI you care about. You further use an element-query (now called a container query) to check for the existence of the element. If your search results return one URI – your document – then it exists. Otherwise, it does not.

You construct this using POST /v1/search. I’ve not tested the below, but something like this in the POST body should work. The document-query checks that my doc exists, and the container-query checks that my property exists:-

{
  "search": {
    "query": {
      "queries": [{
        "and-query": {
          "queries": [
            {"document-query": {"uri": [ "/my/doc/uri.bin" ]}},
            {"container-query": {"json-property": "mypropertykey"}}
          ]
        }
      }]
    },
    "options": { "return-results": false, "return-metrics": false }
  }
}

(NB return-results: false means no result data will be returned, but you will still get a count which you can check – it’ll be 0 or 1.)
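
If you are generating this body from code, building it as a data structure and serialising it avoids hand-written JSON mistakes. The sketch below is as untested against a live server as the query above. The helper names are hypothetical, and reading the count from a "total" field in the response is an assumption about the search response format.

```python
import json


def build_property_exists_search(doc_uri: str, property_name: str) -> str:
    """Return a combined query asking: does this doc exist AND hold this property?"""
    combined = {
        "search": {
            "query": {
                "queries": [{
                    "and-query": {
                        "queries": [
                            # Limit the search to the one document we care about.
                            {"document-query": {"uri": [doc_uri]}},
                            # Require the property to be present in it.
                            {"container-query": {"json-property": property_name}},
                        ]
                    }
                }]
            },
            "options": {"return-results": False, "return-metrics": False},
        }
    }
    return json.dumps(combined)


def property_exists(search_response: dict) -> bool:
    """Assumption: the response reports the match count in a "total" field."""
    return search_response.get("total", 0) == 1


body = build_property_exists_search("/my/doc/uri.bin", "mypropertykey")
```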

Append data to a list

The MarkLogic Patch operation also supports an insert mode. With this you can append one or more elements to an existing container element. So you could easily implement lists using insert with position set to ‘after’.

Stack operations

Patch’s insert mode allows you to specify position as ‘before’ or ‘after’, allowing you to place items at the front or end of a list. Using Patch’s delete mode you can also delete the first (/list/parent/child[1]) or last (/list/parent/child[fn:last()]) element of the list.

You can also fetch a particular element rather than a full document. To do this you create a very simple transform (XQuery or XSLT or server side JavaScript) and use it with the GET /v1/documents?transform=mytransform endpoint. So you could have one saved for ‘first item in list’ and one for ‘last item in list’, perhaps even a ‘get the whole list’ if you have a complex structure in your key value documents.

Thus not only can you implement stacks (push and pop equate to ‘after’ and ‘fn:last’), but also FIFO queues (insert with ‘after’, then remove the first element)!
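
To make the mapping concrete, here is a local Python sketch of the same semantics using a deque. It does not talk to MarkLogic at all; each deque call simply stands in for the patch operation named in its comment.

```python
from collections import deque

items = deque()

# Insert with position 'after' the last element: add to the end of the list.
for value in ("a", "b", "c"):
    items.append(value)

# Stack (last in, first out): delete the last element.
stack_pop = items.pop()

# Queue (first in, first out): delete the first element.
queue_pop = items.popleft()

# Insert with position 'before' the first element: add to the front.
items.appendleft("z")

remaining = list(items)
```

The same list therefore behaves as a stack or a queue depending only on which end you delete from.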

Hash operations

A Hash in Redis works just like a HashMap in Java. A value holds a set of named fields, each of which holds exactly one value. You can then perform operations on those fields within the hash.

In a document store, you instead create elements within the document, like this:-

{
  "field1": "some value",
  "field2": "some other value",
  "field3": 45
}

These data structures are much easier to manage in a single command as an aggregate in a Document store than they are in a K-V store. If you’re storing an aggregate in a K-V store and writing a whole lot of functions to update them, consider using a document store instead.

Fetching and updating aggregates is a simple case of storing or retrieving a document in a document store. Using patch you can also modify just part of a document.

Geospatial field operations

Many document stores support ‘secondary indexes’. This is where you instruct the NoSQL database that one or more elements within the document need indexing specially. Typically this is for range (less than, greater than) queries, but this can also be for geospatial queries.

Simply configuring a geospatial index in MarkLogic, listing the fields holding the longitude and latitude, is enough to support geo queries. No need to instruct the database about a key-value being a geo value every time you add a document. It’s a one-off piece of configuration.

MarkLogic has many, many, many advanced geospatial features, including search for points in a document (documents can hold many points) that are in a bounding box, point-radius (circle), or polygon. You can even return search results ordered by reverse distance from the centre of a circle or polygon! (Like the typical hotel search application).

Again, you can also check for documents where two fields (the longitude and latitude of your geo value) exist – so you can find just documents with geo data in them too.
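
To show what a point-radius query with distance ordering computes, here is a plain Python sketch using the haversine formula. This is not how MarkLogic implements it (the database answers from its geospatial index rather than scanning documents); the field names, sample documents and coordinates are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(docs, centre, radius_km):
    """Return docs whose point lies inside the circle, nearest first.

    Docs missing either geo field are skipped, mirroring the
    'both fields exist' check described above.
    """
    hits = []
    for doc in docs:
        if "latitude" not in doc or "longitude" not in doc:
            continue
        d = distance_km(centre[0], centre[1], doc["latitude"], doc["longitude"])
        if d <= radius_km:
            hits.append((d, doc))
    return [doc for _, doc in sorted(hits, key=lambda hit: hit[0])]


hotels = [
    {"name": "Hotel A", "latitude": 51.5155, "longitude": -0.1410},
    {"name": "Hotel B", "latitude": 51.5033, "longitude": -0.1195},
    {"name": "Hotel C", "latitude": 48.8566, "longitude": 2.3522},
    {"name": "No geo data"},
]
nearby = within_radius(hotels, centre=(51.5074, -0.1278), radius_km=5)
nearby_names = [doc["name"] for doc in nearby]
```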

MarkLogic sports far too much geo functionality to document here. I’d advise you go check it out, ‘cos it’s awesome.

Publish/Subscribe

MarkLogic Server includes a ‘reverse query’ index. This enables you to take any search – including one for any changes for a single document, or part of a document – and make an alert fire when a document is created or changes and meets all your search criteria (which can be as arbitrarily complex as you like).

These alert actions can be XQuery or JavaScript code that does anything. They could send a message to an email server or web service. I’ve personally used Node.js to create a W3C WebSockets server that keeps a connection open to a client. The Node.js server also has a web service endpoint. MarkLogic fires a message to this endpoint when the alert is triggered, and Node.js passes this through to the web app client using WebSockets.

Not the easiest of setups when compared with Redis – MarkLogic doesn’t support keeping a connection open outside of a transaction boundary for ACID reasons, so it doesn’t support the WebSockets standard – but it is possible to do with only a few lines of code in Node.js.
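
The alerting idea itself can be sketched in a few lines of Python: saved queries are registered up front, and each document write is checked against them. The AlertRegistry class is hypothetical and deliberately naive. It evaluates every saved query on every write, whereas a real reverse query index exists precisely to avoid that.

```python
from typing import Callable

Document = dict
Query = Callable[[Document], bool]
Action = Callable[[str, Document], None]


class AlertRegistry:
    """Toy stand-in for saved searches that fire an action on matching writes."""

    def __init__(self) -> None:
        self._rules = []   # (query, action) pairs
        self._store = {}   # uri -> document

    def subscribe(self, query: Query, action: Action) -> None:
        """Register a saved query and the action to run when a write matches."""
        self._rules.append((query, action))

    def write(self, uri: str, doc: Document) -> None:
        """Create or update a document, then fire every matching alert."""
        self._store[uri] = doc
        for query, action in self._rules:
            if query(doc):
                action(uri, doc)


received = []
registry = AlertRegistry()

# Alert when a product goes out of stock; the action here just records the URI,
# where a real one might call a web service endpoint.
registry.subscribe(lambda doc: doc.get("quantityInStock", 1) == 0,
                   lambda uri, doc: received.append(uri))

registry.write("/products/1234.json", {"quantityInStock": 5})
registry.write("/products/1234.json", {"quantityInStock": 0})
```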

In Summary

Can you use a document store as a K-V store successfully? Yes. And for all operations too? Yes. Yes, it will be slower, but speed is relative.

For example, my laptop, running MarkLogic in a CentOS VM under VMware with only one forest (shard) configured, can do between 3000 and 4000 writes or updates per second. Typically for an 8 core machine you’d have approx 6 shards. So you can get into the tens of thousands of updates per second easily on a single machine, let alone the minimum 3 node cluster (minimum for HA).
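
The back-of-the-envelope arithmetic behind that estimate, assuming (optimistically) that throughput scales linearly with the number of forests:

```python
# Figures from the text: measured writes per second with a single forest.
writes_per_forest = (3000, 4000)
forests_per_machine = 6

# Assumption: each extra forest adds the same throughput as the first.
low, high = (rate * forests_per_machine for rate in writes_per_forest)
```

That gives a range of 18 000 to 24 000 updates per second for one machine, before adding any more nodes.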

If you need extreme speed, then use a true K-V store. Redis (and Redis Cloud) are great open source options. Aerospike is my fave commercial option (it’s great at natively managing flash storage) – but if you don’t need more than 100 000 updates per second, then you can use a document store like MarkLogic fine. (They’ll probably work even quicker if you disable indexing you don’t need).

Other document stores may be able to support less functionality – especially around element-exists queries and document patching operations (increment/decrement).

Got MarkLogic? Need a K-V store? Don’t need more than 100 000 updates per second per server? Then feel free to use MarkLogic.
