Something often overlooked in talk about Big Data is what to do with the data itself. Many people miss the greater advantages of a database like MarkLogic by thinking in the old mindset of store and retrieve. This blog post aims to highlight what I think is a killer feature of MarkLogic, and one that is unique in the NoSQL / Big Data / Analytics world.
Much of the time, a person searching for information is looking for what's new: it won't be the first time they have searched for it, and they will run a similar search very regularly. It's akin to visiting the BBC news home page each day to see what has happened. For that we have the BBC. Unfortunately, for our own corporate data we have no one to process and disseminate it for us; we must each work in our own way. Sure, you can run reports for known information, but what about the known unknowns? (See, Dick Cheney was useful for something.)
Why do I care about alerting?
Think of an information analyst. They may have created one or more complex searches over all information to pull back a full view of the subject they're looking into. What they may do is save that search. This is pretty easy, and can be done by most combinations of search engine and application server, although it always requires custom integration code. What about running that search on new documents? Most search engines take time to index new information, and the app server would then need to know a document is indexed before it runs a search over it. Not an easy task. It also tends to be done in batch mode. That may be OK when you're receiving Facebook posts or order summary updates… but what if your information is altogether more critical?
What if you want to be alerted the instant a message is intercepted mentioning something connected with an attack and a particular network of people? What if you want to know of all large trades or SEC filings for a company in which you hold millions in stock? The delay in being alerted to this information is measured in lives and millions of pounds. You can't afford to wait for a search engine to index new data. You can't afford to wait for a batch process to send a summary. You need to know as soon as possible.
Thankfully there is a way to do this, because MarkLogic Server is a single product combining an ACID-compliant document database, an inbuilt search engine that is updated in real time, and an application server that can execute custom code. MarkLogic has an Alerting API. This enables a developer to craft an application that allows a user (any user with sufficient privileges for that database) to be alerted to new content that matches a search, as soon as that document arrives. When a transaction adding a new document completes, the document is instantly searchable. The Content Processing Framework (CPF) can execute actions and transition documents between states (akin to finite state automata) immediately after the document is added*. This can run all alerting rules very fast, and make you aware of information as soon as it becomes known to your organisation.
*CPF can also be configured to process documents just before they are committed to the database, which is useful if you want them enriched before being made available. For example, when tagging mentions of the place name 'Walsall' you might replace that text with <location lon="-1.9837" lat="52.5888">Walsall</location> in the XML content. This would enable your MarkLogic searches to use a geospatial facet: you could draw a square on the map, say "Give me all messages in this area", and the document would come back. No more having to guess every place name in the corpus of your documents.
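As a taster, here's roughly what that box-on-a-map search could look like in XQuery. This is a minimal sketch, assuming you have set up an element-attribute-pair geospatial index over the location element's lat/lon attributes (index configuration not shown):

xquery version "1.0-ml";
(: Sketch: find everything mentioned inside a box drawn around Walsall.
   Assumes an element-attribute-pair geospatial index on location/@lat and location/@lon. :)
cts:search(fn:collection(),
  cts:element-attribute-pair-geospatial-query(
    xs:QName("location"), xs:QName("lat"), xs:QName("lon"),
    cts:box(52.5, -2.1, 52.7, -1.9))) (: south, west, north, east :)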
How does it work?
In MarkLogic you create an alert configuration. This is just a logical, named container for the information that follows. You then associate rules with this alert configuration URI, and one or more actions: XQuery modules that perform some useful function. A simple example might be to update that user's news-for-me.xml document, which is displayed in their web app dashboard on every visit. Actions could just as easily send emails, SMS or instant messages; whatever you want.
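Here's a rough sketch of such an action module. The /news-for-me.xml URI and its <news> root element are hypothetical, made up purely for illustration:

xquery version "1.0-ml";
import module namespace alert = "http://marklogic.com/xdmp/alert"
  at "/MarkLogic/alert.xqy";

(: External variables that the alerting framework binds for every action module :)
declare variable $alert:config-uri as xs:string external;
declare variable $alert:doc as node() external;
declare variable $alert:rule as element(alert:rule) external;
declare variable $alert:action as element(alert:action) external;

(: Append a pointer to the matching document to the user's (hypothetical) dashboard feed :)
xdmp:node-insert-child(fn:doc("/news-for-me.xml")/news,
  <item uri="{xdmp:node-uri($alert:doc)}"
        rule="{alert:rule-get-name($alert:rule)}"/>)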
To trigger the event you have two options: do it programmatically (e.g. 'run all my saved searches against the whole system now') or, more likely, on document ingest via the Content Processing Framework (CPF). CPF comes with a built-in alerting 'pipeline', so you just enable that and voilà! Alerts go a-flyin'. Just create a CPF domain, attach the alerting and status change pipelines, and tell your alert config about that domain. Done.
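For the programmatic route, the Alerting API can run a document through every rule in a config on demand. A minimal sketch, using the 'add-place' config URI and document URI from later in this post:

xquery version "1.0-ml";
import module namespace alert = "http://marklogic.com/xdmp/alert"
  at "/MarkLogic/alert.xqy";

(: Evaluate all rules in the "add-place" config against one document, right now :)
alert:invoke-matching-actions("add-place",
  fn:doc("/myholiday.xml"), <alert:options/>)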
Big Data sometimes requires Biiiiiiiiig alerting
Consider my previous example of linking place names to longitude and latitude. You could write a massive algorithm to trawl through every single word of every new document and fetch the lat/lon for each place name found, but that is a huge task. Using alerting with one reverse query per place name (yes, you heard that right, one per place name) can be much quicker. After all, MarkLogic's universal index encompasses every word, and the search API is very quick against this index. All you need to do is register a cts:word-query as a rule in an alert for each place name (the place name being the word in the word query) and have it fire some code to link that longitude and latitude to the document, either by editing the document or by enriching the text in the manner mentioned above.
Note that I talk about an alert for every place name. This sounds like a big ordeal but, given the speed of the search API and especially the mechanism of a reverse query in an alert, it is very quick and lean. No trawling over every word in a message. For anyone who wonders why it's called a reverse query: instead of saying 'give me documents that mention Walsall' you're saying 'does this document mention Walsall', which turns the search on its head, hence reverse query.
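If you're curious, the primitive underneath all this is exposed directly as cts:reverse-query. A sketch, assuming you've stored serialised cts:query XML in rule documents in a 'rules' collection (both names made up for illustration):

xquery version "1.0-ml";
(: Which stored queries match this document? cts:reverse-query turns the search around. :)
let $doc := fn:doc("/myholiday.xml")
return cts:search(fn:collection("rules"), cts:reverse-query($doc))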
Where to start?
I found an excellent resource at Gazetteer.co.uk. This organisation maintains a database of all GB place names, including historical places that no longer exist (useful for genealogy among other things). Very useful, and very cheap at only £15! The data comes in CSV format, so I'll be converting each row into an XML document and keeping it in a 'placenames' collection in my MarkLogic system. I'll then be able to create a routine that runs through each place name adding an appropriate alert. Note: the data integrity of the CSV isn't fantastic, thanks to the usual quote and comma issues inherent in that format. For now though, I'll just put a single document in my database by hand.
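The converted CSV isn't reproduced here, but a minimal placename document consistent with what the alert module below expects (name, lon and lat elements; coordinates approximate) looks like this:-

xquery version "1.0-ml";
(: A minimal, hand-made placename document for York :)
xdmp:document-insert("/placenames/york.xml",
  <placename>
    <name>York</name>
    <lon>-1.082</lon>
    <lat>53.958</lat>
  </placename>,
  xdmp:default-permissions(), "placenames")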
I'll now set up an alert in my database. Note that you'll need to configure your database for alerting first, by going to the database admin web application and enabling 'fast reverse queries'. We'll install CPF later; for now we'll test the alert by hand.
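If you'd rather script that setting than click through the admin UI, something like this should do it. This is a sketch: I'm assuming the Admin API setter name and the 'Documents' database here, so check both against your version's documentation:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumption: admin:database-set-fast-reverse-searches is the setter behind the
   'fast reverse queries' checkbox - verify against your Admin API docs :)
let $config := admin:get-configuration()
let $config := admin:database-set-fast-reverse-searches(
  $config, xdmp:database("Documents"), fn:true())
return admin:save-configuration($config)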
NOTE: Don’t worry about copying/pasting the below code. It can all be downloaded from the link at the bottom of the article.
Here’s a document I have:
let $doc :=
  <article>
    <title>My holiday in the countryside</title>
    <content>
      <para>I love the countryside. And caravanning. Yes it's fantastic…</para>
      <para>This time we decided to go to York as we haven't been there in a while.</para>
    </content>
  </article>
I’m going to add a module to my modules database called lib-alert-add-placename-location.xqy. I was always a snappy file namer. 8o) Here’s its contents:-
xquery version "1.0-ml";
(: This module is intended to be executed by an alert. The word query that matches the new document is used to find
 : a placename document. The new document is then enriched to include the lon/lat from the placename inline in the
 : content :)
import module namespace alert = "http://marklogic.com/xdmp/alert"
  at "/MarkLogic/alert.xqy";

declare variable $alert:config-uri as xs:string external;
declare variable $alert:doc as node() external;
declare variable $alert:rule as element(alert:rule) external;
declare variable $alert:action as element(alert:action) external;

let $log := xdmp:log("*** ALERT FIRED ***")
let $query := alert:rule-get-query($alert:rule)
let $log := xdmp:log($alert:rule)
(: I'm fetching the info I need from the rule's description, but it could be passed in the rule <options> element :)
let $placename := $alert:rule/alert:description/text()
let $loc := (fn:collection("placenames")/placename)[./name/text() = $placename]
let $log := xdmp:log($loc) (: for debug purposes :)
let $log := xdmp:log($alert:doc)
let $newdoc :=
  if ($placename) then
    cts:highlight($alert:doc, $placename,
      <placename lon="{$loc/lon/text()}" lat="{$loc/lat/text()}">{$cts:text}</placename>)
  else $alert:doc
let $log := xdmp:log($newdoc)
return xdmp:node-replace($alert:doc, $newdoc) (: node-replace rather than document-insert avoids an infinite loop :)
Once CPF is configured, and I've created the alert configuration with the following code:-
xquery version "1.0-ml";
import module namespace alert = "http://marklogic.com/xdmp/alert"
  at "/MarkLogic/alert.xqy";

let $config := alert:make-config(
  "add-place", "Add Places",
  "Alerting config for adding a place",
  <alert:options/>)
return alert:config-insert($config)
and
xquery version "1.0-ml";
import module namespace alert = "http://marklogic.com/xdmp/alert"
  at "/MarkLogic/alert.xqy";

let $action := alert:make-action(
  "xdmp:log", "log to ErrorLog.txt",
  xdmp:database("Modules"), "/",
  "/lib-alert-add-placename-location.xqy",
  <alert:options>put anything here</alert:options>)
return alert:action-insert("add-place", $action)
and
xquery version "1.0-ml";
import module namespace alert = "http://marklogic.com/xdmp/alert"
  at "/MarkLogic/alert.xqy";

let $rule := alert:make-rule(
  "simple", "York",
  0, (: 0 means the current user - equivalent to xdmp:user(xdmp:get-current-user()) :)
  cts:word-query("York"),
  "xdmp:log", (: the action name registered above :)
  <alert:options/>)
return alert:rule-insert("add-place", $rule)
and
xquery version "1.0-ml";
import module namespace alert = "http://marklogic.com/xdmp/alert"
  at "/MarkLogic/alert.xqy";

alert:config-insert(
  alert:config-set-cpf-domain-names(
    alert:config-get("add-place"),
    ("articles")))
I can now add my document, completing the query started above with its return clause:-
return xdmp:document-insert(
  "/myholiday.xml", $doc,
  xdmp:default-permissions(),
  (xdmp:default-collections(), "articles"));
I now have a look at the document in the database, and see it has been modified:-
fn:doc("/myholiday.xml") =>
<article>
<title>My holiday in the countryside</title>
<content>
<para>I love the countryside. And caravanning. Yes it's fantastic…</para>
<para>This time we decided to go to <placename lon="-1.082" lat="53.958">York</placename> as we haven't been there in a while.</para>
</content>
</article>
Cool huh?
You can access all my code from my Query Console Workspace Export here.