MarkLogic and Linked Data

I attended the UK Government Linked Data Working Group event on Monday this week. I went because, over the last three weeks, I don’t think I’ve had a single prospect or partner who has not mentioned Linked Data, Open Data or publishing to Triple Stores.

History of Open and Linked Data

There is a common pattern and history behind these requests. Since it gained office, the Coalition Government has been keen to open up Government data to third party organisations. Partly this has been from a desire to be open and transparent, but partly it’s also been to encourage new start up companies to create compelling applications from Government data. There’s even an Open Government Licence to go with the data, consistent throughout Government. The idea is that the economy could be helped through greater innovation based on Government data.

In addition to this, over the past 10-15 years there has been a lot of talk around the Semantic Web, what its benefits could be, and how to describe and link data in much the same way as we do for pages now on the Web. This has evolved over time into the realisation that modelling relationships between data is in and of itself a valuable exercise, because you can draw conclusions from those relationships. So for example, if you know that Adam is employed by MarkLogic, and MarkLogic’s address is One Kingdom Street, London, then you can surmise that Adam can be sent mail for work at One Kingdom Street.
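The Adam example above can be sketched in a few lines of Python. This is only an illustration of the reasoning step, not how any particular Triple Store does it; the predicate names (employedBy, hasAddress, workAddress) are made up for the sketch, and it assumes one address per organisation.

```python
# Facts held as (subject, predicate, object) triples.
# The predicate names are hypothetical, chosen only for this sketch.
facts = {
    ("Adam", "employedBy", "MarkLogic"),
    ("MarkLogic", "hasAddress", "One Kingdom Street, London"),
}


def infer_work_addresses(triples):
    """Derive (person, workAddress, address) from employedBy + hasAddress."""
    # Assumes each organisation has a single address.
    addresses = {s: o for s, p, o in triples if p == "hasAddress"}
    return {
        (person, "workAddress", addresses[employer])
        for person, p, employer in triples
        if p == "employedBy" and employer in addresses
    }


inferred = infer_work_addresses(facts)
```

Neither original fact mentions Adam’s work address; it only appears once the two relationships are joined.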

Current systems

This history and these various needs have resulted in an interesting set of requirements in the UK Public Sector. Many central Government organisations have put in place basic systems to collate information they know to be useful or required for reporting purposes. These may be old Relational systems, incoming CSV files from other departments / public institutions, or simply shared document repositories.

Many also have Triple Stores deployed, or will be selecting one soon. Those that do not have these systems yet are at least asking about them in Requests for Information (RFIs) for systems they know they will have to publish information from. This means that software vendors, including MarkLogic, will need an understanding of user requirements for storing and publishing Linked Data.

Processes around Linked and Open Data

First a couple of quick definitions. Open Data in the UK context is information the UK Government and local councils have published under the Open Government Licence. It can be used to develop both commercial and free applications. Linked Data is data that contains entities and properties. Some of these properties, e.g. City name, may be described in further detail in other systems. Examples of such systems are the geonames and dbpedia projects. So although the data we’re storing doesn’t define this City, it does say ‘Here’s the city this expense report is from, you can find more info on this city on dbpedia’. This is a data link, hence Linked Data. It’s the data equivalent of a web page’s hyperlink. References are via URIs. So if you’re interested, follow this URI and you’ll find further information on that City.

So we have some technologies we can use to publish Linked Data. We could publish an XML document with RDF information describing it. We could also publish this RDF information to an internal or centralised Triple Store to enable semantic queries to be run against it.

This is where a lot of the semantic web technologies have concentrated, along with dereferencing information, traversing RDF graphs, and creating registries to list alternative data that is the ‘sameas’ some candidate URI. This does leave one question unanswered though…
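To make the idea of a data link concrete, here is a minimal Python sketch that writes two facts about an expense report as N-Triples, one of the plain-text RDF formats a Triple Store can load. The report URI and the example.org vocabulary are hypothetical; the Dublin Core title property and the dbpedia resource URI are real identifiers. The rule that any object starting with http is a URI is a simplification for the sketch.

```python
def to_ntriples(triples):
    """Serialise (subject URI, predicate URI, object) tuples as N-Triples lines.

    Simplification: objects starting with http:// or https:// are written
    as URIs, anything else as a plain string literal.
    """
    lines = []
    for s, p, o in triples:
        if o.startswith("http://") or o.startswith("https://"):
            obj = f"<{o}>"
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


# Hypothetical URI for one expense report.
REPORT = "http://example.org/expenses/2012-03/report-17"

triples = [
    (REPORT, "http://purl.org/dc/terms/title", "March expense report"),
    # The link: we do not describe the city, we point at dbpedia's description.
    (REPORT, "http://example.org/vocab/city", "http://dbpedia.org/resource/London"),
]

ntriples = to_ntriples(triples)
```

The second triple is the data equivalent of a hyperlink: anyone who wants to know more about the city follows the dbpedia URI.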

Where does the data come from?

Traditionally data has been published by Government in formats people could read. Things like web pages, but mainly PDF files. (Although I suspect Gov’ is declaring war on that format now) Many CSV downloads are also prevalent. Where does the source information come from?

Source information could be generated by Department Analysts or Statisticians. This may be based on information from external agencies (e.g. local council’s housing data) or from internal information the department is responsible for (e.g. water quality, or spending reports). A lot of the time there are no automated systems for this information to be submitted, cleansed, collated or subsequently published. A lot is held within the minds and experience of civil servants, and can be a very manual process.

I have seen many examples of CSV files being sent throughout Government. Not a bad way to send data – at least it’s easy to convert and read. The problem is that no one is describing the data in the same way. Also, the same data may be described in multiple ways. An example is a CSV column of ‘Local Authority’ in one CSV file, say for March, whereas April’s CSV statistics list instead ‘Council’ for the column name. This lack of consistency is a barrier to information collation, and subsequent publication.

What can be done to help?

Firstly you need to determine the Classification of the incoming statistics set. Is it a list of Costs incurred? Are they expense reports from staff? Are they contracts signed with an organisation? Is it a monthly or annual submission? Determining the class of document gives you a hint as to the format of the incoming document. You may be then expecting three potential column names for ‘Local Authority’. This may be based on previous reports you’ve seen of this class.
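As a rough sketch of what knowing the expected column names buys you, the Python below maps the variants seen in earlier submissions onto one canonical name before the rows are stored. The alias table, the canonical names and the sample figures are all invented for illustration.

```python
import csv
import io

# Hypothetical variants seen in earlier submissions of this document class,
# mapped to the canonical name we want to store.
COLUMN_ALIASES = {
    "local authority": "local_authority",
    "council": "local_authority",
    "la name": "local_authority",
    "amount": "amount_gbp",
    "cost": "amount_gbp",
}


def read_normalised(csv_text):
    """Parse CSV text and rename known column variants to canonical names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        rows.append({
            # Unknown columns are kept under their original (trimmed) name.
            COLUMN_ALIASES.get(name.strip().lower(), name.strip()): value
            for name, value in row.items()
        })
    return rows


march = read_normalised("Local Authority,Amount\nLeeds,1200\n")
april = read_normalised("Council,Cost\nLeeds,1350\n")
```

After normalisation the March and April rows share the same keys, so they can be collated without anyone hand-editing the files.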

Systems like MarkLogic include a Support Vector Machine (SVM) classifier that can be trained using existing information you know is, for example, an expense report. You say ‘Here are 100 expense reports.’ You then ask ‘What is this new document?’ MarkLogic will make a best-effort guess as to the document class. Confirming or changing this class will continue to train MarkLogic until it has a high accuracy.
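The train, guess, confirm loop can be illustrated without any particular product. The sketch below is deliberately not an SVM and not MarkLogic’s classifier: it swaps in a much simpler nearest-centroid comparison over word counts, purely to show the shape of the workflow. The class names and training texts are made up.

```python
from collections import Counter
import math


def vectorise(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class CentroidClassifier:
    """Stand-in classifier: one summed word-count vector per class."""

    def __init__(self):
        self.centroids = {}

    def train(self, label, text):
        self.centroids.setdefault(label, Counter()).update(vectorise(text))

    def classify(self, text):
        vec = vectorise(text)
        return max(self.centroids,
                   key=lambda label: cosine(vec, self.centroids[label]))


clf = CentroidClassifier()
clf.train("expense_report", "taxi fare hotel receipt claim mileage")
clf.train("contract", "supplier agreement term signed contract value")

new_document = "hotel and taxi receipt for claim"
guess = clf.classify(new_document)

# A user confirming (or correcting) the guess becomes further training data.
clf.train(guess, new_document)
```

A real deployment would train on far more than one example per class; the point is only that every confirmation or correction feeds back into the model.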

Entity Extraction can be used to further enhance the data you’re storing. We used a CSV file above as an example, but what if the data you’ve got isn’t columnar in nature? An estimated 80% of an organisation’s data is unstructured. This may be in the form of letters of complaint, email messages or policy documents. Some data may be in a semi-structured form. Legislation, for example, follows a consistent format with Acts containing Sections. A human may say ‘this will come into force as defined in the Data Protection Act 2000 Section 12(d)’. Nothing in the data itself says that section 12(d) is an activation clause; it could just be a basic <h2> followed by a <p> tag in HTML.

Clearly some method of identifying an entity and saying ‘Hey, this is an activation clause’ or person, organisation, journal reference, and so on, is required. This process is called Entity Extraction. MarkLogic has the ability to plug different entity extraction engines into the Content Processing Framework (CPF). This is a set of pipelines that can be defined to process new or modified documents. You may have an internal list in your organisation of Legislation, People, or Organisations. You could use this as a basis to automate the marking up of your incoming data, regardless of the source format, to highlight entities. These can then be used for later searching or reporting.
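Marking up text from an internal list can be sketched very simply. The Python below is a plain lookup against a hypothetical list of known names, not one of the extraction engines that plug into CPF; the element names and list entries are assumptions for illustration.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical internal lists of known entities, keyed by entity type.
GAZETTEER = {
    "organisation": ["MarkLogic", "The National Archives"],
    "legislation": ["Data Protection Act"],
}


def mark_up_entities(text):
    """Wrap each known entity in an XML element named after its type."""
    term_types = {term: etype
                  for etype, terms in GAZETTEER.items()
                  for term in terms}
    # Try longer names first so they win over any shorter overlapping name.
    ordered = sorted(term_types, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))

    out, pos = [], 0
    for match in pattern.finditer(text):
        out.append(escape(text[pos:match.start()]))
        etype = term_types[match.group(0)]
        out.append(f"<{etype}>{escape(match.group(0))}</{etype}>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "<doc>" + "".join(out) + "</doc>"


marked = mark_up_entities(
    "MarkLogic stores data covered by the Data Protection Act & more."
)
```

Exact string matching like this misses misspellings and abbreviations, which is where a proper extraction engine earns its keep.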

Entity Enrichment takes this idea further and does something with an Entity that has been identified. You may, for example, link an Ordnance Survey National Grid Reference or Post Code to a longitude and latitude, or local authority, police authority, city or some other useful information. You may instead just record an RDF+XML link to an external repository of information, such as the dbpedia or geonames examples earlier in this post. Again, MarkLogic has the ability to do this within the CPF. This can also be accomplished using our Alerting framework pretty quickly. I recently added the above ‘city to longitude and latitude’ example to my MarkLogic demo in around 40 lines of XQuery. A few clicks of configuration, and all my incoming documents had place names enriched with longitude and latitude. Happy days.
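The demo mentioned above was written in XQuery inside MarkLogic; the Python below is not that code, just a standalone sketch of the same ‘city to longitude and latitude’ idea. It assumes cities are already marked up in a city element, and the lookup table holds approximate coordinates for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table; coordinates are approximate.
CITY_COORDS = {
    "London": (51.5074, -0.1278),
    "Leeds": (53.8008, -1.5491),
}


def enrich_cities(xml_text):
    """Add lat/lon attributes to every <city> element we have coordinates for."""
    root = ET.fromstring(xml_text)
    for city in root.iter("city"):
        coords = CITY_COORDS.get((city.text or "").strip())
        if coords:
            city.set("lat", str(coords[0]))
            city.set("lon", str(coords[1]))
    return ET.tostring(root, encoding="unicode")


enriched = enrich_cities("<report>Spend in <city>Leeds</city> rose.</report>")
```

Cities missing from the table are left untouched, so the enrichment can be re-run safely as the table grows.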

Once you have this data in, you’ll need to store it consistently to allow for rapid access. The way MarkLogic stores information internally is using XML. You could keep an original as CSV, PDF or whatever without an issue. Storing a representation as XML though starts to give you some advantages. Firstly, the above Entity extraction and enrichment is easily accomplished by inserting an XML element around a City name, for example. Secondly, classification can be done through either a meta data property or through storing the document in a particular collection. You can then use a single access mechanism – XQuery and XPath – to perform rich queries across all your data. Whether it came in as a relational dump in CSV, a PDF report, an XLSX spreadsheet, or as native XML. (Don’t panic – you can get at this without XQuery – E.g. via REST and JSON – see my other posts for more information)

The power of this approach is quite subtle, and unique to MarkLogic. The strategy internally is called an Enterprise Data Layer – you take information feeds from a variety of internal or external systems, or individual user uploads, and store them all in one place. In the UK Public Sector context you have a reasonable chance that the information is useful. For things like Social Media Analytics, however, you may not know the utility of the information until you store, enrich and query it. You may want to, for example, see mentions of flu symptoms throughout the country. If you spot a spike in one location, you may have found an outbreak.

Storing such a vast variety and volume of information is a sweet spot for MarkLogic. You can very quickly build indexes across your Enterprise Data Layer and perform searches over that data. All without the up front time spent data modelling associated with traditional relational systems. Just add in a few basic indexes and you’re away with an EDL Search application. You may create an interface allowing people to explore this data. (We even have a point and click wizard to help). Perhaps you have an incoming ‘Data Cleansing Pending’ collection where a user verifies MarkLogic’s assumptions about the data, and perhaps rejects that information for correction by the original author.

MarkLogic is a database with a search engine built in to the same kernel. No installing of separate, complex, and costly pieces of software only to spend ages trying to get them to work together. Our indexes can be used as search facets. These work like the side bar in search results on Amazon and eBay, but aren’t limited to text. You may add your own search functionality such as tag clouds, geospatial (map driven) search, or thesauri for similar terms. All built in at no extra cost, and no long lead time.
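For anyone who hasn’t met facets before, the idea is just counting index values across the current results and letting the user narrow by one of them. The Python below is a toy in-memory illustration with invented documents and field names; it says nothing about how MarkLogic’s indexes actually work.

```python
from collections import Counter

# Hypothetical documents with a couple of indexed fields.
documents = [
    {"title": "IT support contract", "doc_class": "contract", "authority": "Leeds"},
    {"title": "Travel expenses March", "doc_class": "expense_report", "authority": "Leeds"},
    {"title": "Network contract", "doc_class": "contract", "authority": "York"},
]


def search(docs, term, **selected):
    """Return docs whose title contains term and that match every selected facet."""
    term = term.lower()
    return [
        d for d in docs
        if term in d["title"].lower()
        and all(d.get(field) == value for field, value in selected.items())
    ]


def facet_counts(docs, field):
    """Count how many results carry each value of a field (the side bar numbers)."""
    return Counter(d[field] for d in docs if field in d)


results = search(documents, "contract")
by_authority = facet_counts(results, "authority")

# Clicking 'York' in the side bar narrows the same search.
narrowed = search(documents, "contract", authority="York")
```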

Once you have an interface to browse and explore information you then need to decide how to share it. Internally to the organisation you may annotate each document with a taxonomy, or folksonomy, perhaps from an existing ontology. You may do this manually, or have it automated by MarkLogic, and perhaps also add keywords. (Although this is kind of moot because of MarkLogic’s universal text index, automatically created across all data on check in). You may create your own dataset for a particular publication need. Perhaps you’re writing a report on all local government IT contracts across the UK. You then need to provide this to your user base.

Creating the links is obviously a major need! It is easy to envisage an interface that displays the document and any meta data extracted, or derived, from it. Using its classification and this information the interface could suggest or build links from known trusted sources such as dbpedia or other departments’ linked data repositories. Once all the above information is provided this should become a relatively simple requirement to complete. For known information schemas this can easily be automated. It’s an information analysis problem, not a technological ‘nuts and bolts’ or publishing problem.

Internal Users

Typically each role has a set of tools they are used to using. Business Intelligence Analysts usually prefer things like pivot tables in Excel, or a BI tool such as Cognos, Business Objects, Crystal Reports, or Tableau. Many organisations struggle with Relational systems because they have to go away and remodel the data in a way that enables efficient reporting. These remodelled copies are known as Data Warehouses. This means your data for your reports is not always up to date, as a warehouse is usually updated overnight with the ‘latest’ data. It also means a load more storage and database licenses. Not a welcome fact in these frugal times.

MarkLogic instead allows users of these tools to use the common ODBC interface to access MarkLogic indexes over documents of unstructured data in the same way as they’re used to for their relational data. Except that we don’t have to duplicate the data to a warehouse to do it. And all of our data is queryable the instant it’s added, in real time. All from the same single installation of MarkLogic. A much simpler and neater architecture. Oh and more scalable, by the way. (Can you tell I work in pre-sales yet!?!) With MarkLogic you can store all your Enterprise Information and report over it from your existing tools. This brings the whole world of your organisation’s data to your analysts’ fingertips.

External Users

Open Data means publishing information externally. Some information you have may be suitable, some such as personally identifiable and sensitive information may be restricted. You need to manage this information within your organisation’s security and data protection policies. This means some information in your Enterprise Data Layer may be protected. You could rely on Classification coupled with user confirmation to manage this process. Alternatively you could tag individual documents as being ‘disclosable’. You then need a mechanism for publishing.

MarkLogic can be accessed from both low privileged accounts – giving perhaps just access to ‘Open Data’ – and more secure accounts, such as for information owners and authors. All in a single database instance. MarkLogic is also used by many of the world’s publishers to present information in a variety of formats they need for publishing. You may create a representation in RDF+XML, or facts about a document in Turtle, JSON or any other form. These can be mapped via the MarkLogic REST API or your custom code to public URIs. This is because MarkLogic includes a built in application server for this need. Again, no complex extra infrastructure required.
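The combination of a ‘disclosable’ tag and differently privileged accounts boils down to a filter applied before anything is returned. The Python below is a deliberately naive sketch with made-up role names, URIs and flag; a real deployment would rely on the database’s own security model rather than application code like this.

```python
# Hypothetical documents carrying a 'disclosable' flag set during review.
documents = [
    {"uri": "/spend/2012-03.xml", "disclosable": True},
    {"uri": "/complaints/letter-42.xml", "disclosable": False},
    {"uri": "/drafts/unreviewed.xml"},  # never reviewed, so no flag at all
]


def visible_documents(docs, role):
    """Return the documents a role may see. Role names are hypothetical."""
    if role == "information-owner":
        return list(docs)
    if role == "open-data-reader":
        # Only explicitly approved documents; a missing flag means 'not disclosable'.
        return [d for d in docs if d.get("disclosable") is True]
    return []


public_view = visible_documents(documents, "open-data-reader")
owner_view = visible_documents(documents, "information-owner")
```

Treating a missing flag as ‘not disclosable’ is the safer default: nothing reaches the public view until someone has positively approved it.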

Alternatively, you may use the Content Processing Framework’s lifecycle event handlers to push published information to an external host. This could be as XHTML to a Web Content Management System (CMS), or as an RDF graph to your organisation’s Triple Store. Any of these mechanisms can be supported in MarkLogic. We also provide a RESTful API to access information. This supports XML and JSON as query response formats.


Having the desire to publish Linked Data is admirable and desirable. Organisations need to build a strategic infrastructure for this information management challenge as they have in the past for other types of challenges. This is an information management challenge and is more about process, knowledge, and governance than it is about the underlying technology. An analyst doesn’t care about Turtle – they care about getting answers. A member of the public doesn’t care about RDF links to other repositories, but they do care about searching and browsing for relevant information. A focus on the ultimate business solution is required.

This business case will always be required for Departments to commit budget to a Linked Data project. If you can prove the value of adding the links, then great, but more likely the value will be in assisting the analysts compiling the information and getting it ready for publication in a controlled and rapid manner. If you want a Linked Data project to succeed, you must first answer these questions.

Naturally, MarkLogic believes it has a lot to provide in the area of information collation, cleansing, enriching and publishing. We have proven capability in the Publishing space, but also recent success stories in the UK Media and Public Sector. The BBC used MarkLogic during the Olympics (see also slides from Jem Rayfield) to collate results and news stories up to the minute and make them accessible to millions of website users worldwide. Our partner, The Stationery Office (formerly HMSO), created the website for The National Archives to enable browsing of legislation and finding out what parts of law were in force at what time. This was powered by a MarkLogic database. This website even won an award for Open Data.

I am very upbeat about the possibilities of Linked Data as well as Open Data. I see a lot of good quality people in the UK Public Sector as well as our Partners working on these projects on a daily basis. Happily, they are all working in the same direction when it comes to these projects rather than each doing their own thing. Clearly initiatives like the Open Data User Group and Linked Data Working Group are having a positive effect on influencing government. These initiatives are maturing, and being joined by new ones such as the Open Data Institute being led by Jeni Tennison and others. Their remit is to encourage start ups to use this data and make profit from it, and to provide matched funding for these projects. This can only act to increase the demand for data, and its richness and usability, which can be enhanced through publishing it all, where possible, as Linked Data.

I can see myself blogging on this subject increasingly in the next few years as end to end solutions are put in place by MarkLogic and our partners in the UK Public Sector. The technical knowledge is out there – we all just need to work together on proving the business value in these implementations.
