What do we need from a content ingest tool?

If you look around the contents of developer.marklogic.com you will find many a project linked on GitHub that provides a piece of functionality to extend the ingest or transform operations over a MarkLogic database. We’ve made great strides with mlcp (Content Pump) in version 6.0 to create a single, strong Java-based ingest tool. We also have Information Flows in Information Studio that can pull information from the filesystem on the MarkLogic server, transform it, and store it.

These tools are great in their own right, but nothing pulls them all together without writing some XQuery code: no admin interface to rule them all. All the pieces are there, but there’s no administrative glue linking them together.

Consider the following task. Let’s say you want to fetch new messages from a Twitter feed every 30 minutes. Here’s the processing I want to do (sketched in XQuery after the list):-

  1. Start condition: Scheduled task, every 30 minutes
  2. Fetch JSON document from URL: Build URL string, Security info, resultant document collection, permissions, properties -> Single document
  3. Convert JSON to XML: URI, collection, permissions, properties -> Single document
  4. For each message node: XPath -> nodes
  5. Transform document XML Fragment: XSLT / XQuery, URI, collection, permissions, properties -> Single document
  6. Collect (post for-each): variable list, sequence variable name -> collection of URIs
  7. Create report document: uri, collection, permissions, properties -> Single document
  8. Complete process
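
Hand-coded, steps 2 to 7 might look something like the sketch below. This is a sketch only: the feed URL, the *:item element name, and the URIs are all made up, and the transform is trivialised; step 1 would be a scheduled task configured in the Admin UI, pointing at this module.

    xquery version "1.0-ml";
    (: Sketch only: $feed-url, the *:item element name and all URIs are
       hypothetical. Step 1, the 30-minute schedule, would be a scheduled
       task in the Admin UI pointing at this module. :)
    import module namespace json = "http://marklogic.com/xdmp/json"
      at "/MarkLogic/json/json.xqy";

    declare variable $feed-url := "https://api.twitter.com/example-feed.json";

    (: Step 2: fetch the JSON document from the URL :)
    let $response := xdmp:http-get($feed-url)
    (: Step 3: convert the JSON text into an XML tree :)
    let $feed-xml := json:transform-from-json(fn:string($response[2]))
    (: Steps 4 to 6: XPath over the message nodes, transform and store each :)
    let $uris :=
      for $msg at $i in $feed-xml//*:item
      let $uri := fn:concat("/twitter/message-", $i, ".xml")
      return (
        xdmp:document-insert($uri,
          <message>{ $msg/node() }</message>,   (: transform, trivialised :)
          xdmp:default-permissions(), "twitter-feed"),
        $uri
      )
    (: Step 7: create a report document listing what was ingested :)
    return xdmp:document-insert("/twitter/report.xml",
      <report>{ for $u in $uris return <ingested>{ $u }</ingested> }</report>,
      xdmp:default-permissions(), "twitter-feed")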

I could code the above in XQuery, much as the sketch shows, but if all the constituent parts already exist, I could instead just configure them in some sort of simple process flow. Effectively, each process step would have this associated with it:-

  1. Namespace, module and method being invoked
  2. Input type: None, document, documents, node, nodes (with or without schema specified)
  3. Output type: as above
  4. Input Parameters: Simple, list, hardcoded string/xml, linked to variable, sequence

This is quite a straightforward API to add processing steps for, and to do so quite quickly. It would also be relatively trivial to construct an XML process definition schema, and a module to execute each process step. You could even serialise the process definition into XQuery. Cleverer still, perhaps use XQueryX as the file format, but only support a subset of the language’s features in order to make diagramming easy.
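
As a rough illustration of what such a definition might look like (every element and attribute name here is invented, not a real MarkLogic vocabulary):

    <process name="twitter-ingest">
      <step n="1" type="initialisation" kind="scheduled-task" interval-minutes="30"/>
      <step n="2" type="document-production"
            namespace="http://example.com/steps"
            module="/steps/http-fetch.xqy" function="fetch"
            input="none" output="document">
        <param name="url" kind="hardcoded">https://api.twitter.com/example-feed.json</param>
        <param name="collection" kind="hardcoded">twitter-feed</param>
      </step>
      <step n="3" type="analysis"
            namespace="http://example.com/steps"
            module="/steps/json-to-xml.xqy" function="convert"
            input="document" output="document"/>
      <!-- ...one step element per remaining entry in the list above -->
    </process>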

There are a number of step types. These can be broadly categorised as follows:-

  1. Initialisation step: How the process is started. Either by an alert, db trigger, CPF pipeline, or scheduled task
  2. Document production step: Always results in the creation of one or more documents (thus has parameters including collection, permissions, properties, URI). E.g. get from URI, get from DB (fn:doc/fn:collection), unpack (zip), get from file system, get from Information Flow, ISYS filter
  3. Analysis step: Produces a node or node sequence. Examples include apply XPath, for-each, set variable (or add to sequence/map), group by, and transform XML fragment (XSLT or XQuery function)
  4. Altering step: Set document properties/collection/permissions, delete document, replace-node. A collect step (like a rendezvous after a for-each) can also merge multiple sequences into a single sequence
  5. Completion step: Where the process stops

For complex processes, you may also want a ‘call sub-process’ step.
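
Executing a step from a definition like the one sketched earlier need not be much more than a lookup and apply. A minimal interpreter sketch, assuming the invented step/param format above and step functions that each accept (input, params):

    (: Minimal interpreter sketch, assuming the invented <step> format
       above, where each step function accepts ($input, $params).
       Initialisation and completion steps would be handled specially. :)
    declare function local:run-step(
      $step as element(step),
      $input as item()*
    ) as item()*
    {
      let $fn := xdmp:function(
        fn:QName(fn:string($step/@namespace), fn:string($step/@function)),
        fn:string($step/@module))
      return xdmp:apply($fn, $input, $step/param)
    };

    (: Thread each step's output into the next step's input. :)
    declare function local:run(
      $steps as element(step)*,
      $input as item()*
    ) as item()*
    {
      if (fn:empty($steps)) then $input
      else local:run($steps[position() gt 1],
                     local:run-step($steps[1], $input))
    };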

The benefit of the above is that no matter where content is coming from, you have a single content processing mechanism. Also, because this mechanism is within MarkLogic itself, you have the full range of MarkLogic functionality available to drive your content processing. If you use CPF as the initiating mechanism, you can even perform all of this before the commit completes.

The following are gaps in the design, or limitations that we may come across quickly:-

  1. Batches – the biggest limitation I can see. Normally MarkLogic pipelines or alerts fire against a single document. You may want to fire a process, though, after adding a whole batch of documents – say after an MLCP call or after executing an Information Flow. We need some way to link a batch name to a particular process to be executed.
  2. Interpretation vs. compilation – we’re going to want this to be fast, which means an entire process should be compiled as a single XQuery module and then executed. This is quicker than doing an eval for each and every step in a process (see the toy sketch after this list). If we use XQueryX as our storage format, this compilation step largely goes away. We’d need to be careful about how we rendered a process and how we drew a diagram when loading XQueryX. Perhaps the XQuery 3.0 annotations can help with this?
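
To make the compile-once idea concrete, here’s a toy illustration. The step format is invented – each step just carries a literal XQuery expression – but it shows the shape: serialise the whole process to one module string, then evaluate it with a single xdmp:eval.

    (: Toy "compilation": each invented <step> holds a literal XQuery
       expression; the whole process becomes one module, evaluated once. :)
    declare variable $process :=
      <process>
        <step var="raw">fn:doc("/twitter/feed.xml")</step>
        <step var="total">fn:count($raw//message)</step>
      </process>;

    let $module := fn:string-join((
      'xquery version "1.0-ml";',
      for $s in $process/step
      return fn:concat("let $", $s/@var, " := ", fn:string($s)),
      "return $total"
    ), "&#10;")
    (: One eval for the entire process, rather than one per step. :)
    return xdmp:eval($module)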

Things this could be used for:-

  1. Monitoring server file system changes over time. (Like the old FileNet Records Crawler)
  2. Applying standardised or configuration driven processing of incoming content. E.g. if you received a dump of data from a publisher you can process it all as a single batch without user intervention.
  3. As a method for breaking down complex ingestion XQuery into discrete tasks, enhancing re-use
  4. Accessing the full range of MarkLogic capabilities when processing documents, without resorting to Java (e.g. Corb)
  5. Extending the usefulness of mlcp (Content Pump) on its own.

If you think the above would be useful, please leave a comment!
