Why NoSQL needs schema-free ETL tools…

Many NoSQL databases lack tooling and analytics. Here I discuss why a schema-agnostic modern ETL approach can help NoSQL vendors and customers…

For any operation involving data, or indeed computing generally, you need three things – Input, Processing and Output.

The difficulty of getting data into and out of NoSQL – the uncomfortable truth

NoSQL databases are a fantastic way to store disparate data, or data whose structure changes rapidly – for example where you don’t have control of the source formats.

What they tend to lack, though, is advanced tooling to onboard data in the first place (the Input piece), tooling to perform data engineering, advanced analytics and data science on that data (the Processing piece), and finally tooling to display the results or visualise what is contained in your NoSQL database (the Output piece).

For a long time this hasn’t been a barrier to the adoption of NoSQL. Adoption was traditionally bolstered by developers using these databases’ developer-friendly APIs to wrap them inside a bespoke application. This worked very well for growing the NoSQL market in its early years.

For this market to continue to take off, though, and challenge the traditional database vendors, a wider range of people than just API-savvy developers needs to adopt NoSQL.

Even developers don’t like writing boring ‘plumbing code’ – code that just moves data from one place to another. It’s dull and repetitive. Customers don’t like it either – wherever code is needed there is inevitably a maintenance headache, not to mention a long time to write and test it in the first place. This means increased costs to initially deploy a new technology like NoSQL.

Equally, on the output side, if you can’t rapidly visualise the insights you can glean from your data then you cannot fully realise the benefits of your investment in NoSQL database technology. Trying to code around the problem leads to longer project times, and the aforementioned increase in costs associated with custom coding.

Many NoSQL companies have tried to shoehorn SQL support into their products in an effort to bridge the gap between traditional BI vendors and their products. This has only been partially successful. BI is very much fixed-schema in the final 10 yards of creating a visualisation. These SQL layers impose fixed-schema constraints and strip away some of the great flexibility and built-in features these NoSQL databases provide. Thus customers who try this don’t fully realise the benefits the NoSQL databases can offer, decreasing return on investment.

For all these reasons, custom coding for the Input, Processing and Output of data held in NoSQL databases greatly increases the barrier to entry for customers using NoSQL, and limits the NoSQL market’s growth.

What is needed is better tooling around these NoSQL databases.

What tooling is available generally, right now?

Tools with user interfaces that enable non-developers to interact with data held in a variety of systems and visually build data processing pipelines reduce the barrier to entry for new technology. In the traditional relational database (RDBMS) space, ETL tools – Extract, Transform and Load – performed this function.

Of course, the historic problem is that your ETL process is fixed-schema upon creation. You effectively hard-code the set of fields in your ETL process when you design it. If the underlying structure changes then, best case, the new data is ignored; worst case, your ETL job breaks.

In the NoSQL world where your data structures are many and varied, and change often, fixed-schema ETL constrains you too much in what you can do.

But NoSQL could still benefit from similar tooling: tooling that enables non-developers to read data from a variety of systems, cleanse it, discover information about it, merge it with other sources of data, perform statistical analysis, machine learning, and other advanced operations on it, and then store the enriched data and new insights into a target database – normally either a NoSQL database or an in-memory store for rapid reporting.

Such tools would be very useful to customers that are looking at adopting NoSQL.

A Flexible-schema ETL tool

I like cake. I want to have my cake and eat it. I love NoSQL, and I love putting easy to use tools in front of people so they can get rapid business benefit from their technology investments.

I want schema-free ETL that works with NoSQL.

“Adam, you crazy!” I hear you say, “ETL ain’t ever gonna be that flexible, it’ll never help us in NoSQL!”…

…well, guess what…

Turns out there is a way to perform schema-free ETL with support for hundreds of data sources and sinks, machine learning, and feeding the data to a visual business analytics/BI dashboarding layer. Oh, and it’s got an open source core.

This particular trick is held within two features of the Pentaho Platform.

“Ah! Hang on, you work for them! You shameless corporate sell out!!!”. Yeah, well, I do work for Hitachi Vantara, owners and purveyors of the Enterprise Edition of the Pentaho Platform. That’s true.

I don’t work anywhere though where I don’t believe in the product. I wouldn’t be mentioning Pentaho Data Integration – the open source ETL tool – if I didn’t know for sure it could help with the NoSQL flexible schema tooling problem. So hear me out.

On the face of it, Pentaho Data Integration looks like every other fixed-schema ETL tool you’ve ever seen. If you drag an input step onto the canvas and point it at a data source, the fields you see in the data stream are those found in the data source – and they are fixed for the rest of that ‘transformation’ (or flow).

Pentaho Data Integration (PDI) Metadata Injection

Pentaho Data Integration, though, has a unique feature called Metadata Injection. This enables a parent transformation to dynamically set the configuration of steps in a child transformation. It’s used in places where you have many slightly different transformations.

A good use case for Metadata Injection is where you’re reading one data source – say a relational database – and sending that structure of data into a target system – say a NoSQL database. I may develop one transformation to read my sales table and load it into sales JSON documents, another for customer details, another for in-flight shopping baskets, and so on.

Creating 500 of these for 500 source tables would suck, though. Man oh man, the time it would take!!! This is the problem with most other ETL tools.

All these transformations would look suspiciously similar. They’d have maybe ten steps that load data, set a few temporary variables (like the JSON collection name, and maybe some constant or calculated fields in the target JSON structure), and then load the data into a particular collection.

500 transformations multiplied by 10 steps – 5,000 steps to manually configure… Ouch! Carpal tunnel, here I come…

Where Metadata Injection comes in is that you can create a single transformation to do this load, but parameterise it from a parent transformation. You may even configure this parent in a single job and loop over a list of input data sources to perform the work.

Thus you now create just two transformations – one with 10 steps, and a parent with another 10-ish steps that loops over the table set and calls the child transformation using Metadata Injection.

Two transformations totalling just 20 steps. My mouse-using hands can handle that easily!

So using an ETL tool flexible enough to load disparate structures into NoSQL now becomes possible, and even low-cost, thanks to Pentaho Data Integration’s Metadata Injection support.
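If you’re curious what that parent/child pattern boils down to under the covers, here’s a minimal sketch using the Kettle Java API (the embeddable engine underneath PDI). In PDI itself you’d build the parent visually with the ETL Metadata Injection step rather than write code; the .ktr file name, parameter names, and table-to-collection list below are hypothetical, so treat the whole thing as illustrative only.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

/**
 * Conceptual sketch of the "parent loops, child loads" pattern.
 * In PDI proper the parent is itself a transformation using the
 * ETL Metadata Injection step; this just shows the same loop in code.
 */
public class ParentLoader {
  public static void main(String[] args) throws KettleException {
    KettleEnvironment.init();

    // Hypothetical mapping of source tables to target JSON collections.
    String[][] mappings = {
        {"sales", "sales_docs"},
        {"customers", "customer_docs"},
        {"baskets", "basket_docs"}
    };

    // One reusable child transformation, parameterised per run.
    TransMeta childMeta = new TransMeta("load_table_to_collection.ktr");

    for (String[] m : mappings) {
      Trans child = new Trans(childMeta);
      child.setParameterValue("SOURCE_TABLE", m[0]);       // hypothetical parameter
      child.setParameterValue("TARGET_COLLECTION", m[1]);  // hypothetical parameter
      child.activateParameters();
      child.execute(null);         // run the child transformation
      child.waitUntilFinished();
      if (child.getErrors() > 0) {
        throw new KettleException("Load failed for table " + m[0]);
      }
    }
  }
}
```

The point isn’t the code itself – it’s that the per-table configuration lives in one list (or, more realistically, a metadata table) while the loading logic lives exactly once, in the child transformation.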

PDI for data discovery and semantic relationship discovery

But what about loading a data lake in Hadoop or NoSQL that contains greatly varying structures?

Well, Pentaho Data Integration can load that data too. You can load the JSON data (for example – XML is supported as well) and parse it in Pentaho. The JSON Input step supports Metadata Injection, so you can sample the data (even just a single record) and then invoke a Metadata Injection transformation to process data with a varying schema.
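To make “sample a record, then drive the child transformation” a bit more concrete, here’s a hedged sketch that reads one sample JSON document with the Jackson library and derives the field names, JSONPaths, and rough types that a parent transformation could inject into a JSON Input step. The type mapping here is my own simplification for illustration, not what Pentaho does internally.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Derive simple field metadata (JSONPath -> rough type) from one sample JSON record. */
public class JsonSampleProfiler {

  public static Map<String, String> profile(String sampleJson) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode root = mapper.readTree(sampleJson);

    Map<String, String> fields = new LinkedHashMap<>();
    Iterator<Map.Entry<String, JsonNode>> it = root.fields();
    while (it.hasNext()) {
      Map.Entry<String, JsonNode> e = it.next();
      JsonNode v = e.getValue();
      String type = v.isNumber() ? "Number"
                  : v.isBoolean() ? "Boolean"
                  : "String";              // everything else treated as String here
      fields.put("$." + e.getKey(), type); // JSONPath a parent could inject into a JSON Input step
    }
    return fields;
  }

  public static void main(String[] args) throws Exception {
    String sample = "{\"customer\":\"ACME\",\"total\":129.50,\"priority\":true}";
    profile(sample).forEach((path, type) -> System.out.println(path + " -> " + type));
  }
}
```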

You can even do some crazy stuff…

A month or so ago I developed a custom step with colleagues in the data science team that does some great magic. It analyses all the data in a transformation’s stream and outputs summary statistics about it.

Big whoop. Who cares? … Well…

What the step does is determine the type of each piece of data (irrespective of the data type in the source system), and whether each field is categorical or continuous. It counts the number of unique and null (or not present) values, and for continuous fields calculates the min, max, median and mean, as well as skewness and dispersion.

In short, we determine the make-up of every field and every piece of data in the source system. We then store that metadata to later drive our ETL processes with Metadata Injection.
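The actual step isn’t something I can paste here, but a toy version of the profiling idea looks roughly like this: try to parse each value as a number, classify the field as continuous or categorical, then compute the counts and moments described above. The classification rule and skewness formula below are illustrative only, not the data science team’s implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy field profiler: categorical vs continuous, counts and basic moments. */
public class FieldProfiler {

  public static void profile(String fieldName, List<String> rawValues) {
    Set<String> distinct = new HashSet<>();
    List<Double> numbers = new ArrayList<>();
    int nulls = 0;

    for (String v : rawValues) {
      if (v == null || v.isEmpty()) { nulls++; continue; }
      distinct.add(v);
      try { numbers.add(Double.parseDouble(v)); } catch (NumberFormatException ignored) { }
    }

    int nonNull = rawValues.size() - nulls;
    boolean continuous = nonNull > 0 && numbers.size() == nonNull; // all non-null values numeric

    System.out.printf("%s: distinct=%d nulls=%d type=%s%n",
        fieldName, distinct.size(), nulls, continuous ? "continuous" : "categorical");

    if (continuous) {
      double[] x = numbers.stream().mapToDouble(Double::doubleValue).sorted().toArray();
      double min = x[0], max = x[x.length - 1];
      double mean = Arrays.stream(x).average().orElse(Double.NaN);
      double median = x.length % 2 == 1 ? x[x.length / 2]
                                        : (x[x.length / 2 - 1] + x[x.length / 2]) / 2.0;
      double variance = Arrays.stream(x).map(d -> (d - mean) * (d - mean)).average().orElse(0);
      double sd = Math.sqrt(variance);
      // Skewness as the third standardised moment – purely illustrative.
      double skew = sd == 0 ? 0
          : Arrays.stream(x).map(d -> Math.pow((d - mean) / sd, 3)).average().orElse(0);
      System.out.printf("  min=%.2f max=%.2f mean=%.2f median=%.2f sd=%.2f skew=%.2f%n",
          min, max, mean, median, sd, skew);
    }
  }

  public static void main(String[] args) {
    profile("order_total", Arrays.asList("10.5", "20.0", "30.5", null, "100.0"));
    profile("country", Arrays.asList("UK", "US", "UK", "DE", null));
  }
}
```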

Where this becomes relevant in the NoSQL world is when loading massive amounts of data from a variety of sources and determining – through data science, rather than manual configuration – how data entities link to each other across systems.

Using this approach, combined with Metadata Injection, allows Pentaho transformations to load multiple sources and suggest to an integration developer the entities that exist within an organisation’s data, and the relationships between those entities.

Basically you can use Pentaho to discover the semantic linkages between an entire organisation’s data.

You can then use this information to dynamically configure your target system, use Metadata Injection to load and blend that data, and set up the relationships, semantic relationship models, and other metadata in the target – perhaps NoSQL – database.

This is particularly useful if you have thousands of source record types and don’t want to have to manually configure these meta models in your NoSQL database (be it a document store, or hybrid document-graph/triple store).

We ran this over existing demonstration sales data and were amazed at how useful the semantic graph was after discovery. All the main entities just popped out on the screen, with the discovered relationships, data types, and strength of correlation shown.

Basically using Pentaho Data Integration with NoSQL just saved your customers months in data discovery, modelling, and data loading development.

You’re welcome.

What about data processing?

Pentaho Data Integration also comes with a myriad of data science plug-ins, statistical functions, and third-party plug-ins on the Pentaho Marketplace. This means any crazy data manipulation, data engineering, feature creation, statistical modelling, or machine learning you need to do, you can orchestrate with Pentaho.

Pentaho becomes a central place to do this regardless of the underlying data store – so customers don’t have to rely on database vendors to embed these facilities, and NoSQL database companies don’t have to invest millions of engineering dollars to build them.

Even integrating machine learning in Spark, Python, or R becomes a simple case of dragging and dropping a single step onto a transformation.

Again, you’re welcome.

Visualising NoSQL-held data

“What about the visualisation piece!” I hear you cry, “You promised help with that too!!!”

Well fear ye not…

Another awesome feature of the Enterprise Pentaho platform – that is, Pentaho Data Integration combined with Pentaho Business Analytics – is the ability to expose Data Services.

A Data Service is configured in a Pentaho Data Integration (PDI) transformation. You right-click any step and say, “Hey, the data stream I have at this point – I want to expose it as a JDBC-compatible data source”.

It could be anything. A CSV file, a NoSQL set of records, anything. When it is exposed, the data set is given a name and you can connect to it from any JDBC-compatible BI tool (although we’d prefer you to buy licenses for Pentaho Business Analytics, of course! ;o) ).

This Data Service can be configured with several options. It can be cached and refreshed every X minutes to reduce load on the source system. It can also, crucially, ‘push down’ the WHERE clause passed over JDBC to the input step configured against the source system.

What the hell does that mean? Well…

Rather than load all customer sales from your NoSQL database and cache them in memory, only for PDI to filter them by customer number, we can instead ‘push down’ the customer number into the query we pass to the NoSQL database in the first place.

So the Data Service just became the equivalent of a simple function call with parameters, and only loads the data you absolutely need to answer the query passed to the data service. This executes much faster than a traditional SQL translation layer.
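From the consuming side, a Data Service just looks like a table over JDBC. Here’s a hedged sketch of querying one from plain Java; the thin driver class name, URL format, port, and credentials are as I remember them from the Pentaho documentation, so treat them as assumptions and check your own server’s settings. The service name customer_sales and its columns are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Query a PDI Data Service as if it were a table; PDI pushes the WHERE clause down. */
public class DataServiceClient {
  public static void main(String[] args) throws Exception {
    // Driver class and URL format as per my reading of the Pentaho docs – verify for your version.
    Class.forName("org.pentaho.di.trans.dataservice.jdbc.ThinDriver");
    String url = "jdbc:pdi://localhost:8080/kettle?webappname=pentaho-di"; // assumption

    String sql = "SELECT order_id, total FROM customer_sales "
               + "WHERE customer_number = 'C-1001'"; // this condition gets pushed down

    try (Connection conn = DriverManager.getConnection(url, "admin", "password");
         Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(sql)) {
      while (rs.next()) {
        System.out.println(rs.getString("order_id") + " -> " + rs.getDouble("total"));
      }
    }
  }
}
```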

Pentaho can do this for any data source that supports querying, search, or filtering. I’ve developed data services to do this for customers using MongoDB and MarkLogic Server, for example. We have a native MongoDB step, and I used MarkLogic’s REST API to push the query down to that NoSQL database. It was pretty easy.
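For the MarkLogic case, ‘push down’ simply meant translating the incoming filter into a call against MarkLogic’s REST search endpoint rather than filtering inside PDI. A rough Java sketch of that call is below; the port, credentials, and auth scheme are assumptions (a MarkLogic REST instance defaults to digest auth, so adjust the client accordingly), and the query string is only an example.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Push a filter down to MarkLogic by querying its REST search endpoint directly. */
public class MarkLogicPushDown {
  public static void main(String[] args) throws Exception {
    String query = "customer_number:C-1001";                      // example query only
    String url = "http://localhost:8000/v1/search?format=json&q="
        + URLEncoder.encode(query, StandardCharsets.UTF_8);

    // Assumes the REST app server accepts basic auth; the out-of-the-box default is digest.
    String auth = Base64.getEncoder()
        .encodeToString("admin:password".getBytes(StandardCharsets.UTF_8));

    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .header("Authorization", "Basic " + auth)
        .GET()
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());

    System.out.println(response.statusCode());
    System.out.println(response.body()); // JSON search results, ready to feed into the PDI stream
  }
}
```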

I then exposed that to our Pentaho Business Analytics dashboard for live querying and viewing of a few thousand records on my laptop. Worked like a charm and executed within seconds!

And the savings/benefits?

Once I figured out how to do it, it took 5 minutes to develop the transformation to load customer data into NoSQL using PDI, another 5 for the transformation with a data service, and another 5 to configure the dashboard.

So 15 minutes from data to insight. Easy!

Of course, developing many of these transformations using Metadata Injection and variable schemas will take a little longer than this simple example, but it’s a darn sight quicker than writing data loading code, not to mention cheaper once you factor in maintenance over time.

Our ETL models are basically XML files built and documented visually. Someone can open the transformation in two years’ time and make instant sense of it. The same can’t be said of most integration code bases!

In summary

In Pentaho Data Integration (PDI) the NoSQL community gets the ability to create schema-free and variable-schema data loading, data science, and integration transformations whilst avoiding creating a myriad of near-identical transformations, thus greatly reducing the implementation costs associated with NoSQL systems. These transformations can also be called over REST, if need be, for dynamic invocation.
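On that REST point: a transformation deployed to a Carte (or DI server) instance can be kicked off over plain HTTP. The sketch below assumes Carte’s executeTrans endpoint, its default credentials, an arbitrary port, and a hypothetical .ktr path – all of which you should verify against your own PDI version before relying on it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Trigger a transformation on a Carte server over REST (endpoint and port assumed). */
public class CarteTrigger {
  public static void main(String[] args) throws Exception {
    // Assumed Carte endpoint and parameters – check the Carte docs for your PDI version.
    String url = "http://localhost:8081/kettle/executeTrans/"
        + "?trans=/opt/etl/load_customers.ktr&level=Basic";

    String auth = Base64.getEncoder()
        .encodeToString("cluster:cluster".getBytes(StandardCharsets.UTF_8)); // Carte defaults

    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .header("Authorization", "Basic " + auth)
        .GET()
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println("Carte responded: " + response.statusCode());
  }
}
```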

The NoSQL community can also access dashboarding in their BI tool of choice (again, please consider Pentaho Business Analytics! ;o) ) through PDI Data Services over NoSQL data sources.

And this platform is available today, with an open source core. I’d advise you to download it and try it out.

Where to next?

Next, to make it even easier for people, I’m going to write a few PDI extensions to abstract the API complexity of the NoSQL databases away from ETL tool users. Rather than use a REST Client step to talk to MarkLogic Server, for example, I’ll create an easy-to-configure MarkLogic Input step. I’ll do this for a bunch of databases too – all the big players and the cloud players – and I’ll cover the most common NoSQL patterns in these plugins; search in MarkLogic Server, for example, will be included.

That will be my Christmas present to the NoSQL community this year! Ho ho hooooo!!! Merry Christmas!

 
