Incapture Technologies

Utilizing Java Watch Service

Published:
November 26, 2016
Author:

At Incapture we often implement data ingestion workflows for clients, typically as part of a larger re-engineering effort. Frequently, this involves waiting for file-based data to arrive from another system or vendor and then loading it. This is where Java’s Watch Service comes into play. I was recently reading about Watch Service, which is part of the java.nio.file package, and thought it could help us with client engagements.

Watch Service lets you register the directories you want to monitor and choose which types of events you want to be notified about. The event types are create, modify and delete; more details here.
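As a minimal sketch of the idea, the snippet below registers a directory with a Watch Service and collects one batch of events. The class name DirectoryWatchSketch, the pollOnce helper and the temporary directory are assumptions made for illustration; only the java.nio.file calls are standard.

```java
import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_MODIFY;
import static java.nio.file.StandardWatchEventKinds.OVERFLOW;

import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Illustrative only: class and method names are hypothetical.
public class DirectoryWatchSketch {

    /**
     * Waits up to timeoutSeconds for the next batch of events and returns
     * them as strings such as "ENTRY_CREATE: data.csv".
     * Returns an empty list if nothing happened within the timeout.
     */
    public static List<String> pollOnce(WatchService watcher, long timeoutSeconds)
            throws InterruptedException {
        List<String> seen = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
        if (key == null) {
            return seen;
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            WatchEvent.Kind<?> kind = event.kind();
            if (kind == OVERFLOW) {
                continue; // events may have been lost; nothing to report
            }
            // For directory events the context is the path relative to the watched directory
            seen.add(kind.name() + ": " + event.context());
        }
        key.reset(); // re-arm the key so later events are still delivered
        return seen;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("watch-sketch");
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            dir.register(watcher, ENTRY_CREATE, ENTRY_MODIFY, ENTRY_DELETE);
            Files.createFile(dir.resolve("SamplePriceData.xlsx"));
            System.out.println(pollOnce(watcher, 30));
        }
    }
}
```

A long-running server would more likely loop on watcher.take(), which blocks until a key is signalled; the timed poll is used here only so the sketch terminates.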

We have released ‘WatchServer’ as part of our open source platform. The server provides a file system monitoring capability that maps file system events to Rapture actions in a repeatable and configurable fashion.
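To make the idea of mapping events to actions concrete, here is a rough sketch in plain Java. It is not WatchServer’s actual code or configuration format: the EventActionMapSketch class and its on and dispatch methods are hypothetical, and the "action" is just a callback standing in for whatever would start a Rapture action.

```java
import java.nio.file.Path;
import java.nio.file.WatchEvent;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical illustration of an event-to-action table; not the WatchServer implementation.
public class EventActionMapSketch {

    private final Map<WatchEvent.Kind<?>, Consumer<Path>> actions = new HashMap<>();

    /** Configures the action to run when an event of the given kind occurs. */
    public void on(WatchEvent.Kind<?> kind, Consumer<Path> action) {
        actions.put(kind, action);
    }

    /**
     * Runs the action configured for the event kind, passing the affected file.
     * Returns false when no action is configured for that kind.
     */
    public boolean dispatch(WatchEvent.Kind<?> kind, Path file) {
        Consumer<Path> action = actions.get(kind);
        if (action == null) {
            return false;
        }
        action.accept(file);
        return true;
    }
}
```

Keeping the mapping in a table rather than in if/else branches is what makes the behaviour configurable: the same dispatch code serves any combination of event kinds and actions.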

Typically the action is a Workflow. As a reminder, ‘Workflows’ in Rapture:

  • Are constructs that define a set of tasks (or steps) that need to be performed in some order
  • Contain steps that can be implemented in various languages (Reflex, Java, Python, etc.)
  • Contain state that can be updated by each step
  • Manage step switching and execution via an internal pipeline
  • Can be initiated using the Workflow API or attached to an event
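The points above can be pictured with a small, self-contained Java sketch. This is not the Rapture Workflow API: WorkflowSketch, Step and run are hypothetical names, and the sketch only shows the general shape of named steps that update shared state and tell a simple pipeline which step runs next.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Conceptual illustration only; Rapture's real workflow engine is not shown here.
public class WorkflowSketch {

    /** A step updates the shared state and returns the next step's name, or null to finish. */
    public interface Step extends Function<Map<String, String>, String> {}

    private final Map<String, Step> steps = new LinkedHashMap<>();

    /** Adds a named step and returns this workflow so definitions can be chained. */
    public WorkflowSketch step(String name, Step step) {
        steps.put(name, step);
        return this;
    }

    /** Runs steps starting at 'start' until a step returns null; returns the final state. */
    public Map<String, String> run(String start, Map<String, String> initialState) {
        Map<String, String> state = new HashMap<>(initialState);
        String current = start;
        while (current != null) {
            Step step = steps.get(current);
            if (step == null) {
                throw new IllegalStateException("Unknown step: " + current);
            }
            current = step.apply(state); // the return value selects the next step
        }
        return state;
    }
}
```

For example, an "archive" step could record where a file was stored and return "extract", and the "extract" step could read that location from the state before returning null to end the run.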

There are many use cases we could support with this architecture plus Rapture platform capabilities, some of which are:

  • Loading CSV file(s) to create time series accessible via Rapture’s Series API
  • Loading PDF file(s), indexing them and making them searchable via Rapture’s Search API
  • Loading XML file(s) and transforming them into (JSON) documents accessible via Rapture’s Document API

To illustrate, I’ve developed a workflow that loads a SamplePriceData.xlsx file, extracts the data from each row and creates a (JSON) document for that row in a Rapture document repository.

The WatchServer detects ENTRY_CREATE events and runs the workflow, which performs two steps:

  1. Loads a file from /opt/test and stores it in a Rapture blob repository at blob://archive/yyyyMMdd_HHmmss/SamplePriceData.xlsx
  2. Creates a Rapture document repository containing one document for each row in the spreadsheet, at document://data/yyyyMMdd_HHmmss/ROW000001..N. This uses Apache POI, a Java API for Microsoft documents, to extract data from the spreadsheet.
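The naming scheme in the two steps above can be sketched as follows. The LoadNamingSketch class and its methods are hypothetical helpers written for this post, not part of the workflow’s published code; only the URI layout and the yyyyMMdd_HHmmss pattern come from the steps themselves, and zero-padding row numbers to six digits is inferred from ROW000001.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that builds the blob and document URIs described above.
public class LoadNamingSketch {

    // Same timestamp layout as in the URIs: yyyyMMdd_HHmmss
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss");

    /** URI under which the original file is archived as a blob. */
    public static String archiveUri(LocalDateTime loadedAt, String fileName) {
        return "blob://archive/" + STAMP.format(loadedAt) + "/" + fileName;
    }

    /** URI of the document for a spreadsheet row; rowNumber starts at 1. */
    public static String rowDocumentUri(LocalDateTime loadedAt, int rowNumber) {
        // %06d zero-pads to six digits (assumed from the ROW000001 example)
        return String.format("document://data/%s/ROW%06d", STAMP.format(loadedAt), rowNumber);
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.now();
        System.out.println(archiveUri(now, "SamplePriceData.xlsx"));
        System.out.println(rowDocumentUri(now, 1));
    }
}
```

Using one timestamp for both the blob and the row documents keeps each load’s archive and data side by side, so a given run can be traced from the documents back to the file they came from.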

It is straightforward to set up and run locally by following the process set out in the README.md, using images from Incapture’s public Docker Hub account. Make sure to install Docker on your local system first! I use Docker for Mac.

After the workflow has run once, you can view the results in the default Rapture system UI at http://localhost:8000.

The archived xlsx file, saved as a blob:

[Screenshot: archive repository]

and the documents subsequently created in the document://data repository:

[Screenshot: document://data repository]

Using WatchServer in conjunction with Workflows gives you a flexible but defined approach to implementing your domain-specific data loading processes, plus the benefits of the built-in operational support Rapture provides.

If you’d like more information about Incapture or Rapture, please email me at jonathan.major@incapturetechnologies.com or write to our general email address, info@incapturetechnologies.com, and we will get back to you for a more in-depth discussion.

