28 March 2017 User Stories

Processing Marine Environmental Observations with Logstash @ The Marine Institute

By Adam Leadbetter

The Marine Institute is the State agency responsible for marine research, technology development and innovation in Ireland. We carry out environmental, fisheries, and aquaculture surveys and monitoring programmes to meet Ireland's national and international legal requirements. We provide scientific and technical advice to Government to help inform policy and to support the sustainable development of Ireland's marine resource. We aim to safeguard Ireland's unique marine heritage through research and environmental monitoring. Our research, strategic funding programmes, and national marine research platforms support the development of Ireland's maritime economy.

I have been a professional data manager for over a decade, all of it curating marine environmental data, and in that time I have spent countless hours reformatting raw data outputs from field instruments into standard data formats. That practice is just about sustainable when data are collected in a discrete way, from research vessels or short- to medium-term deployments of moored devices in the ocean. However, a new model for converting these raw outputs was needed when we deployed a subsea observatory in Galway Bay, tethered to the shore by fibre-optic cable and returning measurements including temperature, salinity, and current velocity every second.


For us at the Marine Institute, that's where the Elastic Stack and, in particular, Logstash came in. Treating the raw data arriving from instruments such as conductivity-temperature-depth sensors or fluorometers as application logs, we pipe them through Logstash, using the Grok and Mutate filters to produce a structured, fully standardized data format.


An example Grok filter for a conductivity-temperature-depth sensor is:

Raw

2016-11-11T00:00:22.136Z|I-OCEAN7-304-XXXX|  27.55  11.930  35.666  30.855 1492.0759 00:07:48.89M

Grok Filter

IDRONAUT_OCEAN7_304 %{NUMBER:pressure:float}%{SPACE}%{NUMBER:temperature:float}%{SPACE}%{NUMBER:conductivity:float}%{SPACE}%{NUMBER:salinity:float}%{SPACE}%{NUMBER:sound_velocity:float}%{SPACE}%{TIME:raw_time}

This parses the raw instrument output into (for example):

{
"message" => "2016-11-11T00:00:22.136Z|I-OCEAN7-304-XXXX| 27.55  11.930  35.666  30.855 1492.0759 00:07:48.89M\r",
"timestamp" => "2016-11-11T00:00:22.136Z",
"instrument" => "I-OCEAN7-304-XXXX",
"pressure" => 27.55,
"temperature" => 11.93,
"conductivity" => 35.666,
"salinity" => 30.855,
"sound_velocity" => 1492.0759,
"raw_time" => "00:07:48.89"
}
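For readers without a Logstash instance to hand, the parsing step can be approximated in plain Python. This is only a sketch, not how Logstash works internally: the regular expressions are simplified stand-ins for Grok's NUMBER, SPACE and TIME patterns, and splitting the timestamp and instrument identifier on the `|` character is an assumption, since the pattern shown above only covers the measurement payload. The names `parse_line`, `PAYLOAD` and `FIELDS` are hypothetical.

```python
import re

# Simplified stand-ins for the Grok NUMBER and TIME patterns (assumptions,
# not the exact expressions that ship with Logstash).
NUMBER = r"[+-]?\d+(?:\.\d+)?"
TIME = r"\d{2}:\d{2}:\d{2}(?:\.\d+)?"

# Field order follows the Grok pattern in the article.
FIELDS = ["pressure", "temperature", "conductivity", "salinity", "sound_velocity"]

# Numbers separated by optional whitespace (Grok's SPACE), then the instrument time.
PAYLOAD = re.compile(
    r"\s*".join(f"(?P<{name}>{NUMBER})" for name in FIELDS)
    + rf"\s*(?P<raw_time>{TIME})"
)


def parse_line(line: str) -> dict:
    """Turn one raw instrument line into a flat record of typed fields."""
    # Assumption: the line is "timestamp|instrument|payload".
    timestamp, instrument, payload = line.rstrip("\r\n").split("|", 2)
    match = PAYLOAD.search(payload)
    if match is None:
        raise ValueError(f"unrecognised payload: {payload!r}")
    record = {"timestamp": timestamp, "instrument": instrument}
    for name in FIELDS:
        record[name] = float(match.group(name))  # mirrors the ":float" casts
    record["raw_time"] = match.group("raw_time")
    return record


RAW = (
    "2016-11-11T00:00:22.136Z|I-OCEAN7-304-XXXX|"
    "  27.55  11.930  35.666  30.855 1492.0759 00:07:48.89M"
)
record = parse_line(RAW)
```

Here `record` holds the same keys as the parsed example above, apart from `message`, which this sketch does not keep.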

Once the data are in this semi-standard JSON format, the Mutate filter can be applied to create a fully standardized output.

Filter

uuid {
  target => "id"
}

mutate {
  add_field => {
    "[featureOfInterest][href]" => "http://linked.marine.ie/feature/exampleURI"
    "[member][0][type]" => "Measurement"
    "[member][0][procedure][href]" => "http://vocab.nerc.ac.uk/collection/L22/current/TOOL0861/"
    "[member][0][observedProperty][href]" => "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/"
    "[member][0][result][uom]" => "http://vocab.nerc.ac.uk/collection/P06/current/UPAA/"
    "[member][1][type]" => "Measurement"
    "[member][1][procedure][href]" => "http://vocab.nerc.ac.uk/collection/L22/current/TOOL0861/"
    "[member][1][observedProperty][href]" => "http://vocab.nerc.ac.uk/collection/P01/current/PSALCU01/"
    "[member][1][result][uom]" => "http://vocab.nerc.ac.uk/collection/P06/current/UUUU/"
    "[member][2][type]" => "Measurement"
    "[member][2][procedure][href]" => "http://vocab.nerc.ac.uk/collection/L22/current/TOOL0861/"
    "[member][2][observedProperty][href]" => "http://vocab.nerc.ac.uk/collection/P07/current/CFSN0330/"
    "[member][2][result][uom]" => "http://vocab.nerc.ac.uk/collection/P06/current/UPDB/"
  }
  rename => {
    "timestamp" => "[phenomenonTime][instant]"
    "temperature" => "[member][0][result][value]"
    "salinity" => "[member][1][result][value]"
    "pressure" => "[member][2][result][value]"
  }
  remove_field => ["@timestamp", "@version", "message", "host", "raw_time", "instrument", "conductivity", "sound_velocity"]
}

mutate {
  add_field => {
    "[member][0][resultTime]" => "%{[phenomenonTime][instant]}"
    "[member][1][resultTime]" => "%{[phenomenonTime][instant]}"
    "[member][2][resultTime]" => "%{[phenomenonTime][instant]}"
  }
}

Output

{
"phenomenonTime" => {
"instant" => "2016-11-11T00:00:22.136Z"
},
"member" => {
"0" => {
"result" => {
"value" => 11.93,
"uom" => "http://vocab.nerc.ac.uk/collection/P06/current/UPAA/"
},
"type" => "Measurement",
"procedure" => {
"href" => "http://vocab.nerc.ac.uk/collection/L22/current/TOOL0861/"
},
"observedProperty" => {
"href" => "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/"
},
"resultTime" => "2016-11-11T00:00:22.136Z"
},
"1" => {
"result" => {
"value" => 30.855,
"uom" => "http://vocab.nerc.ac.uk/collection/P06/current/UUUU/"
},
"type" => "Measurement",
"procedure" => {
"href" => "http://vocab.nerc.ac.uk/collection/L22/current/TOOL0861/"
},
"observedProperty" => {
"href" => "http://vocab.nerc.ac.uk/collection/P01/current/PSALCU01/"
},
"resultTime" => "2016-11-11T00:00:22.136Z"
},
"2" => {
"result" => {
"value" => 27.55,
"uom" => "http://vocab.nerc.ac.uk/collection/P06/current/UPDB/"
},
"type" => "Measurement",
"procedure" => {
"href" => "http://vocab.nerc.ac.uk/collection/L22/current/TOOL0861/"
},
"observedProperty" => {
"href" => "http://vocab.nerc.ac.uk/collection/P07/current/CFSN0330/"
},
"resultTime" => "2016-11-11T00:00:22.136Z"
}
},
"id" => "9866037d-571d-4f7b-b69a-db9a577b6584",
"featureOfInterest" => {
"href" => "http://linked.marine.ie/feature/exampleURI"
}
}
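The same restructuring can be sketched outside Logstash, which may help when checking what the Mutate filter is expected to produce. The Python below is an illustration under stated assumptions, not our production code: the function `to_observation` and the `MEMBERS` table are hypothetical names, the vocabulary URIs are copied from the filter above, and the members are keyed "0", "1" and "2" only to mirror the Logstash output shown.

```python
import json
import uuid

NERC = "http://vocab.nerc.ac.uk/collection"
PROCEDURE = f"{NERC}/L22/current/TOOL0861/"

# (source field, observed property URI, unit of measure URI), in member order.
MEMBERS = [
    ("temperature", f"{NERC}/P01/current/TEMPPR01/", f"{NERC}/P06/current/UPAA/"),
    ("salinity", f"{NERC}/P01/current/PSALCU01/", f"{NERC}/P06/current/UUUU/"),
    ("pressure", f"{NERC}/P07/current/CFSN0330/", f"{NERC}/P06/current/UPDB/"),
]


def to_observation(record: dict, feature_uri: str) -> dict:
    """Build the nested observation from a flat parsed record."""
    instant = record["timestamp"]
    members = {}
    for index, (field, observed_property, uom) in enumerate(MEMBERS):
        members[str(index)] = {
            "type": "Measurement",
            "procedure": {"href": PROCEDURE},
            "observedProperty": {"href": observed_property},
            "result": {"value": record[field], "uom": uom},
            "resultTime": instant,
        }
    # Fields that are not listed in MEMBERS (conductivity, sound_velocity,
    # raw_time, instrument) are simply not copied, as with remove_field.
    return {
        "id": str(uuid.uuid4()),
        "phenomenonTime": {"instant": instant},
        "featureOfInterest": {"href": feature_uri},
        "member": members,
    }


flat = {
    "timestamp": "2016-11-11T00:00:22.136Z",
    "instrument": "I-OCEAN7-304-XXXX",
    "pressure": 27.55,
    "temperature": 11.93,
    "conductivity": 35.666,
    "salinity": 30.855,
    "sound_velocity": 1492.0759,
    "raw_time": "00:07:48.89",
}
observation = to_observation(flat, "http://linked.marine.ie/feature/exampleURI")
encoded = json.dumps(observation, indent=2)
```

Keeping the vocabulary URIs in a single table means a new measured parameter is one extra row rather than four extra `add_field` entries, which is one possible design choice rather than a requirement.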

The Logstash output is pushed to an Apache Kafka message queue, and from there is stored in an Apache Cassandra database, which is made available to the general public at http://erddap.marine.ie. In the geospatial data world, the main standards body is the Open Geospatial Consortium (OGC), which also works with the World Wide Web Consortium to create best practices for publishing geospatial data on the web. One of the OGC standards, Observations & Measurements, is targeted at observations and has recently been translated from XML to JSON [1]; it is this JSON encoding that we target from the Logstash Mutate filter.


One of the ideas that has been kicking around in the marine environmental data management community for the last few years is "Born Connected." This takes the Born Digital idea of data collected by computer rather than on a physical record and extends it into the Semantic Web and Linked Data worlds: data arrive from the instrument with web addresses for parameter definitions, unit definitions, and other documents built right into the data files. The Born Connected approach will allow for the better integration of observed data directly into weather forecast models. It also allows users to more easily discover data relevant to their needs, and gives researchers the ability to automatically create reports from data. However, making this scale has been an issue, and how to collect together the patterns used to give birth to the connected data hasn't been addressed.

We have started a GitHub repository for our Grok patterns [2] so that they can be reused by any other data-collecting organisation to parse their raw instrument data using Logstash. Similarly, GitHub pull requests will allow others to contribute to the register of Grok patterns, so that a comprehensive archive can be created.
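A shared pattern register is easiest to reuse when its contents can be inspected programmatically. As a rough sketch, assuming a patterns file holds one `NAME pattern` entry per line with `#` comment lines (the usual Grok patterns file layout), the Python below loads such a file into a dictionary and lists the fields each pattern captures. The helper names `load_patterns` and `captured_fields` are hypothetical, and the sample text reuses the instrument pattern shown earlier rather than the actual contents of the repository.

```python
import re

SAMPLE = """\
# Idronaut Ocean Seven 304 CTD
IDRONAUT_OCEAN7_304 %{NUMBER:pressure:float}%{SPACE}%{NUMBER:temperature:float}%{SPACE}%{NUMBER:conductivity:float}%{SPACE}%{NUMBER:salinity:float}%{SPACE}%{NUMBER:sound_velocity:float}%{SPACE}%{TIME:raw_time}

"""

# Matches %{SYNTAX:field} or %{SYNTAX:field:type} and captures the field name;
# references without a field name, such as %{SPACE}, are skipped.
FIELD_REFERENCE = re.compile(r"%\{\w+:(\w+)(?::\w+)?\}")


def load_patterns(text: str) -> dict:
    """Map each pattern name to its Grok expression."""
    patterns = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # name, then the rest of the line
        if len(parts) == 2:
            patterns[parts[0]] = parts[1]
    return patterns


def captured_fields(pattern: str) -> list:
    """List the field names a Grok expression would populate, in order."""
    return FIELD_REFERENCE.findall(pattern)


patterns = load_patterns(SAMPLE)
fields = captured_fields(patterns["IDRONAUT_OCEAN7_304"])
```

A listing like `fields` could, for instance, be used to check that a contributed pattern supplies every field that a downstream Mutate filter expects to rename.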

The observatory in Galway Bay is an important contribution by Ireland to the growing global network of real-time data capture systems deployed within the ocean, and technologies like Logstash from the Elastic Stack help give us new insights into the ocean which we have not had before. We have been able to construct a unique time series of data for Ireland, with a year of high-frequency measurements at a single location. Over time this will grow into a longer time series and an important environmental data series for monitoring changes in water temperature, salinity, and currents. We have partnered with the INSIGHT Centre for Data Analytics at the National University of Ireland, Galway in a European Union project (OpenGovIntelligence) to work with end users such as search and rescue teams, sailors, and renewable energy developers to make the data available to them in the ways which suit their specific needs.

There's more technical detail in the paper we presented to the December 2016 IEEE Big Data Conference on this topic.

We still have some work to do in processing different data types in this way, including 2-D data (such as profiles of current velocity through the ocean) and binary raw data feeds, but we're looking forward to tackling those challenges.


Adam Leadbetter, Data Management Team Lead, Marine Scientist. Adam holds a BSc in Oceanography with Chemistry and a PhD in Coastal Oceanography. Over the last decade he has worked as a marine data manager with both the National Oceanography Centre (UK) and the Marine Institute (Ireland). Adam is particularly interested in applications of Semantic Web technology to marine science data and has authored a number of papers on this topic. When he's not looking out to sea, Adam can usually be found running very long distances, preferably up and down mountains.

---

[1] https://portal.opengeospatial.org/files/?artifact_id=64910

[2] https://github.com/IrishMarineInstitute/grok-raw-inst/tree/master/patterns