23 February 2017 User Stories

Providing extinction risk assessments for biodiversity with Elasticsearch

By Eduardo Dalcin and Diogo Silva

This project was shortlisted for this year’s inaugural Elastic Cause Awards, which recognize organizations using the Elastic Stack to do great good in the world.

The National Center for Plant Conservation (CNCFlora) is an official agency of the Ministry of the Environment in Brazil, hosted at the Rio de Janeiro Botanical Garden. It is dedicated to assessing the extinction risk of Brazilian flora and coordinating conservation efforts.

Brazil currently lists 46,113 species of flora, but despite CNCFlora's continuous and dedicated effort since 2010, only about 11% of them have been evaluated for extinction risk.

Extinction risk assessment is the first step in the conservation of a species. It should provide a scientific and objective estimate of the likelihood of the species becoming extinct within a given period if the circumstances in which it is found persist.

Assessing the extinction risk of all known species is a global challenge agreed on by the signatories to the Convention on Biological Diversity (CBD) through Target 2 of the Global Strategy for Plant Conservation (GSPC).

A large knowledge gap remains about the conservation status of our flora, and the target is to evaluate it completely by 2020. Together, these make it essential to use technologies that allow rapid assessment of extinction risk and support trained professionals in deciding on the final categorization of each species.

A little help from technology

The risk assessment is a methodological process defined by the International Union for Conservation of Nature (IUCN) and adopted by many countries, including Brazil. In this methodology, a set of data and information about the species is compiled and scrutinized by analysts, together with information about the occurrence of, and potential threats to, the remaining populations.

Part of this methodology uses "occurrence records", which come mainly from herbarium collections: specimens of dried plants mounted on a sheet of cardboard, each accompanied by a label with the species name and additional data about that sample. These data represent the occurrence of a biological specimen in time and space and are the primary source of data for studies on biodiversity and conservation.

Thus, the methodology uses the latitude and longitude of all the occurrence records of a species to calculate its Area of Occupancy, its Extent of Occurrence and its subpopulations.
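To make these two measures concrete, here is a minimal sketch in Python, not the RRAPP implementation. It assumes the common IUCN conventions of a 2 km grid for Area of Occupancy and a minimum convex polygon for Extent of Occurrence, and it uses a simple equirectangular projection that is only reasonable for small ranges. All function names are hypothetical, and the subpopulation calculation is left out.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the rough projection below


def project(lat, lon, ref_lat):
    """Project degrees to planar km (equirectangular; an approximation)."""
    x = math.radians(lon) * EARTH_RADIUS_KM * math.cos(math.radians(ref_lat))
    y = math.radians(lat) * EARTH_RADIUS_KM
    return x, y


def area_of_occupancy(points_km, cell_km=2.0):
    """Count occupied grid cells and multiply by the cell area (km^2)."""
    cells = {(math.floor(x / cell_km), math.floor(y / cell_km))
             for x, y in points_km}
    return len(cells) * cell_km * cell_km


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def extent_of_occurrence(points_km):
    """Area (km^2) of the convex hull, via the shoelace formula."""
    hull = convex_hull(points_km)
    if len(hull) < 3:
        return 0.0  # fewer than three distinct points enclose no area
    twice_area = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0
```

Occurrence records would first be passed through `project` (for example with the mean latitude of the records as `ref_lat`) before calling the two area functions. A production tool would more likely use an equal-area projection from a GIS library.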

To help assess the extinction risk of all the Brazilian flora, we developed a tool called the "Rapid Risk Assessment Application", or RRAPP (details here: https://goo.gl/DrQiIB). It performs these spatial calculations and, based on specific criteria, categorizes the extinction risk of each species of the Brazilian flora.
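The article does not spell out which criteria RRAPP applies, so the following is only an illustrative sketch. It screens a species against the area thresholds of IUCN criterion B (Extent of Occurrence under B1, Area of Occupancy under B2). A full criterion B assessment also requires sub-conditions such as fragmentation or continuing decline, which are not modeled here. The function names are hypothetical.

```python
# IUCN criterion B area thresholds in km^2, from most to least severe.
# CR = Critically Endangered, EN = Endangered, VU = Vulnerable.
EOO_THRESHOLDS = [(100, "CR"), (5000, "EN"), (20000, "VU")]
AOO_THRESHOLDS = [(10, "CR"), (500, "EN"), (2000, "VU")]

# Ordering used to pick the more severe of two candidate categories.
SEVERITY = {None: 0, "VU": 1, "EN": 2, "CR": 3}


def category_from_area(value_km2, thresholds):
    """Return the first category whose upper limit the value falls under."""
    for limit, category in thresholds:
        if value_km2 < limit:
            return category
    return None  # above every threshold: not flagged by this measure


def preliminary_category(eoo_km2, aoo_km2):
    """Area-only screening; an analyst still makes the final decision."""
    by_eoo = category_from_area(eoo_km2, EOO_THRESHOLDS)
    by_aoo = category_from_area(aoo_km2, AOO_THRESHOLDS)
    return max(by_eoo, by_aoo, key=SEVERITY.get)
```

A result of `None` means only that neither area measure falls under a threshold, not that the species is safe.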

Indexing on Elasticsearch and exploring on Kibana

Due to resource constraints on infrastructure and personnel, a tool that is simple and operates efficiently was essential. Therefore, Elasticsearch was the best choice.

The taxonomic information (the names of the species), the spatial occurrence data and the analysis results are all indexed in Elasticsearch. A combination of queries on taxonomy (names) and occurrences (points) provides the data for the assessment, which saves its result (the analysis) back into Elasticsearch.
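As a rough illustration of that query-then-save cycle, the sketch below builds the JSON bodies as plain Python dictionaries. The field names (`scientificName`, `location`, `eoo_km2` and so on) are assumptions made for this example, not the project's actual schema. `location` is assumed to be mapped as a `geo_point` stored in its object form with `lat` and `lon` keys.

```python
def occurrences_query(scientific_name, size=1000):
    """Body for a _search request on a (hypothetical) occurrence index."""
    return {
        "size": size,
        # term query: exact match on a keyword-mapped name field
        "query": {"term": {"scientificName": scientific_name}},
    }


def points_from_response(response):
    """Extract (lat, lon) pairs from a standard search response."""
    points = []
    for hit in response["hits"]["hits"]:
        location = hit["_source"].get("location")
        if location is not None:
            points.append((location["lat"], location["lon"]))
    return points


def analysis_document(scientific_name, eoo_km2, aoo_km2, category):
    """Document to index back into a (hypothetical) analysis index."""
    return {
        "scientificName": scientific_name,
        "eoo_km2": eoo_km2,
        "aoo_km2": aoo_km2,
        "category": category,
    }
```

These bodies could be sent with any HTTP client or an Elasticsearch client library. For a species with more records than `size`, a real implementation would need to page through the results.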

Aggregations are used to extract reports and provide overviews and statistics on demand, for the whole dataset or parts of it, while keeping overhead minimal.
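For instance, a report of how many species fall into each risk category could be produced with a `terms` aggregation. This is a sketch under assumptions: the `category` field name and the `by_category` aggregation name are hypothetical, and the project's actual reports are not described in detail.

```python
def category_report_query():
    """Search body that returns only aggregation buckets, no hits."""
    return {
        "size": 0,  # skip the documents themselves, keep only the counts
        "aggs": {
            "by_category": {"terms": {"field": "category"}},
        },
    }


def category_counts(response):
    """Turn the aggregation buckets of a response into a plain dict."""
    buckets = response["aggregations"]["by_category"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```

Setting `size` to 0 keeps the response small, which fits the goal of minimal overhead on a single node.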

Our dataset is big and contains diverse data that do not always fit nicely into a standard vocabulary, so we also need to iterate rapidly on the index-explore cycle, and Kibana was the natural choice for this. With Kibana we can get a quick overview of the data as it is analyzed, make changes, and identify biases and problems to act upon. Here we can view the general results of the analyses while they run.


We can also preview the distribution of occurrence points and see that there is a huge concentration at the center of the map (latitude 0, longitude 0) due to data quality problems.
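Records like these can be set aside before any spatial calculation. The sketch below shows one simple way to do so. It is an assumption about how such cleaning might look, not a description of what the project does, and the function names are hypothetical.

```python
def is_valid_coordinate(lat, lon):
    """True when both values are within the legal degree ranges."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def is_zero_zero(lat, lon, tolerance=1e-6):
    """True for points at (0, 0), which usually mean missing coordinates."""
    return abs(lat) <= tolerance and abs(lon) <= tolerance


def split_suspect_points(points):
    """Separate usable (lat, lon) pairs from ones needing review."""
    usable, suspect = [], []
    for lat, lon in points:
        if not is_valid_coordinate(lat, lon) or is_zero_zero(lat, lon):
            suspect.append((lat, lon))
        else:
            usable.append((lat, lon))
    return usable, suspect
```

Keeping the suspect records in a separate list, rather than dropping them, leaves them available for later correction.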


Another important point for us is that all of this can be performed daily on limited infrastructure: a single node with 4 cores and 8 GB of RAM runs both the bots and the Elastic Stack to index and analyze circa 100,000 species names and 15,000,000 occurrence records and to provide the several aggregations and reports needed.


Scaling the future

For this first version of the tool we focused on assessing the whole Brazilian flora, but our aim is to do the same for any country that wants such information about its flora.

With this global goal in mind, we are confident that the scalability of Elastic products will let the project grow to the needed size, since the global dataset is currently about a hundred times the size of our current data.

Eduardo Dalcin is the head of the Rio de Janeiro Botanical Garden's Scientific Computing and Geoprocessing Research Department, which is responsible for managing some of the institute's informatics initiatives, such as the National Centre of Flora Conservation (CNCFlora) and other data and integration projects.

Diogo Silva has 11 years of software development experience focused on backend and APIs, and has spent the last 6 years working at CNCFlora on this project.