The “Biodata Services Stack” (BSS) is a development to support the seamless exchange of biodata between multiple data providers and multiple data users, forming a national federated data network.
The stack will include:
- National standards, guidelines and procedures that ensure key elements of freshwater and terrestrial biodata related to species occupancy/occurrence can be archived and published consistently in any organisation across New Zealand;
- National systems to support consistent archiving and publishing of biodata across New Zealand;
- Standards-based web-services to enable biodata federation.
This document is intended to outline principles and options for the implementation of this stack based on existing data, metadata, transfer and storage standards.
Note: for the purposes of this document the word organisation is used to include all organisations, institutions, councils, government agencies, public or business organisations involved in the collection, dissemination and use of data including, but not limited to, sample, observation and occurrence data.
- Each organisation may have one or more datasets to be federated.
- Each organisation will need to update / compile those datasets for conformance with the Biodata Services Stack data standards, including:
  - Metadata describing the content of the dataset, how it was created and how it may be used.
  - Data standards for each data record, including:
    - National identifiers, e.g. for institutions and taxa;
    - Mandatory fields;
    - Field name conventions;
    - Vocabularies and rules for field values;
    - Temporal and geographical extent formats.
- Each standards-compiled dataset will be archived, either within the originating organisation, or by arrangement with another organisation.
- Archived data will then be federated, by publishing the data from the archives and enabling harvesting of that data through agreed standard publishing mechanisms.
- Published metadata will enable the discovery of datasets that may assist end users with their use case requirements.
- Published data can then be harvested to local caches or mobilised for direct use by end users.
This will require:
- Agreed standards on data structures and data exchange formats enabled through national standards (NEMS).
- Data providers implementing organisational workflows for data management and archiving consistent with the standards and infrastructure.
- Data providers implementing data publishing mechanisms consistent with the standards and infrastructure.
- National tool sets (conformance testing, vocabularies, publication infrastructure etc.) so that organisations have the means at hand to check that their institutional processes conform to the standards.
- National brokering services to support the infrastructure, including
  - A national registry system (incl. responsible organisation / governance)
  - A range of national vocabulary systems (incl. responsible organisation / governance)
  - A system for providing stable URIs / domains to support the above (incl. responsible organisation / governance)
- Data consumers / users to implement data access interfaces consistent with the standards and infrastructure.
Standards for Data Preparation / Compilation
When using data for analysis it is important to know key facts about that data and its origins. Without that information, and an understanding of its implications, it is not possible to draw definitive conclusions from any analysis of that data.
When drawing multiple datasets together, away from their point of origin both geographically and temporally, it is important that relevant metadata about the dataset travel with the data.
It also assists future analysis if the data fits a common standard of meaning and vocabulary. The originators of the data are best placed to undertake this standardisation, as they have the best understanding of the original data and of how it should be translated to fit the standards being applied.
The Biodata Services Stack (BSS) must, therefore, specify standards to be used for both the metadata about the dataset and the data itself.
Data Structure - standard data format
Darwin Core is the primary relevant standard to apply to data covering taxa and their occurrence in nature. It is already an internationally accepted standard, a great deal of work has already been done to standardise the terms used, and it already has widespread tools in place to support its use.
See http://rs.tdwg.org/dwc/ for a full description.
Darwin Core includes an extensive list of terms each of which is carefully defined and has commentaries to promote their consistent use. For many of the terms there is a suggested vocabulary.
Some consideration will be required, however, as to the best application of such a standard to the BSS a) to fit the data collected by data providers; b) to meet the specified use cases of data consumers; and c) to facilitate the use of computer technology to that end.
For example, do providers record an infestation of invasive weeds as a single point or as an area?
Consideration should be given to:
- Which terms should be included and which can be ignored?
- Which, if any, of the terms will be mandatory, which preferred and which optional?
- Which terms require controlled vocabularies and
- What is in those vocabularies?
- What specification is necessary for other terms to ensure consistency between datasets - including the use of standard reference data, and
- Which terms can be free form text?
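As an illustration of how the answers to these questions could be captured in a machine-readable form, the following is a minimal sketch of a term profile and a record check. The term names are real Darwin Core terms, but their assignment to mandatory, preferred or optional, the subset of basisOfRecord values, and the function name check_record are all assumptions made for this example, not decisions of the BSS.

```python
# Hypothetical BSS profile: the level assigned to each Darwin Core term
# is an assumption for illustration only.
PROFILE = {
    "occurrenceID": "mandatory",
    "scientificName": "mandatory",
    "eventDate": "mandatory",
    "basisOfRecord": "mandatory",
    "decimalLatitude": "preferred",
    "decimalLongitude": "preferred",
    "occurrenceRemarks": "optional",  # free-form text
}

# Controlled vocabularies per term (illustrative subset of the values
# suggested by Darwin Core for basisOfRecord).
VOCABULARIES = {
    "basisOfRecord": {"HumanObservation", "MachineObservation", "PreservedSpecimen"},
}


def check_record(record):
    """Return a list of (severity, message) tuples for one data record."""
    problems = []
    for term, level in PROFILE.items():
        raw = record.get(term)
        value = "" if raw is None else str(raw).strip()
        if not value:
            if level == "mandatory":
                problems.append(("error", f"missing mandatory term: {term}"))
            elif level == "preferred":
                problems.append(("warning", f"missing preferred term: {term}"))
            continue
        allowed = VOCABULARIES.get(term)
        if allowed is not None and value not in allowed:
            problems.append(("error", f"{term}: '{value}' is not in the controlled vocabulary"))
    # Terms outside the profile are reported rather than silently dropped.
    for term in record:
        if term not in PROFILE:
            problems.append(("warning", f"term not in profile: {term}"))
    return problems
```

A check of this kind could form part of a national conformance-testing tool set, so that each provider can test a dataset before it is archived.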
There may, of course, be terms required by use cases that do not conform to the Darwin Core standard. How best to handle these – by formal extension of the standard, by informal extension locally, or by making use of the dynamic properties term – must be discussed along with the other considerations.
If the dynamic properties term is to be used, then standards must be considered for its structure and naming conventions within that structure.
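One possible convention, sketched here, is to hold the extra fields as a JSON object of key:value pairs, which is the kind of encoding Darwin Core suggests for dynamicProperties. The key naming rule (lowerCamelCase with the unit in the name) and the example keys are assumptions for illustration only.

```python
import json
import re

# Assumed naming convention: lowerCamelCase keys, letters and digits only.
KEY_PATTERN = re.compile(r"^[a-z][A-Za-z0-9]*$")


def encode_dynamic_properties(extras):
    """Serialise extra fields as a compact JSON object with sorted keys."""
    for key in extras:
        if not KEY_PATTERN.match(key):
            raise ValueError(f"key does not follow the naming convention: {key!r}")
    return json.dumps(extras, sort_keys=True, separators=(",", ":"))


def decode_dynamic_properties(text):
    """Parse a dynamicProperties value back into a dictionary."""
    if not text:
        return {}
    value = json.loads(text)
    if not isinstance(value, dict):
        raise ValueError("dynamicProperties must be a JSON object")
    return value
```

Sorting the keys keeps the output stable, so the same set of extra fields always produces the same string and unchanged records are not mistaken for updates.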
The GISIN (Global Invasive Species Information Network) Protocol Specification, http://www.gisin.org/cwis438/websites/GISINDirectory/Tech/ProtocolSpecification.php , is an example of a standard, based on Darwin Core, that has already been through a similar design process and which may form a good starting point for a data standard for the BSS.
Note that the GISIN protocol allows for other sets of data about pest species, in addition to occurrences.
Recommendation: to build a BSS Data structure based on Darwin Core.
Reference data - standard vocabularies
Fundamental to the standardisation of data across organisations and their datasets is the use of standard reference data to better ensure that data created by one organisation or individual is fully understood and usable by another. I.e. apples are compared with apples rather than pears.
All biodata will require a standard taxonomic reference, and the New Zealand Organisms Register (NZOR), http://www.nzor.org.nz/, was created to provide a common central reference for New Zealand. However, other reference sources may be required where NZOR does not have coverage, e.g. marine organisms.
Consideration must be given as to other candidates for standard reference data, either making use of existing standards or by creating new standard references.
However, consideration must also be given to:
- What action to take if the data item does not appear in the reference?
- Can it be added to the reference? How? Who is the authority?
- Can data be provided without a match to the reference e.g. an invasive foreign weed does not yet appear in NZOR
- What other data must also be included with the reference e.g. a reference/identification date?
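One way of handling these questions is sketched below: every name is checked against the reference, records without a match are kept but flagged, and the date of the check travels with the result. The REFERENCE dictionary is a stand-in for a real taxonomic reference such as NZOR (no NZOR interface is used here), and the identifiers and the field names other than taxonID are hypothetical.

```python
from datetime import date

# Stand-in for a taxonomic reference; the identifiers are made up.
REFERENCE = {
    "salix fragilis": "ref:0001",
    "ulex europaeus": "ref:0002",
}


def match_taxon(verbatim_name, reference=REFERENCE, checked_on=None):
    """Match a name against the reference, keeping unmatched names flagged."""
    checked_on = checked_on or date.today()
    # Normalise case and whitespace before the lookup.
    key = " ".join(verbatim_name.lower().split())
    taxon_id = reference.get(key)
    return {
        "verbatimName": verbatim_name,  # always keep what the provider recorded
        "taxonID": taxon_id,  # None when the reference has no entry
        "matchStatus": "matched" if taxon_id else "unmatched",
        "referenceCheckedOn": checked_on.isoformat(),
    }
```

Keeping the verbatim name and the check date means that an unmatched record, such as a newly arrived weed, can be re-matched later once the reference has been updated.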
Occurrence data must be published according to the agreed standards and using agreed mechanisms. Exactly how, and in what format, will depend on end-user needs and on the infrastructure available for accessing the data. The vocabularies required are those for observational variables related to species occupancy.
Recommendation: to build a BSS taxa vocabulary based on NZOR; to build a new BSS observation vocabulary for observational variables related to species occupancy.
Darwin Core includes terms that contain metadata about the data and if completed fully these may meet the need for basic descriptive metadata.
However, to assist with data discovery, a catalogue of available datasets can be created using more targeted and specifically designed metadata.
There are several predefined standards for storing metadata, but they all use very similar fields, including descriptions of the data, its source and creation, its temporal and geographic extent, and its status.
Metadata for each dataset needs to be archived such that:
- It is easily published and easily discovered by data consumers;
- It can be updated as more data is added to the dataset;
- End users, or automated data harvesters, can easily discover the latest update date.
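The last requirement can be illustrated with a minimal sketch of the check an automated harvester might make. It assumes the metadata record exposes its latest update as an ISO 8601 timestamp with a UTC offset; the field name lastUpdated and the function name needs_harvest are hypothetical.

```python
from datetime import datetime, timezone


def needs_harvest(metadata, last_harvested):
    """Return True if the dataset has changed since it was last harvested.

    metadata: dict with an ISO 8601 'lastUpdated' value including a UTC offset.
    last_harvested: timezone-aware datetime, or None if never harvested.
    """
    if last_harvested is None:
        return True
    updated = datetime.fromisoformat(metadata["lastUpdated"])
    return updated > last_harvested


# Usage: a record updated on 1 March 2014 (NZ time), last harvested a month earlier.
record = {"title": "Example survey", "lastUpdated": "2014-03-01T09:30:00+12:00"}
stale = needs_harvest(record, datetime(2014, 2, 1, tzinfo=timezone.utc))
```

Including the offset in the timestamp matters in a federated network, since providers and harvesters cannot be assumed to share a clock setting.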
Consideration needs to be given as to how potential data consumers can easily discover the metadata. Some possible options are:
- Each organisation stores their own metadata and makes it available for discovery. This requires a central / national registry mechanism to enable discovery.
- There are one or more central catalogues of participating organisations that have pointers to the individual catalogues;
- All metadata is harvested regularly, on update, to one or more central catalogues.
(Note: One option is to use GeoNetwork, a catalogue application, to store the metadata using the ANZLIC metadata profile mentioned previously. A single application can enable users to search across multiple catalogues, or it can facilitate the harvesting of data from one catalogue into another. For more details see http://geonetwork-opensource.org/ and see an example at http://dc.niwa.co.nz/niwa_dc/srv/eng/main.home.)
The final format of the metadata may depend on the technology to be used but both the GBIF Metadata Profile, http://www.gbif.org/resources/2559 , and ANZLIC metadata Profile, http://www.linz.govt.nz/geospatial-office/about/projects-and-news/anzlic-metadata-profile, can provide suggestions for the terms to be included.
The terms to be used need to have the same considerations applied as for the data standard.
The system to be used, and therefore the format to use for data exchange, may depend on whether central collection of metadata is a requirement.
Recommendation: to mandate a national BSS metadata profile based on existing profiles, and to provide recommendations / guidelines for agencies on how to implement it.
See the data archiving section, as the option suggested there for using GeoNetwork also covers the publishing of the metadata to facilitate discovery.
The GBIF IPT also includes metadata entry to accompany a data archive when harvested.
- Web Feature Service (WFS). Specified by the Open Geospatial Consortium (OGC, http://www.opengeospatial.org), WFS provides an interface allowing requests for geographical features. That is, a call to the interface will include some limit on geospatial boundaries, and can include other filters dependent on the content of the feature, e.g. on date. The data is returned in XML-based GML. This is particularly useful for viewing data on a map and then clicking through to view more detail about that data.
- Web Map Service (WMS). Specified by the Open Geospatial Consortium (OGC, http://www.opengeospatial.org), WMS is a standard protocol for serving georeferenced map images over the internet.
Both WFS and WMS can be used by various freely available systems, including NIWA’s Quantum Map http://www.niwa.co.nz/software/quantum-map which is based on and simplifies the Quantum GIS application http://www.qgis.org.
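As a rough illustration of a WFS call limited by geospatial boundaries, the sketch below builds a GetFeature request URL using the key-value parameters of WFS 1.1.0. The endpoint and the feature type name used in the example are hypothetical, and the order of the bounding box coordinates depends on the coordinate reference system and the server, so the order shown is an assumption.

```python
from urllib.parse import urlencode


def build_getfeature_url(base_url, type_name, bbox, max_features=100):
    """Build a WFS 1.1.0 GetFeature URL restricted to a bounding box.

    bbox: (min_x, min_y, max_x, max_y); axis order is assumed here and
    should be confirmed against the server's coordinate reference system.
    """
    min_x, min_y, max_x, max_y = bbox
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": type_name,
        "bbox": f"{min_x},{min_y},{max_x},{max_y}",
        "maxFeatures": max_features,
    }
    return f"{base_url}?{urlencode(params)}"


# Usage with a hypothetical endpoint and feature type:
url = build_getfeature_url(
    "http://example.org/wfs", "bss:occurrence", (174.0, -42.0, 176.0, -40.0), 50
)
```

The server would respond to such a request with a GML document containing the matching features.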
Complete dataset download
For detailed data analysis the full set of data may be required and, if it is file based (Excel, CSV or Darwin Core Archive), it could be made available by HTTP or FTP file download.
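Once downloaded, a file-based dataset can be read with standard tools. The sketch below reads the occurrence records from a Darwin Core Archive, which is a zip file containing a data file alongside descriptor and metadata files. It assumes the core data file is named occurrence.txt and is tab-delimited with a header row; a complete reader would take the file name and delimiter from the archive's meta.xml descriptor instead.

```python
import csv
import io
import zipfile


def read_occurrences(archive_bytes, core_file="occurrence.txt"):
    """Return the rows of the core data file as a list of dictionaries.

    archive_bytes: the downloaded archive as bytes.
    core_file: assumed name of the tab-delimited core file in the archive.
    """
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        with archive.open(core_file) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text, delimiter="\t")
            return list(reader)
```

Each row is keyed by the column headings, so records can be passed directly to any later conformance or analysis step.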
Partial dataset download
If the data is not file based, and particularly if it is a large dataset, a web service could give access to first narrow down the data required and then download a subset of the data.
Recommendation: depending on the outcome of the use cases, to define a standard for how data shall be published in a federated environment.