Bering Sea Project Documentation and Format Guidelines

BEST Format and Documentation Guidelines

The flexibility of the EOL data management system permits data in many different formats to be submitted to, stored in, and retrieved from CODIAC.  EOL is prepared to work with the participants to bring their data to the archive and make sure it is presented, with proper documentation, for exchange with other project participants. Because we anticipate receiving many data sets from the field sites in ASCII format, we provide guidelines below that will aid in the submission, integration and retrieval of these data.  EOL will work with any participants submitting other formats, including NetCDF, AREA, HDF and GRIB, to assure access and retrieval capabilities within CODIAC.

The following ASCII data format guidelines are intended to help standardize the information provided with any data archived for the project.  These guidelines are based on EOL's experience in handling thousands of data files of differing formats.  Specific suggestions are provided for naming data files as well as for the information and layout of the header records and data records contained in each file.  This information is important when data are shared with other project participants, to minimize confusion and aid in the analysis.  An example of the layout of an ASCII file using the guidelines is provided below.  Keep in mind that it is not mandatory that the data be received in this format. However, if the project participants are willing to implement the data format guidelines described below, improved capabilities for integration, extraction, compositing and display via CODIAC become available.

NAMING CONVENTION

A)   All data files should be uniquely named.  For example, it is very helpful if the date is included in any image file name so that the file can be easily time registered.  Also include an extension indicating the type of file:
e.g.      .gif = GIF image format file
          .jpg = JPEG image format file
          .txt = Text or ASCII format file
          .cdf = NetCDF format file
          .tar = tar archive format file
If compressed, the file name should have an additional extension indicating the type of compression (e.g. .gz, .Z, etc.).
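As a sketch of these conventions, the following Python snippet builds a unique, time-registered file name from a platform ID, an instrument name, and a UTC observation time. The function name and the exact naming pattern are our own illustration, not a project requirement:

```python
from datetime import datetime, timezone

def make_filename(platform, instrument, when, ext, compressed=False):
    """Build a unique, time-registered file name, e.g.
    HLY-02-01_ctd_20020602124326.txt(.gz).  Pattern is illustrative only."""
    stamp = when.strftime("%Y%m%d%H%M%S")          # embed the UTC date/time
    name = f"{platform}_{instrument}_{stamp}.{ext}"
    return name + ".gz" if compressed else name    # add compression extension

obs_time = datetime(2002, 6, 2, 12, 43, 26, tzinfo=timezone.utc)
print(make_filename("HLY-02-01", "ctd", obs_time, "txt", compressed=True))
# HLY-02-01_ctd_20020602124326.txt.gz
```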


B)   For Text (ASCII) files, the records should consist of both header records and data records. At a minimum, the header records should contain the information described below.


ASCII DATA FORMAT HEADER RECORDS SPECIFICATIONS

Standard header records should precede the data records within the file itself. The header records should contain the following information:

PI/DATA CONTACT =   Text [PI and data contact name(s) and affiliation(s)]
DATA COVERAGE =     Start/stop time of continuous data or sampling interval (use the date/time format described below)
PLATFORM ID/SITE =  Text [e.g. Cruise #, Station(s) #, Mooring ID, etc., or site name/designation for mooring, village, fixed site, etc.]
INSTRUMENT =        Text [Mooring, CTD, VPR, etc.]
LOCATION =          Text [Range of Lat, Long coordinates or Sea name]
DATA VERSION =      Alphanumeric [unique ID (i.e. revision date), PRELIMINARY or FINAL]
REMARKS =           Text [PI remarks that aid in understanding the data file structure and contents, such as file type, how missing and/or bad data are denoted, or any other information helpful to users of the data]


Missing Value Indicator - Text or integer [value used to denote missing data] (e.g. -99 or 999.99, etc.)

Below Measurement Threshold - Text or integer [value used to signify a reading below the instrument detection threshold] (e.g. <0.00005)

Above Measurement Threshold - Text or integer [value used to signify a reading at or above instrument saturation]

**NOTE** This type of header information cannot be contained within GIF and PostScript files. Such files will need to be submitted with attached files or separate documentation containing this information.


DATA RECORDS SPECIFICATIONS:

First data record consists of parameter names identifying each column.  Multi-word parameter names should be written as one word (e.g. thaw_depth or Leaf_area_index)
Second data record consists of the respective parameter units.  Multi-word unit names should be written as one word (e.g. crystals_per_liter)
Third data record begins the actual data and consists of a standard variable set (see below) and subsequent observations at that time and position.
Data records need to include a common set of standard variables so that the data are placed within the proper context of where they were collected. The appropriate standard variables vary somewhat depending upon what instrument was used to collect the data. In general, the standard variables that should be included in every cruise data set submitted for BEST are: Cruise ID, Station Name, Station #/Mooring #, Event (Cast) #, Bottle #, Date, Time, Latitude, Longitude, Depth, in that order, followed by other PI data. For terrestrial data sets, the standard variables that should be included are Date, Time, Latitude and Longitude, in that order, followed by other PI data (see examples below).


  • Date/time must be in UTC and recommended format is: YYYYMMDDHHmmss.ss

where:  YYYY = Year
        MM = Month (01-12)
        DD = Day (01-31)
        HH = Hour (00-23)
        mm = Minute (00-59)
        ss = Second (00-59)
        .ss = Decimal second (resolution as needed for the sampling frequency)
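For PIs generating files programmatically, the recommended date/time format can be produced as in the minimal Python sketch below (the function name is our own):

```python
from datetime import datetime, timezone

def best_timestamp(dt, decimals=2):
    """Format a datetime as YYYYMMDDHHmmss.ss in UTC."""
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc)   # convert aware datetimes to UTC
    frac = dt.microsecond / 1e6            # fractional seconds
    # strftime gives the integer part; append the decimal-second part
    return dt.strftime("%Y%m%d%H%M%S") + f"{frac:.{decimals}f}"[1:]

dt = datetime(1999, 8, 21, 13, 35, 0, 500000, tzinfo=timezone.utc)
print(best_timestamp(dt))  # 19990821133500.50
```

Note that naive datetimes are assumed to already be in UTC; PIs recording local time should convert before formatting.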

  • For every data set, position coordinates (i.e. latitude, longitude) should be expressed in decimal degrees for each data point.  Water Depth should be reported in appropriate metric units. This may be done by: (a) providing date/time of collection with position coordinates in each data record; or (b) providing date/time of collection for each data point in the submitted file, with an associated file containing date/time and location either from the platform navigation database or GPS file.


Latitude -    Northern hemisphere expressed as positive or "N" and
              Southern hemisphere expressed as negative or "S".
Longitude -   0° to 360° moving east from Greenwich; west longitude goes from
              180° to 360°; or
              Eastern hemisphere expressed as positive or "E" and
              Western hemisphere expressed as negative or "W".

NOTE – Position information in other grid conventions is acceptable but a conversion to latitude/longitude should be provided where practical.

NOTE - Having a common date/time stamp and common position coordinates in each data record will permit the ability to extract data and integrate multiple data records from different data sets.  If two times are provided (e.g. UTC and local), they should be put at the beginning of each record.
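Because either longitude convention above is acceptable, converting between them is a common step when integrating data sets. A minimal Python sketch (the function names are our own):

```python
def lon_0_360(lon_signed):
    """Signed longitude (-180..180, west negative) -> 0..360 east of Greenwich."""
    return lon_signed % 360.0

def lon_signed(lon_east):
    """0..360 longitude east of Greenwich -> signed -180..180."""
    return ((lon_east + 180.0) % 360.0) - 180.0

# e.g. 155.6294 W becomes 204.3706 in the 0-360 convention,
# and 263.116 (0-360) becomes 96.884 W in the signed convention.
```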


  • Preferred format for ASCII data files is space-, comma- or tab-delimited columns, with a UTC date/time stamp at the beginning of each data record. If the data in the file are comma delimited, decimal separators must be periods, not commas.
  • All data files must contain variable names and units of measurements as column headings (if applicable).


  • If, for some reason, the PI cannot provide the date/time in the format shown above, it is important that the time be given in UTC. If local time is also supplied, a conversion to UTC must be provided. In addition to UTC and/or local time, other date/time formats (e.g. decimal days) can be used but must be fully documented.
  • The internal format structure of the file should remain constant after the first data record to ensure continuity and permit plotting and graphing.


  • Only COMPLETE replacement or updated data/metadata files can be accepted.


SAMPLE CRUISE DATA SET (ASCII FORMAT):

The following is an example of an ASCII format data set in which the header precedes the reported data, and the data are organized in columns separated by spaces. Each column is identified by parameter, and each parameter's units of measure are listed in the respective column. Each row also has a date/time of observation reported in Coordinated Universal Time (UTC) along with position coordinates. This data set organization is ideal for plotting and for integration of various data sets. It should be used whenever possible and can easily be produced automatically from a spreadsheet program.

------------------------------------------------------------------------------------------------------------

PI/DATA CONTACT= Doe, John (Old Dominion)/ Doe, Jane (WHOI)
DATA COVERAGE = START: 19990821133500; STOP: 19990821135500 UTC
PLATFORM ID = HLY-02-01
INSTRUMENT = CTD Water Sampling
LOCATION = Barrow Canyon Station 12 (BC 4)
DATA VERSION = 1.0 (10 March 2003), PRELIMINARY
REMARKS = Old Dominion University
REMARKS = ppm values are mole fraction
REMARKS = nM/m3 at 25°C and 101.3 kPa; DMS and NH4 in parts per million (ppm)
REMARKS = Missing data = 99.9; Bad data = 88.8
REMARKS = Data point Date/Time provided in UTC

Cruise_ID   Stn_Name   Stn   Evnt   Bottle   Date/Time        Lat       Lon         Depth   O2
CHAR        CHAR       Int   Int    Int      UTC              Deg       Deg         M       UMOL/KG

HLY-02-01   NP13       12    3      4        20020602124326   68.2378   -155.6294   120     0.13
HLY-02-01   NP13       12    3      5        20020602124718   68.2378   -155.6294   120     0.18
.
.
.
etc.
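Files laid out this way are easy to parse programmatically. The following is a minimal Python sketch of a reader for this layout (the function name is our own; the missing/bad-data codes are taken from the sample REMARKS and will differ per data set):

```python
def read_best_ascii(path, missing=("99.9", "88.8")):
    """Parse a BEST-style ASCII file: 'KEY = value' header records,
    then a parameter-name row, a units row, and space-delimited data
    records.  Tokens matching the missing/bad-data codes become None."""
    def convert(tok):
        if tok in missing:
            return None                        # missing or bad data code
        try:
            return float(tok)
        except ValueError:
            return tok                         # text fields stay as strings

    header, body = {}, []
    with open(path) as f:
        for ln in (l.strip() for l in f):
            if not ln:
                continue                       # skip blank separator lines
            if not body and "=" in ln:         # header record, e.g. REMARKS = ...
                key, _, val = ln.partition("=")
                header.setdefault(key.strip(), []).append(val.strip())
            else:
                body.append(ln.split())
    names, units, data = body[0], body[1], body[2:]
    rows = [dict(zip(names, map(convert, rec))) for rec in data]
    return header, dict(zip(names, units)), rows
```

This sketch assumes the guidelines above are followed: one-word parameter and unit names, and a common delimiter throughout the data records.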

SAMPLE TERRESTRIAL DATA SET (ASCII FORMAT):

The following terrestrial example follows the same layout as the cruise example above: the header precedes the reported data, the data are organized in space-delimited columns identified by parameter name and units, and each row carries a date/time of observation in Coordinated Universal Time (UTC) along with position coordinates.

------------------------------------------------------------------------------------------------------------

PI/DATA CONTACT= Doe, John (U of Hawaii)/ Doe, Jane (NCAR)
DATA COVERAGE = START: 19990821133500; STOP: 19990821135500 UTC
PLATFORM/SITE = C130
INSTRUMENT = C-130 External Sampler Data
LOCATION = mobile
DATA VERSION = 1.0 (10 March 1999), PRELIMINARY
REMARKS = National Center for Atmospheric Research, INDOEX
REMARKS = ppm values are mole fraction
REMARKS = nM/m3 at 25°C and 101.3 kPa; DMS and NH4 in parts per million (ppm)
REMARKS = Missing data = 99.9; Bad data = 88.8
REMARKS = Data point Date/Time provided in UTC

DATE/TIME          LAT       LONG      SAMPLE     NO2       CO       DMS      NH4
UTC                Deg       Deg       NUMBER     nM/m3     ppm      ppm      ppm
19990821133500.00  -43.087   263.116   E1.160.1   1000.65   200.67   345.98   2342.980
19990821133510.00  -43.090   263.120   E1.160.2   1003.45   200.60   349.76   2353.345
.
.
.
etc.
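Producing such a file from a spreadsheet export or a processing script is straightforward. A minimal Python sketch of a writer that emits header records, parameter names, units, and space-delimited data (the function name and argument layout are our own):

```python
def write_best_ascii(path, header, names, units, rows):
    """Write a BEST-style ASCII file: 'KEY = value' header records,
    a blank line, then parameter names, units, and data records."""
    with open(path, "w") as f:
        for key, val in header:                # header is a list of (key, value)
            f.write(f"{key} = {val}\n")        # pairs, so REMARKS may repeat
        f.write("\n")
        f.write(" ".join(names) + "\n")        # one-word parameter names
        f.write(" ".join(units) + "\n")        # one-word unit names
        for row in rows:
            f.write(" ".join(str(v) for v in row) + "\n")
```

Passing the header as a list of pairs rather than a dictionary preserves record order and allows the multiple REMARKS lines shown in the samples above.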


Data Documentation Guidelines

The documentation (i.e. the "Readme" file) that accompanies each project data set is as important as the data itself.  This information permits collaborators and other analysts to become aware of the data and to understand any limitations or special characteristics that may impact its use elsewhere.  Data set documentation should accompany all data set submissions and contain the information listed in the outline below.  While not every data set will have information in each documentation category, the following outline (and content) should be adhered to as closely as possible to make the documentation consistent across all data sets.  It is also recommended that a documentation file accompany each preliminary and final data set submission.


---TITLE: This should match the data set name

---AUTHOR(S):
-Name(s) of PI and all co-PIs
-Complete mailing address, telephone/facsimile Nos., web pages and E-mail address of PI
-Similar contact information for data questions (if different than above)

---FUNDING SOURCE AND GRANT NUMBER:

---DATA SET OVERVIEW:
-Introduction or abstract
-Time period covered by the data
-Physical location of the measurement or platform (latitude/longitude/elevation)
-Data source, if applicable (e.g. for operational data include agency)
-Any World Wide Web address references (i.e. additional documentation such as Project WWW site)

---INSTRUMENT DESCRIPTION:
-Brief text (i.e. 1-2 paragraphs) describing the instrument with references
-Figures (or links), if applicable
-Table of specifications (i.e. accuracy, precision, frequency, etc.)

---DATA COLLECTION and PROCESSING:
-Description of data collection
-Description of derived parameters and processing techniques used
-Description of quality control procedures
-Data intercomparisons, if applicable

---DATA FORMAT:
-Data file structure, format and file naming conventions (e.g. column delimited ASCII, NetCDF, GIF, JPEG, etc.) 
-Data format and layout (i.e. description of header/data records, sample records)
-List of parameters with units, sampling intervals, frequency, range
-Description of flags, codes used in the data, and definitions (i.e. good, questionable, missing, estimated, etc.)
-Data version number and date

---DATA REMARKS:
-PI's assessment of the data (i.e. disclaimers, instrument problems, quality issues, etc.)
-Missing data periods
-Software compatibility (i.e. list of existing software to view/manipulate the data)

---REFERENCES:
-List of documents cited in this data set description


BSIERP Format and Documentation Guidelines

All metadata are to be submitted in XML file format, created by the MetaVist application. XML files are plain ASCII text. Please do not use editors that save files in proprietary formats (Microsoft Word, WordPad, etc.) unless you know how to export the file back out as plain text. If you are editing the FGDC metadata file by hand, you may need to consult the FGDC standard for proper placement of XML elements.

The frequently asked questions (FAQ) are automatically generated using the metadata parser (mp) from the USGS FGDC metadata toolbox. The XML metadata records you submit are processed into various other supporting documents. BSIERP data managers will make additional modifications to the metadata record so that datasets will be integrated into the Alaska Marine Information System (AMIS).

Software Choices for Creating Metadata

  • MetaVist
    • Free
    • Requires Microsoft Windows; use version 1.6 or MetaVist 2005 (December 2007)
    • Can choose between .NET versions 1.1 and 2.0 (.zip)
  • NPS Metadata Tools and Editor (v 1.1)
    • Published by the National Park Service and the U.S. Department of the Interior
    • Reads and writes FGDC XML-compatible files
    • Will produce several warnings when passed through the USGS metadata parser (mp)

Note: This tool supports only a subset of FGDC metadata attributes.

Both programs support the import of taxonomic information via ITIS. However, you may use whatever program you like (e.g. ArcCatalog) to edit the XML metadata template, as long as the resultant XML document conforms to FGDC metadata standards and is parseable by the metadata parser mentioned above.

Validating Metadata

There are several validation services that check your XML for well-formedness and FGDC conformance.

  • XML Well-Formedness Validation Service
    • Checks to see if the XML is properly formed. If you are planning to edit XML files manually outside a FGDC metadata editor, you can use this tool to make sure the XML itself is valid.
  • FGDC XML Metadata Validation Service
    • Checks finished, valid XML documents. Data Management will use additional software to produce human-readable documents and to validate content of the metadata record. Data Management may also request additional changes, which will be documented on this website.
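Before submitting to either service, a quick local check can catch basic XML errors. The sketch below checks well-formedness only, not conformance to the FGDC content standard (the function name is our own; content validation still requires mp):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the XML text parses, i.e. is well-formed.
    This does NOT check FGDC/CSDGM content validity."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:      # mismatched tags, bad entities, etc.
        return False

print(is_well_formed("<metadata><idinfo/></metadata>"))  # True
print(is_well_formed("<metadata><idinfo></metadata>"))   # False
```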