2. Motivations

2.1. Data models are interfaces, but are not adequately managed or treated as such

There are many examples of problems caused by subtle changes in a data file format. netcdf solves the problem of exchanging named float and integer arrays in an architecture-independent way, but that's really all it solves. Any other meaning on top of that is based on conventions, which must be known and agreed upon by all users and producers of the data files, and support for those conventions is often implemented independently in each piece of software as well.

Recent example: the GLASS ASCII ingestor was changed to write 'time' instead of the older, obsolete convention, perpetuated by Zebra, of 'base_time' and 'time_offset'. The newer convention is cleaner and more acceptable to IDV. After using the ingestor to convert some soundings, I was informed SUDS could not read the files, since SUDS does not support the newer convention. The 'CLASS netcdf format' was a de facto interface between the ingestor and SUDS. If SUDS accessed its data through a higher-level abstraction in a 'data software layer', then the change in the file interface need not have been a problem. The data model needs to be recognized as a software interface and elevated to a defined piece of the software and data design process. It needs to be designed, specified, and published just as any software interface should be.
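As a rough illustration of what such a layer would hide from SUDS, here is a minimal sketch in Python. The function name read_times and the dict standing in for a file's variables are hypothetical, and I am assuming that 'base_time' is a scalar in seconds since the Unix epoch, that 'time_offset' holds per-sample seconds relative to it, and that 'time' holds epoch seconds directly. A real implementation would consult the units of each variable rather than assume them.

```python
def read_times(variables):
    """Return sample times as seconds since the Unix epoch.

    `variables` is a hypothetical stand-in for the variables of an open
    data file: a mapping from variable name to its value(s).
    """
    if "time" in variables:
        # Newer convention: a single 'time' variable
        # (assumed here to already be epoch seconds).
        return [float(t) for t in variables["time"]]
    if "base_time" in variables and "time_offset" in variables:
        # Older convention: scalar 'base_time' plus per-sample 'time_offset'.
        base = float(variables["base_time"])
        return [base + float(dt) for dt in variables["time_offset"]]
    raise KeyError("no recognized time convention in file")


# Both conventions yield the same times to the caller.
old_style = {"base_time": 1000000000, "time_offset": [0.0, 1.0, 2.0]}
new_style = {"time": [1000000000.0, 1000000001.0, 1000000002.0]}
assert read_times(old_style) == read_times(new_style)
```

An application written against read_times never sees which convention the file uses, so a change like the one in the ingestor is absorbed in one place.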

The idea of elevating the data model to a significant, formal entity in data management is not new. ARM had a very long and formal procedure laid out for designing, critiquing, and finally approving the data object design for an instrument data stream *before* the instrument came online. Essentially the data design was the CDL for the netcdf files, but as a formal entity the design enforced, through policy, the conventions and names in the data model. Even with a software layer to implement data models, the design of a data stream will still be the most crucial step.

Most, if not all, of our data models are embedded in source code, but most of the people who need access to the data models are not programmers. PIs need to know more details about the fields in their data files, such as QC history, or maybe even the units. Prospective PIs need to know what measurements are recorded for an instrument, and in what units. Data managers need tools to query and browse data files without having an intimate knowledge of the data stream. netcdf helps some, but nothing in it enforces or facilitates something like a 'units' attribute for every field.

2.2. Consistency and sharing among data models in ATD

There are standards to which all of our data should try to conform, such as consistent naming and interpretation of attributes, or even just the specification of units in a standard syntax which other applications can interpret. A data model layer is a natural place to facilitate and enforce standards. Each instrument to come online or each new data processing product should not need to design a whole new data stream.
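To make "enforce" concrete, here is a small Python sketch of the kind of check a data model layer could run on every field definition. Everything in it is hypothetical: the Field class, the check_fields function, and the naming rule (lower-case letters, digits, and underscores) are placeholders for whatever conventions we actually agree on, and the units check only tests for presence rather than parsing a units syntax.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One field in a data model (hypothetical, for illustration only)."""
    name: str
    units: str
    long_name: str = ""


# Placeholder naming rule: lower-case letters, digits, and underscores.
NAME_RULE = re.compile(r"^[a-z][a-z0-9_]*$")


def check_fields(fields):
    """Return a list of human-readable convention violations."""
    problems = []
    for f in fields:
        if not NAME_RULE.match(f.name):
            problems.append(f.name + ": name does not follow the naming rule")
        if not f.units.strip():
            problems.append(f.name + ": missing units")
    return problems


# The second field breaks both placeholder rules.
for problem in check_fields([Field("tdry", "degC"), Field("RH", "")]):
    print(problem)
```

Because the check lives in the layer and not in each ingestor, a new data stream gets the same scrutiny as every existing one without anyone rewriting it.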

There is a lot of boilerplate code involved in reading and writing netcdf files, or in handling serial connections, and that code only needs to be written once if it can be written to work from a simple data model.

I've been working on software in several different 'instrument domains' and coming across very similar needs, and in fact very similar code. The MAPR data file interface creates a few different kinds of netcdf files for storing the MAPR time series and correlation functions. The code for creating those files is a long list of variable and attribute definitions, followed by lots of boilerplate code which marshals data values and arrays into the file. Charlie was able to exploit some of the similarities between the two kinds of files in a base class, but that does not help so much in writing other tools which need to read those same data files.
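The "long list of definitions" can instead be plain data, with the boilerplate written once against it. As a sketch only, the Python function below renders a simple data model as CDL text. The function name to_cdl, the tuple layout of the model, and the example names ('example', 'power', 'gate') are all made up for illustration and are not the actual MAPR definitions. It handles only string attributes.

```python
def to_cdl(name, dimensions, variables):
    """Render a simple data model as CDL text.

    dimensions: dict of dimension name -> size (None means unlimited)
    variables:  list of (type, name, dimension names, attributes) tuples
    """
    lines = ["netcdf " + name + " {", "dimensions:"]
    for dim, size in dimensions.items():
        size_text = "UNLIMITED" if size is None else str(size)
        lines.append("\t" + dim + " = " + size_text + " ;")
    lines.append("variables:")
    for vtype, vname, dims, attrs in variables:
        # Scalar variables are declared without a dimension list.
        shape = "(" + ", ".join(dims) + ")" if dims else ""
        lines.append("\t" + vtype + " " + vname + shape + " ;")
        for key, value in attrs.items():
            # Only string-valued attributes are handled in this sketch.
            lines.append("\t\t" + vname + ":" + key + ' = "' + str(value) + '" ;')
    lines.append("}")
    return "\n".join(lines)


# A made-up model, not an actual MAPR file layout.
print(to_cdl(
    "example",
    {"time": None, "gate": 64},
    [
        ("double", "time", ("time",), {"units": "seconds since 1970-01-01 00:00:00"}),
        ("float", "power", ("time", "gate"), {"units": "dB"}),
    ],
))
```

The same model description could just as well drive a reader, a browser for data managers, or the code that creates the files, which is the point: the definitions are written once and every tool shares them.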

2.3. Data format independence

Within ATD we've tried to find the 'holy grail' of data formats which would help us consolidate our data storage format and storage software interfaces. The fact is no such data format exists. Instead, we need to design the API which meets our needs and which we can extend easily to meet new needs. As much as possible the implementation behind the API can depend upon existing technology, but the use of the API insulates all of our software from the peculiarities and limitations of the implementation. The storage backend can take advantage of a storage library like HDF or netCDF, DODS, or even an RDBMS, but that backend should not be exposed in the API.

As an example, netcdf does not support a complex numerical type, but I think direct support for such a type would be useful in handling radar data. The data model layer can support a complex type while storing complex data to netcdf using some established convention which the rest of the software does not need to know. There is no need for every tool which reads and writes complex values to re-invent its own convention. [Right now MAPR uses an extra dimension of size 2; others might use I and Q variable name suffixes.] Other fundamental types of data can be imagined, such as wind vectors.

Further, some data might benefit from different storage backends: time series fit easily into an RDBMS model, which would allow much better searching than anything we support now. Parts of the netcdf model which are merely convention, like units (meaning nothing verifies the syntax), should be full-fledged formal objects in an ATD data layer. The current prototype takes advantage of the Xerces XML library; other tools like ESML might also be useful.
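The complex-type example can be sketched in a few lines of Python. The two classes below are hypothetical, and a plain dict stands in for the storage backend. The first follows the extra-dimension-of-size-2 idea, where the ordering (real part first) and the placement of the dimension last are my assumptions. The second uses variable name suffixes, where the exact '_I' and '_Q' spelling is also an assumption.

```python
class PairDimension:
    """Convention A: one variable with a trailing dimension of size 2."""

    def write(self, store, name, values):
        store[name] = [[z.real, z.imag] for z in values]

    def read(self, store, name):
        return [complex(real, imag) for real, imag in store[name]]


class IQSuffix:
    """Convention B: two variables named <name>_I and <name>_Q."""

    def write(self, store, name, values):
        store[name + "_I"] = [z.real for z in values]
        store[name + "_Q"] = [z.imag for z in values]

    def read(self, store, name):
        pairs = zip(store[name + "_I"], store[name + "_Q"])
        return [complex(i, q) for i, q in pairs]


# Callers see complex values either way; only the stored layout differs.
samples = [complex(1.0, -2.0), complex(0.5, 3.0)]
for codec in (PairDimension(), IQSuffix()):
    store = {}  # stand-in for a file or other backend
    codec.write(store, "iq", samples)
    assert codec.read(store, "iq") == samples
```

Tools written against write and read handle complex values directly, and the choice of on-disk convention can change, or differ per backend, without touching them.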

2.4. Towards more generic processing

The more general the data model abstraction, the more generic and more widely applicable the processing and handling tools can be. We're working on writing a new archiver for ELDORA, but why shouldn't that archiver work for any instrument data stream? If we have time series QC processing tools, why shouldn't they work on all ATD time series records?

2.5. Data logic layer

In fact, some of the buzzwords from the commercial sector, like three-tiered technology, business logic, and middleware, are very applicable here. In our case, our business is data: data models and data sets are our business objects and business logic; call them the 'data logic layer'. For all the same reasons that businesses benefit from consolidating their business logic in middleware, we would benefit from implementing our 'data logic' in a 'middleware' software layer.