IDPF

The ISS Data Processing Facility

Here are some thoughts about the broader issues which will affect the design of this facility. This is neither exhaustive nor complete; it is just meant as a starting point for a discussion of the design. Ultimately what we are able to build will be limited by our resources. Like the homeowner who is remodelling the house, we will define our desires first and then start removing rooms when we get the bid from the contractor!

Users

An examination of a few types of user is a good way to start thinking about the requirements for the IDPF. There are a number of distinct ways in which this tool will be utilized, and each will have very different modes of interaction. The basic functionality has to support the needs of at least the following types of users:

Data Processing/Quality Control

These are the staff members who carry out the routine post processing of field-collected data sets. These users need a structure which allows configuration of the post processing procedures, and then the ability to execute them in a batch mode. Capability for intermediate checks (tabular or graphical) is required.

Typical scenario

The SSSF data group receives data from a project which operated 4 ISS sites. From each site they receive a QIC tape with CLASS data, an ANSI optical disk containing surface met data, and an optical disk from the profiler containing spectral data. The data from each instrument type needs to be carried through a sequence of formatting and quality control steps.

In this case, a data processing configuration would probably be already designed and available to the staff in a "turn key" form, although it might need some fine tuning. The technician would carry out the following steps:

  1. Set up a project configuration. This would include designating source and destination directories, choosing processing options, and specifying output products.
  2. Execute "steps" in the post-processing sequence. A given step could involve multiple transformations.
  3. At the end of the processing, the data products would be archived and/or delivered to RDP for distribution.

The breaks between steps are provided simply to allow the process to be halted at one stage before proceeding on to the next. There may be (for whatever reason) a need to repeat a step, and the steps would be designed so that the preceding one could be repeated without needing to return to the very start of processing. It is also useful to have discrete steps defined for operations that require operator interaction, such as graphical viewing or data editing. You need to be able to initiate the latter types of activities on demand, rather than having them happen whenever the processing scheme happens to get to that point.
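As a rough illustration of the step structure described above, here is a minimal Python sketch. Every name in it (Step, ProjectConfig, run_from, the directory paths, the sample transformations) is hypothetical and invented for this example; it is not a proposed IDPF interface. It assumes that the output of each step is saved, so that a step can be repeated from the saved output of the one before it, and that interactive steps are simply passed over in batch mode.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# All names here (Step, ProjectConfig, run_from) are hypothetical.


@dataclass
class Step:
    name: str
    transforms: List[Callable[[list], list]]  # one step may bundle several transformations
    interactive: bool = False                 # operator-driven; not run in batch mode


@dataclass
class ProjectConfig:
    source_dir: str
    dest_dir: str
    steps: List[Step]
    # Output of each step, kept so that a step can be repeated later
    # without returning to the very start of processing.
    checkpoints: Dict[str, list] = field(default_factory=dict)

    def run_from(self, raw: list, start: Optional[str] = None) -> list:
        names = [step.name for step in self.steps]
        first = 0 if start is None else names.index(start)
        # Input is the raw data, or the saved output of the preceding step.
        data = raw if first == 0 else self.checkpoints[names[first - 1]]
        for step in self.steps[first:]:
            if not step.interactive:
                for transform in step.transforms:
                    data = transform(data)
            # Interactive steps pass data through unchanged in batch mode.
            self.checkpoints[step.name] = data
        return data


config = ProjectConfig(
    source_dir="/data/raw",        # hypothetical paths
    dest_dir="/data/products",
    steps=[
        Step("format", [lambda d: [float(x) for x in d]]),
        Step("qc", [lambda d: [x for x in d if x >= 0.0]]),
        Step("review", [], interactive=True),
        Step("average", [lambda d: [sum(d) / len(d)]]),
    ],
)
first_pass = config.run_from(["1.0", "-9.0", "3.0"])

# Tighten the quality-control rule, then repeat from "qc" using the saved
# output of "format" rather than re-reading the raw data.
config.steps[1].transforms = [lambda d: [x for x in d if x >= 2.0]]
second_pass = config.run_from([], start="qc")
```

Keeping the saved outputs in memory is only a convenience for the sketch; in practice they would presumably live in the data store as intermediate results.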

Scientists

The scientists are typically developing analysis algorithms, or analyzing data sets in support of their research activities. In either case, they require an environment which makes it easy to manipulate and peruse data sets. In this type of usage, the scientist will often iterate over a process of modifying the analysis, running the data through it, and viewing the results. These users need a framework that supports importing, extracting and organizing data sets; applying the analysis algorithms; and developing and using custom graphical displays. Frequently the scientist will be examining and combining data from completely independent observing systems.

Typical scenarios

A scientist is studying TOGA/COARE data in relation to the Madden-Julian Oscillation. She wants to examine a time series of CAPEs that are calculated from ISS CLASS soundings, by looking at a display of the time series and a display of the power spectrum of the time series. She also wants to examine individual soundings, and "knock out" obviously erroneous data segments which are invalidating the CAPE calculations. She will make a first pass and examine the displays mentioned, looking for wild results. If she notices areas where the results are suspicious, she will examine (and edit, if appropriate) the associated soundings, re-run the computations, and examine the graphical products. She will iterate through this process several times.

In another scenario, a scientist is experimenting with wind derivation techniques using the spaced antenna profiler. His basic data set is a complex time series of radar returns from the four receiver panels. These time series are subjected to a variety of parallel processing paths in order to compare winds computed by a variety of methods. He will be writing and modifying the processing algorithms, and will be creating a new output product, using the same input data, for each run. It is important to note that the output products can be identical in all appearances (time and date, variable names and dimension, etc.); that their only difference may be the method used to compute them. He then wants to make visual and quantitative comparisons between the methods. In addition, he will "instrument" some of the programs along the processing chain in order to understand and verify some part of the algorithm. He will be devising new graphics displays almost continuously, as he develops and refines his algorithms.
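The point about output products that are identical in all appearances suggests that the method used to compute a product has to be recorded alongside it. A minimal Python sketch of one way to do that follows. The names (ProductKey, ProductStore, "method_a", "method_b") and the sample values are all hypothetical, and the sketch assumes that the two products being compared have the same number of samples.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical names; not part of any existing data store interface.


@dataclass(frozen=True)
class ProductKey:
    platform: str
    variable: str
    method: str  # provenance tag: the only field that differs between runs


class ProductStore:
    def __init__(self) -> None:
        self._products: Dict[ProductKey, List[float]] = {}

    def put(self, key: ProductKey, values: List[float]) -> None:
        self._products[key] = list(values)

    def methods(self, platform: str, variable: str) -> List[str]:
        """List the methods for which this product has been computed."""
        return sorted(k.method for k in self._products
                      if (k.platform, k.variable) == (platform, variable))

    def mean_abs_diff(self, platform: str, variable: str,
                      method_a: str, method_b: str) -> float:
        """Quantitative comparison; assumes both products have equal length."""
        a = self._products[ProductKey(platform, variable, method_a)]
        b = self._products[ProductKey(platform, variable, method_b)]
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


store = ProductStore()
store.put(ProductKey("profiler", "u_wind", "method_a"), [5.0, 6.0])
store.put(ProductKey("profiler", "u_wind", "method_b"), [5.5, 6.5])
available = store.methods("profiler", "u_wind")
difference = store.mean_abs_diff("profiler", "u_wind", "method_a", "method_b")
```

With the method as part of the key, two runs over the same input data never overwrite each other, and both remain available for visual or quantitative comparison.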

Software Engineers

The software engineers will use the system in much the same manner as the scientists and data processors. They will typically be building individual analysis modules that are used for post processing and analysis, and will be designing and testing IDPF configurations to support both.

Requirements

Here are some ideas regarding useful features that the IDPF might provide. The following is a list of ideas that come to mind when thinking about IDPF users and their typical needs.

  1. Data set organization
    This refers to the management of "raw" data, intermediate results, and output products. One needs to be able to easily configure, change and augment the structure of the data base. "Platforms" may have short lifetimes. The formats of data within a platform may change frequently. The Zeb data store appears to be an excellent tool for meeting these needs. Perhaps all that would need to be added would be a graphical configuration tool for "ds.config"?
  2. Project organization
    The term project is used in a very loose sense here. It represents the collection of configurations that are used for whatever task is being carried out. These configurations tie together data sets, processing steps, graphics configurations, and other things that I haven't thought of yet.
  3. Process flow: design and control
    This refers to the ability to string together a sequence of processing steps, with specification of the input sources, filters, and output destinations. Since it is possible to have multiple inputs, and teed outputs, this is better thought of as a "process net". Tools would be provided for both specifying and executing the process net. Ikp and AVS/Express provide two good examples of a visual programming approach to this requirement. A scripting language would probably be adequate as well. Nested process nets would be nice. A break-point capability could also be useful.
  4. Graphical display
    A flexible interactive presentation environment is essential. There is a need for both sophisticated built in graph types (e.g. skewt) as well as the ability to create custom graphical displays from low level building blocks. Tight coupling between graphics and analysis routines is desirable, e.g. so that a user can utilize the graphical display to select points for editing or to define a region that selects data for the next analysis step.
  5. Modular processing
    The most manageable approach will be to isolate the analysis activities into independent filters and functions which perform only a few operations in a given pass. These could be either stand-alone processes or linkable functions. Perhaps a common interface can be defined so that the same code could be used in a process (i.e. Unix filter) approach as well as for the interactive analysis mode discussed next. The interfaces would need to be fairly simple and strictly defined, so that users can easily design and integrate their own routines. Tested and useful analysis modules would be placed into a library so as to be available to other users. An individual user would be able to maintain their own private collection of analysis modules.
  6. Interactive analysis
    A very useful feature for the scientist would be the ability to manipulate and analyze data sets in an interactive environment. PV-Wave, Splus and Matlab are examples of packages that provide this sort of functionality. The main issue concerning this capability is how to integrate it with the other requirements listed above. Whatever tool is used or developed here needs to be able to interact with the data store, project configuration and graphics display in a transparent manner. Since these types of packages are being loaded with increasingly useful features, many of which overlap with our other needs, we could consider using a third party package explicitly as the base for the IDPF development. The third party package could be operated in a "batch" mode for the routine data post processing.
  7. Freely distributed software
    It is desirable that NCAR is able to freely distribute the package or some of its major elements. It is possible that the package could depend on either commercial or public domain supporting software, or that only limited functionality is available in versions used without supporting packages.
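To make the "process net" of item 3 a little more concrete, here is a minimal Python sketch of a scripted (rather than visual) specification. It is only an illustration under stated assumptions: the net is described as a dictionary, each node is an ordinary function, and the node names, the run_net function and the sample data are all hypothetical. Execution order comes from the standard library's graphlib module (Python 3.9 or later), which guarantees that a node runs only after all of its inputs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: node name -> (function, upstream node names).
Net = Dict[str, Tuple[Callable[..., object], List[str]]]


def run_net(net: Net, sources: Dict[str, object]) -> Dict[str, object]:
    """Run every node once, after all of its inputs are available."""
    graph = {name: set(inputs) for name, (_, inputs) in net.items()}
    results = dict(sources)
    for name in TopologicalSorter(graph).static_order():
        if name in results:
            continue  # an input source supplied by the caller
        func, inputs = net[name]
        results[name] = func(*(results[i] for i in inputs))
    return results


net: Net = {
    "despike": (lambda w: [x for x in w if x < 5.0], ["winds"]),
    # Teed output: two nodes consume the result of "despike".
    "mean_wind": (lambda w: sum(w) / len(w), ["despike"]),
    "max_wind": (lambda w: max(w), ["despike"]),
    # Multiple inputs: one node draws on two independent branches.
    "summary": (lambda m, t: (m, max(t)), ["mean_wind", "temps"]),
}
results = run_net(net, {"winds": [2.0, 4.0, 6.0], "temps": [10.0, 20.0, 30.0]})
```

Because a node is just a callable, a nested net could in principle be wrapped as a single node that calls run_net on an inner net. A break-point would amount to stopping the loop at a named node and returning the results computed so far.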

IDPF Components

Here is a first cut at defining how the IDPF system might be broken down into named components or subsystems. It simply defines names for the first six of the requirements listed in the previous section.

  1. A data store, with a configuration editor and data management tools
  2. A project configuration editor
  3. A process controller
  4. A graphical display package
  5. An analysis library containing specific processing routines
  6. An interactive analysis tool
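
For the analysis library, the idea of one common interface serving both the Unix filter approach and the interactive mode could look something like the following Python sketch. It is an assumption-laden illustration only: the interface (a function from an iterable of numbers to an iterable of numbers), the adapter as_filter, the sample module drop_missing, the -999.0 missing-value flag and the one-number-per-line text format are all hypothetical choices made for the example.

```python
import io
from typing import Callable, Iterable, TextIO

# Hypothetical common interface: a module maps a stream of values to a
# stream of values and knows nothing about files or displays.
Module = Callable[[Iterable[float]], Iterable[float]]


def drop_missing(values: Iterable[float], missing: float = -999.0) -> Iterable[float]:
    """Sample analysis module: discard samples equal to the missing-value flag."""
    return (v for v in values if v != missing)


def as_filter(module: Module, infile: TextIO, outfile: TextIO) -> None:
    """Adapt a module to filter style: one number per line in, one per line out."""
    values = (float(line) for line in infile if line.strip())
    for v in module(values):
        outfile.write(f"{v}\n")


# Interactive use: call the module directly on data already in memory.
cleaned = list(drop_missing([1.0, -999.0, 2.5]))

# Filter use: the same module, driven through text streams.
source = io.StringIO("1.0\n-999.0\n2.5\n")
sink = io.StringIO()
as_filter(drop_missing, source, sink)
filtered_text = sink.getvalue()
```

Run as a stand-alone process, the same module would be invoked as as_filter(drop_missing, sys.stdin, sys.stdout), so only the adapter differs between the two modes.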