Elaboration

Domain Modeling

After inception begins the more involved process of elaboration. This is where Fowler mentions the importance of use cases in discovering and then describing the ways in which the desired system will be used, or the system's required abilities. The biggest challenge here is to reconcile the (sometimes incongruent) views of the problem between the domain experts (and the eventual users of the system) and the programmers. That reconciliation requires very careful and thorough communication, and use cases can be the primary form of that communication. If the use cases paint a clear picture of what the programmer thinks the system needs to do, then the users will be able to tell the programmers in which places they have missed the boat. So to those domain experts reviewing this document, make very sure the use cases represent your desires for the system, so that the wrong system is not implemented. (Hence the so-called requirements risk.) Note that the UML specifies a notation for use case diagrams, but those are not use cases in themselves. A use case requires a description and supporting text to be useful.

The UML Distilled book quotes a pointed remark from Brad Kain: "Analysis occurs only when the domain expert is in the room (otherwise it is pseudo-analysis)." So far I've only done pseudo-analysis. I need to do some real-life interviews and get an honest exchange going with the domain experts.

Class diagrams drawn from the conceptual perspective are another important documentation tool during elaboration. Naturally, the elaboration of use cases influences the class diagrams, and the attempt to model the domain with class diagrams can influence the use cases.

The goal, however, is not to produce a set of pretty class and use case diagrams. The goal is an accurate understanding of the problem domain for both user and programmer: a mutual understanding, faithfully represented and verified through documentation and a common modeling language. I keep forgetting that. A strength of UML Distilled is that rather than just explaining the semantics of UML notation, it also explains where the diagrams fit into an analysis and design process. Diagramming is not sufficient; often brief, written explanations are more effective. Without drawing another line of UML, just working through a design method and documenting the results is a big step up for my software development, or so I'm hoping this exercise will show.

Domain Expert Interviews

Charlie and I chose a few key experts in the problem domain of profiler data management and analysis, and also in data processing of SSSF facility data in general. It is interesting to note that so far the programmers have chosen the experts. I'm thinking it makes more sense for us to also ask the domain experts to identify the other domain experts, so that we make sure we don't miss anyone. Maybe that's not so important for such a small project, but I think it's an interesting consideration and less obvious for in-house developments.

Also, Charlie and I, and other programmers, are potential users of the system. (See the Programmer actor in the Use Cases section.) So perhaps we should interview ourselves, to make sure we have really thought about how we will use the system rather than only how we would like to design it.

Erik Miller

Erik mostly performs the roles of Scientist and Data Manager. For example, he selects the quality control and processing algorithms which generate the final datasets for distributions, and he responds to researcher requests for those datasets. Also, he manages data from multiple facilities, taking control of the data once they arrive from the field, which involves intimate (and sometimes painful, I imagine) awareness of disk space allocation and availability on the ATD network.

There is some difficulty in identifying the tasks which the proposed tool should, would, or could help Erik accomplish, considering that a choice has not yet been made between the profiler processing alternatives. So we need to make sure that the requirements and use cases include the need to assist in that choice. This is the data analysis function (see Analysis): processing a data set multiple ways and then comparing and visualizing the results to identify an acceptable processing algorithm. Once an algorithm or set of algorithms has been chosen, the system needs to batch process data using the specified algorithms. The scenario is thus: Existing DBS profiler data are processed to identify the appropriate algorithms and parameters, after which those algorithms and parameters are used to process future DBS profiler data. Future analysis might be necessary to identify improvements or new alternatives to the current processing, and those alternatives should also be easily available for batch processing new data.

One interview question I thought might be useful was how the domain experts think their work should look in the short term future, at least relative to the problem domain. Erik hopes the profiler analysis methods will be in place and in use for automatic batch processing of incoming profiler data. This means choosing the moments, QC, and wind and temperature derivations from profiler spectral data, which again comes back to the analysis scenario above.

I tried to categorize all the information from the interview into the following sections.

Shared QC Processing

In one of the first discoveries, the interview turned up the scenario of multiple people sharing in the QC processing of a dataset. For example, Errol runs the dropsonde processing, and Erik checks skew-t plots of the soundings before approving the final product. This is a use case I would not have included without the interview. This scenario might imply introducing a new actor, someone between a Data Manager and Technician. I think it suffices to use one or both of the existing roles and to highlight the possibility of multiple actors in the QC Processing use case. Note this requires some way for the multiple roles to track and exchange processing information, such as algorithm parameters or processing progress.

Vocabulary

We talked about a few common terms and tried to narrow them down: facility, instrument, sensor, instrument package. Examples of facilities are ISS, ISFF, GLASS, so perhaps a workable distinction is that a facility is a "deployment unit". Note that ISS includes GLASS, and something like MAPR might someday become part of ISS, so I guess a facility does not preclude the inclusion of other facilities. An instrument, such as a profiler, is part of a facility. Or an instrument can also be known as a sensor. I don't know if there is any useful distinction. An instrument package can be known as simply an instrument, except that it is an instrument composed of multiple sensors. I don't even know whether it is worth distinguishing and trying to define these terms more clearly. It might be useful to an outsider trying to understand our terminology, but internally maybe the understanding of the terms is clear and consistent enough already.

Erik used the terms "working products" and "final products". The working products are the intermediate results of alternate analyses, such as different correction schemes, which will not be released to researchers. The final products are the data which are released to researchers after being processed for quality control.

Existing Tools

I asked Erik what tools he currently uses for processing needs. For soundings, there is SUDS: http://www.atd.ucar.edu/rdp/suds.html, sounding analysis in LABEX95 (found with the ATD site search engine!). Some sounding reprocessing used to be done through the CLASS program, but that has been replaced by GLASS. Profiler processing has comprised several programs: the Profiler Operating Program (POP) which controls the DBS radar in the field and generates winds and temperatures with a consensus algorithm, a fuzzy logic program from RAP which has been adapted to read profiler data from netCDF files with our local conventions, and a program which implements the Weber-Wuertz algorithm.

Since much of Erik's work requires comparing quality control and correction algorithms, and in the case of the profiler, derivation algorithms for meteorological measurements, visualization and graphical tools are extremely useful and important. In particular, he has used SPlus to generate plots for the profiler algorithm comparisons and for documentation of project datasets.

Documentation Products

Erik's example of the SCMS dataset documentation (or data quality report) suggests another possible use case: the ability to compile the important and relevant information about a final dataset into a summary document about that dataset. Just as processing parameters and derivation history are important to us internally, such information would also be important to external consumers of the data. However, as long as such information is accessible to users, the system itself need not directly support the generation of data reports. So I think this documentation need is an important consideration, but a non-requirement for now. See the data quality report entry in the Glossary section.

There are a few examples of data quality reports published on the WWW with SSSF project datasets, though the Web forms do not include all of the plots in the printed report. The data plots and graphical summaries should also be considered vital parts of the dataset documentation.

It might be nice to be able to associate externally generated documentation, such as complicated (e.g., hard to generate) graphs, with the data. Once attached, anyone else accessing the data could take advantage of the existing information. Someone looking for data with specific characteristics can refer to the documentation. Likewise, users involved in shared analysis or quality control can pass important ancillary information with the data, which arguably is where it belongs and is most useful.

The sounding analyses generated by SUDS as in the SCMS Skew-T and Analysis Archive on the Web are good examples of external documentation, generated automatically by a tool, augmenting instrument data.

Tool Tracking

The thought of SPlus plotting scripts and external analysis tools also implies a need to track tool and algorithm versions. The audit trail of data should always include not just the name of any programs or algorithms in the derivation history but also the particular revision of those programs. For example, programs and scripts should be kept under revision control, and the revision number should be recorded with the data produced by that program. Algorithms and programs change, making the data dependent upon their particular versions. Likewise, errors or problems in the algorithms are easier to trace back from the data when the version which produced the data is known.
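To make the audit trail idea concrete, here is a minimal Python sketch of how a tool revision might be recorded with the data it produces. It is only an illustration under my own assumptions: the class names ToolRevision, DerivationStep, and Dataset, the identity string, and the revision and parameter values are all hypothetical and not part of any existing code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ToolRevision:
    """A program or algorithm together with the exact revision that ran."""
    name: str
    revision: str  # e.g. the revision number from revision control


@dataclass
class DerivationStep:
    """One entry in the audit trail of a dataset."""
    tool: ToolRevision
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Dataset:
    identity: str
    history: List[DerivationStep] = field(default_factory=list)

    def produced_by(self, name: str) -> List[str]:
        """Return every revision of the named tool found in the history."""
        return [s.tool.revision for s in self.history if s.tool.name == name]


# Hypothetical example: the revision and parameter values are made up.
winds = Dataset("profiler/winds/qc")
winds.history.append(
    DerivationStep(ToolRevision("qc_mom_ww", "1.4"), {"threshold": "0.5"})
)
print(winds.produced_by("qc_mom_ww"))
```

With the revision stored next to the tool name, a problem found later in one revision of a program can be traced to exactly the data that revision produced.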

Customer Use Cases

You might think that an important role in the use cases would be the customer, the eventual consumer of the datasets and final products and ultimately the motivation for all of our processing. Until talking with Erik I hadn't considered whether or how a customer role should be involved. So I added a Customer role and a use case diagram for customer data requests. (See Figure 1-3.)

Use Cases

Here are the use cases documented so far. These should be a superset of the scenarios mentioned in the IDPF document.

Actors

Here are the actors identified so far, and the roles of each.

Data Manager

The data manager stores and retrieves data, catalogs data from instruments, handles requests for data, performs or oversees quality control processing, and keeps track of data on and off the system.

The following figure depicts a use case diagram from the data manager's perspective.

Technician

The technician focuses on the instruments themselves, mostly concerned with the data for the purpose of improving the performance of the instrument. This includes generating calibration information, often in collaboration with a scientist. A technician performs simple analyses or visualizations of data to assess the performance and operational status of an instrument.

Scientist

A scientist needs to analyze data, both with routine integrated tools and with external, more specialized or more powerful applications. As algorithms are tested, intermediate and result data need to be stored for retrieval and further analysis. The scientist also has interest in the performance of the instrument, including the performance of processing and derivation algorithms. Lastly, a scientist may cooperate in the generation of calibration or quality control parameters.

The use case diagram below highlights the scientist's perspective.

Integrated Tool

Programmers and scientists write programs to operate on the data, so these programs need a convenient, consistent application interface to retrieve, access, and store the data. I'm considering the programs as actors on the system, playing a role called integrated tool. It's debatable whether the real actors are the programmers and scientists, but I think the role warrants inclusion here.

Programmer

The programmer develops applications in a specific instrument or processing domain which need access to data in the system. Sometimes a programmer works closely with a scientist in implementing specific algorithms; sometimes it is the scientist alone who plays the role of programmer.

Customer

The Customer is a meteorologist, researcher, or investigator needing access to data collected by one of our instruments. I do not yet know any details of the data requests: whether customers ever want anything less than an entire dataset, whether they want to interactively select the data they want from us or do their selection with the whole dataset at home, whether they want to combine requests for datasets from multiple projects, how much assistance or advice they expect in selecting or interpreting the dataset, and so on. I forgot to ask these details of Erik. The ATD Software Task Force report has some relevance here, especially the survey results regarding expected or desired data formats.

Note

The Customer role and related use cases need more information.

Use Cases

The following sections describe the use cases, many of which can be found on the main use case diagram below.

Identify Data

This use case encapsulates the real-world concept of distinguishing like data by their derivation history, QC algorithms, instrument source, different versions of the same algorithms, meteorological cases, and any number of other categories which scientists, data managers, and other users may use to separate data. In the real world, the identification often takes the form of different data directories for the actual data files, or perhaps merely different data file names. The profiler data system should encapsulate data location and storage format so that they are transparent to users and applications; thus the real world directory and file paths are given the analysis concept "data identification". In the design this is realized as a "data path".

The semantics of data identification should not be imposed by the system. The system merely needs to offer a flexible and descriptive means of associating data with an identity whose semantics are completely up to the user.
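As a rough illustration of an identity whose semantics belong to the user, the Python sketch below builds a "data path" from arbitrary labels. The class DataIdentity, its methods, and the label names and values are assumptions of mine for the sake of the example; the system design has not settled on any of them.

```python
from typing import Dict


class DataIdentity:
    """An identity built from user-chosen labels; the system assigns no meaning."""

    def __init__(self, **labels: str) -> None:
        self._labels: Dict[str, str] = dict(labels)

    def path(self) -> str:
        """Render the labels as a 'data path', sorted by key for stability."""
        return "/".join(f"{k}={v}" for k, v in sorted(self._labels.items()))

    def matches(self, **wanted: str) -> bool:
        """True when every requested label is present with the same value."""
        return all(self._labels.get(k) == v for k, v in wanted.items())


# Hypothetical labels chosen by a user, not imposed by the system.
ident = DataIdentity(instrument="profiler", qc="weber-wuertz", version="2")
print(ident.path())
print(ident.matches(qc="weber-wuertz"))
```

The point of the sketch is that the system only stores and compares labels; which labels exist and what they mean is left entirely to the user.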

Store Data

Actors: Data Manager, Scientist, Integrated Tool

The data must be stored; it must be persistent and accessible after storage until explicitly removed by a Data Manager. Ideally users do not need to worry about the internal format or location, as the data will be distinguishable and selectable by an identification assigned by and meaningful to the user.

Access Data

Uses: Browse (or Query) Data

This is the generic function of accessing data. Once data have been browsed or selected, they must be retrievable from the system. I'm not sure this should be a real use case, except that it seems useful to identify it at least as an internal use case. It is natural to think of the system as accessing data, and the more real-world use cases need to make use of that use case, whether for an integrated tool or the export of data.

Remove Data

Remove data from the system. This might be complicated by the dependency history kept in derived data, in that removals should first allow some check for data which have been derived from the data selected for removal.
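The dependency check before removal might look something like the following Python sketch. The Catalog class, the DependentDataError exception, and the identities "spectra" and "moments" are hypothetical names used only for illustration, and refusing the removal outright is just one possible policy.

```python
from typing import Dict, Iterable, List, Set


class DependentDataError(Exception):
    """Raised when data to be removed still has derived data depending on it."""


class Catalog:
    def __init__(self) -> None:
        # Maps each data identity to the identities it was derived from.
        self._parents: Dict[str, Set[str]] = {}

    def store(self, identity: str, derived_from: Iterable[str] = ()) -> None:
        self._parents[identity] = set(derived_from)

    def dependents(self, identity: str) -> List[str]:
        """List the data derived directly from the given identity."""
        return sorted(i for i, p in self._parents.items() if identity in p)

    def remove(self, identity: str) -> None:
        """Refuse the removal while derived data still refer to the identity."""
        found = self.dependents(identity)
        if found:
            raise DependentDataError(f"{identity} has derived data: {found}")
        del self._parents[identity]


catalog = Catalog()
catalog.store("spectra")
catalog.store("moments", derived_from=["spectra"])
print(catalog.dependents("spectra"))
```

Here removing "spectra" would raise DependentDataError until "moments" has been removed first. A real system might instead warn the Data Manager and let the removal proceed.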

Ingest Original Data

Actors: Data Manager

Uses: Store Data

The Data Manager receives data from an ISS, probably from the field for a particular project, but perhaps from a test deployment. The data are stored in files of various formats. In the case of the ISS, the data are in the POP format. The Data Manager runs a program which interprets the file format and stores the data in the system. The data have a known structure, associated with the particular instrument source. Other attributes include the instrument, its deployment location, the identity of the Data Manager ingesting the original data, any notes attached to the data, and optionally an associated project. This use case distinguishes between original data and derived data. Original data enter the system without any history information, as if the data were spontaneous and without predecessors. Derived data in the system have a history and have associations with other data, whether original or themselves derived.

Why the distinction between ingesting original data and storing data if the only difference is the existence of a derivation history? I can think of a couple other differences:

  • Original data, at least in the usual case of data collected from field experiments, will have a common and perhaps mandatory set of attributes which need to be specified to anchor the data tracking and auditing aspect of the system. For example, what good does it do to carry attributes and derivation history through the system if it is based on incomplete information from the start?

  • There is also the aspect of external file formats. The Store Data use case encapsulates the need to store data in an accessible way, regardless of underlying storage format. Before instrument data can be stored, they will need to be translated from some external, probably unrelated, format, which is usually a significant task. More importantly for this analysis, the translation is a significant step to the user.

I'm not certain that the reasons above justify a separate use case for ingesting original data. However, the overriding rationale might be whether most users looking at the system would consider this case a common and identifiable part of their use of the system, and that sounds likely.

Lastly, note that this use case explicitly uses the Store Data use case, hopefully indicating that whatever needs are shared between the use cases, those needs can be shared by delegation to the store data use case.

Another question is how this use case differs from importing data, that is, data exported in an Export Data use case for an external tool, which subsequently need to be re-ingested into the system as derived data. Maybe ingesting original data should extend the use case for importing data.
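The relationship between ingesting original data and storing data can be sketched in Python as below. This is a sketch under stated assumptions: which attributes are mandatory has not been decided, so the REQUIRED_ATTRIBUTES tuple is my own guess, and the Record class, the function names, and the example identity and attribute values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed mandatory attributes; the analysis has not fixed this list.
REQUIRED_ATTRIBUTES = ("instrument", "location", "ingested_by")


@dataclass
class Record:
    identity: str
    attributes: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)  # empty for original data

    @property
    def is_original(self) -> bool:
        return not self.history


def store_data(system: Dict[str, Record], record: Record) -> Record:
    """The shared Store Data step: keep the record under its identity."""
    system[record.identity] = record
    return record


def ingest_original(
    system: Dict[str, Record], identity: str, attributes: Dict[str, str]
) -> Record:
    """Check the mandatory attributes, then delegate to store_data."""
    missing = [k for k in REQUIRED_ATTRIBUTES if not attributes.get(k)]
    if missing:
        raise ValueError(f"missing attributes: {missing}")
    return store_data(system, Record(identity, dict(attributes)))


system: Dict[str, Record] = {}
raw = ingest_original(
    system,
    "iss/pop/raw",  # hypothetical identity
    {"instrument": "profiler", "location": "test site", "ingested_by": "dm"},
)
print(raw.is_original)
```

Ingesting adds only the attribute check (and, in a real system, the file format translation) and then delegates the rest to the store step.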

Export Data

Uses: Access Data

The system needs to be able to export data from the system, such as for the External Analysis and the Generate Distribution use cases.

Internal (or Integrated) Analysis

Actors: Scientist, Programmer

Uses: Store Data

Diagrams: Figure 1-5

The scientist analyzes data in the system, and sometimes needs to compare them with data from outside the system. Thus the possible need for the "Ingest original data" use case. "Internal" analysis is distinguished from external analysis by the software tools used. External analysis uses tools which are not integrated with the profiler data system, and thus any input data for those tools must be exported (see the Export Use Case) in a form the external tool can understand.

Much of the processing required of the profiler data system will comprise analysis modules written to an API supplied by the system, especially for batch processing. (See "Batch processing use case".) Hence the requirement for an API.

This use case involves a scientist or programmer writing a more specialized tool or package, either to integrate an existing analysis tool or program, or to implement an algorithm or processing task which the scientist wants to run on data in the system. The programmer takes advantage of the API to write a tool which only needs to implement the specifics of the processing task and not the general functions of data management supplied by the system. As the scientist needs changes or extensions to the analysis module, the programmer complies. Note that visualization can be considered a form of analysis, but it might also be generalized enough to benefit from some consolidation of requirements. Also note that in some cases the Scientist and Programmer roles will be played by the same person.
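The division of labor between an analysis module and the system API might be sketched in Python as follows. No such API exists yet, so everything here is an assumption: the AnalysisModule and DataSystem classes, the apply method, and the made-up RemoveNegatives module are only meant to show the module implementing the processing task while the system handles retrieval, storage, and history.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class AnalysisModule(ABC):
    """What a programmer writes: only the processing task itself."""

    name = "unnamed"

    @abstractmethod
    def run(self, values: List[float]) -> List[float]:
        """Transform input values into derived values."""


class DataSystem:
    """Stand-in for the data management functions the system would supply."""

    def __init__(self) -> None:
        self._values: Dict[str, List[float]] = {}
        self.history: Dict[str, List[str]] = {}

    def store(
        self, identity: str, values: Iterable[float], history: Iterable[str] = ()
    ) -> None:
        self._values[identity] = list(values)
        self.history[identity] = list(history)

    def apply(self, module: AnalysisModule, source: str, target: str) -> List[float]:
        """Retrieve the source, run the module, store the result with its history."""
        result = module.run(self._values[source])
        self.store(target, result, self.history[source] + [module.name])
        return result


class RemoveNegatives(AnalysisModule):
    """A made-up module used only for illustration."""

    name = "remove_negatives"

    def run(self, values: List[float]) -> List[float]:
        return [v for v in values if v >= 0]


data_system = DataSystem()
data_system.store("raw", [1.0, -2.0, 3.0])
print(data_system.apply(RemoveNegatives(), "raw", "clean"))
print(data_system.history["clean"])
```

The module never touches storage or history directly, which is the kind of separation that would let the same module run interactively or in a batch.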

"In the system" still needs to be defined over the course of the analysis and design.

External Analysis

Actors: Scientist

Extends: Export Data

Select a set of data to be retrieved from the system and written, or published, in some public file format. netCDF is of course a good candidate for the format. There will always be external tools more suitable to various processing and analysis problems. Hence users need to write data into a file format which can be read directly by external tools or at least converted into an acceptable format. The Generate Distribution use case covers the case of a data manager supplying data to other users.

Generate Distribution

Actors: Data Manager

Uses: Export Data

The data manager naturally needs to be able to use the system to generate files in some public data format in response to user requests. So far I'm not including in this use case the need to generate multiple formats, although the ability to develop such integrated tools could be construed in other use cases.

Shared Analysis

Actors: Scientist

Multiple scientist actors sharing the results of some processing as input to their own independent processing.

Assess Instrument

Actors: Technician, Scientist

The actors are technicians or instrument mentors who want to examine an instrument for consistency and correctness both interactively and with batch processing, perhaps in cooperation with a scientist. Data need to be flagged when questionable either by their own evidence or perhaps because of information in the instrument's field log. The assessment of instrument measurements yields calibration and quality control information used by actors in other use cases.

Browse (or Query) Data

Just about every use of the system requires the ability to survey and select sets of data from the system. However the data are identified (see Identify Data) and organized, it must be possible to select data without prior knowledge of the existing organization. This need exists for both users (Data Manager, Scientist, and others) and applications (Integrated Tools), though in the case of tools it is still the user who ultimately needs to be able to select data. The data values themselves do not need to be browseable, only the identity information, whether that be the type, instrument, times, or some other labeling.

From a database perspective, this use case implies the need for queries on the data. From the user perspective, I think browsing is a more intuitive term.
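A small Python sketch may help show the difference between surveying and selecting. It assumes a catalog that holds only identity information; the catalog entries, the label names, and the functions survey and select are all hypothetical.

```python
from typing import Dict, List

# Identity information only; the data values themselves are not consulted.
CATALOG: List[Dict[str, str]] = [
    {"id": "a", "instrument": "profiler", "type": "spectra"},
    {"id": "b", "instrument": "profiler", "type": "winds"},
    {"id": "c", "instrument": "sounding", "type": "winds"},
]


def survey(catalog: List[Dict[str, str]], label: str) -> List[str]:
    """List the distinct values of one label, for browsing without prior knowledge."""
    return sorted({entry[label] for entry in catalog if label in entry})


def select(catalog: List[Dict[str, str]], **criteria: str) -> List[str]:
    """Return the ids of entries whose labels match every criterion."""
    return [
        entry["id"]
        for entry in catalog
        if all(entry.get(k) == v for k, v in criteria.items())
    ]


print(survey(CATALOG, "instrument"))
print(select(CATALOG, type="winds"))
```

The survey step is what lets a user discover how the data happen to be organized; the select step is the query a database person would recognize.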

Analysis

Diagrams: Figure 1-6

This is a base use case for Internal Analysis, Shared Analysis, External Analysis, Visualization, and Data Editing.

Analysis simply comprises all the similar use cases which involve accessing the data, using and manipulating it, and then possibly storing derived data or edited data back into the system. Much of the use of the system follows this pattern. I'm not sure if this should be an explicit use case, but for now it seems sufficient to me rather than individually distinguishing the use cases which extend this one.

QC Processing

Actors: Data Manager

Extends: Analysis

Uses: Access Parameters

Diagrams: Figure 1-8, Figure 1-7

A data manager, possibly along with other data managers and technicians, must generate quality control information, repair or reject errors in the data, derive common or requested data measurements from raw field measurements, and ultimately generate a final dataset or final product suitable for release to researchers. This seems to be a more specific instance of the activities in the Analysis use cases. The processing follows a specific, predefined series of analyses, and it must be easily or automatically repeatable on all new data which arrive from the field. The repeatability implies the batch or background processing requirement. The processing may also include the generation of intermediate result data or "working products". The data manager needs ways to check and verify the intermediate products and the satisfactory progress of batch processes while they run.
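The repeatable, predefined series of analyses could be sketched in Python as a list of named steps applied in order to each new batch. This is only a sketch: the run_pipeline function, the step names, and the trivial lambda steps are hypothetical stand-ins for real QC and derivation algorithms.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[List[float]], List[float]]]


def run_pipeline(steps: List[Step], raw: List[float]) -> Dict[str, List[float]]:
    """Apply the predefined steps in order, keeping every intermediate result."""
    products: Dict[str, List[float]] = {}
    current = raw
    for name, step in steps:
        current = step(current)
        products[name] = current  # a "working product" the data manager can inspect
    return products


# Hypothetical steps standing in for real QC and derivation algorithms.
PIPELINE: List[Step] = [
    ("reject_outliers", lambda xs: [x for x in xs if abs(x) < 100.0]),
    ("scale", lambda xs: [x * 0.5 for x in xs]),
]

# The same pipeline is repeated on each new batch arriving from the field.
batches = {"day1": [2.0, 500.0, 4.0], "day2": [6.0]}
results = {name: run_pipeline(PIPELINE, raw) for name, raw in batches.items()}
print(results["day1"])
```

Keeping the output of every step, not just the last, is what would let a data manager check the working products before approving the final product.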

The possibility of multiple actors sharing in the process requires that processing and auditing information be carried with the data, and all involved need to be able to easily view that information. The following use case diagram describes the shared QC processing activity.

Inventory Data

Actors: Data Manager

Uses: Browse (or Query) Data

A user playing the role of data manager needs to browse the data available through the system to find particular datasets, to determine what is in and not in the system, and to verify storage of data. This is a good use case to include reports on system usage and resources, such as disk space, since a data manager will also need that information to maintain the system.

Access Parameters

Extends: Access Data

Accessing parameters, such as options or settings for user preferences or computations, is a slightly different abstraction than accessing generic instrument data, or so I'm thinking at this point. Hence it gets its own use case.

Application Data Storage and Retrieval

In this case an application is the actor, since the application must interact with the system through some sort of API. I suppose the developer writing the application is also an actor, but not in relation to this use case.

Online Help

This would seem to be a useful use case, but it seems vague. It involves all the actors, since any of them may want information on using the system. This may be a use case which is general at this level, but becomes more specific during the refinement of the analysis, design, and implementation of each individual use case. This use case should include documentation in general, which is always a requirement though not always acknowledged as such.

Observations

Use cases are not relevant only during the elaboration. They can be carried into planning and project scheduling. According to Fowler, elaboration is complete when the use cases have been identified and an individual time estimate for implementation can be comfortably assigned to each use case. As each use case is implemented during a construction iteration, the use case provides the foundation for that iteration's analysis as well as a convenient testing scenario.

Likewise, determining the actors for a system can help identify who needs to be interviewed to verify and elaborate the use cases.

One thing I discovered in the use cases was the idea that tools and actors would need to share not just instrument data but also processing parameters. The parameters may be results of an analysis needed for other processing, or they may be user preferences, or they may be side effects. Regardless, it seems useful to extend the notion of instrument data to include parameters.

Thus the use case discovery also led to the identification of the similarities between accessing data and accessing processing parameters. It is a natural idea and not new, but now it has been explicitly documented and diagrammed: accessing parameters is merely an extension of the usual data access and should use (reuse) those facilities rather than be implemented separately.
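The reuse argued for above can be illustrated with a short Python sketch in which parameter access is layered on ordinary data access instead of being implemented separately. The DataStore and ParameterStore classes, the "parameters" prefix, and the example owner and parameter name are assumptions of mine, not a decided design.

```python
from typing import Any, Dict


class DataStore:
    """Generic data access: store and retrieve anything by identity."""

    def __init__(self) -> None:
        self._items: Dict[str, Any] = {}

    def put(self, identity: str, value: Any) -> None:
        self._items[identity] = value

    def get(self, identity: str, default: Any = None) -> Any:
        return self._items.get(identity, default)


class ParameterStore:
    """Access Parameters reuses data access; it adds only a naming convention."""

    def __init__(self, store: DataStore, prefix: str = "parameters") -> None:
        self._store = store
        self._prefix = prefix

    def _identity(self, owner: str, name: str) -> str:
        return f"{self._prefix}/{owner}/{name}"

    def set(self, owner: str, name: str, value: Any) -> None:
        self._store.put(self._identity(owner, name), value)

    def get(self, owner: str, name: str, default: Any = None) -> Any:
        return self._store.get(self._identity(owner, name), default)


store = DataStore()
params = ParameterStore(store)
params.set("qc", "threshold", 0.5)  # hypothetical parameter
print(params.get("qc", "threshold"))
print(store.get("parameters/qc/threshold"))
```

Because parameters end up as ordinary data under an identity, they could be browsed, shared, and tracked with the same facilities as instrument data.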

In the derive data use case diagram, it occurred to me that external analysis can generate externally derived data, which in turn needs to be stored into the system. This required an Ingest Exported Data use case. It is similar to Ingest Original Data, except it would be useful to keep the derivation history of the external data intact and carry it back into the system, unlike original data which by definition has no derivation history. This also suggests a "check-out and check-in" mechanism for data, similar to revision control. Although I've thought simple version numbering or labeling would be a useful requirement, I think complete revision and conflict control is excessive.

The Remove Data use case did not occur to me until I wrote the brief documentation of the Store Data use case.

I experimented with some use cases and a diagram for extending the domain to real-time, field deployment applications. The attempt at a diagram for an ISS Operator actor appears below.

Conceptual Diagrams

Include class diagrams drawn from conceptual, domain perspective here.

Prototyping

Fowler suggests prototyping the "tricky parts" of the requirements during elaboration. Prototyping can help clear up misunderstandings about the problem domain and identify critical areas for the design. Also, prototyping can help assess the technological risks.

Part of prototyping, and perhaps this relates more to technological risks, should involve the examination of existing solutions from similar domains. Partial or whole solutions, or mistakes to be avoided, can be learned from the other domains. Also, evaluating other applications can suggest opportunities for designing in compatibility with existing solutions early in the process. At the very least, an examination might mean checking to see what other groups are doing within ATD and NCAR. Below is a table of related software I've thought of so far. Anyone have any additions?

Of course, sometimes the prototyping has been done for you. For this project, since we're trying to advance and improve a set of existing tools and procedures, those tools can be useful additions to the domain documentation:

Table 1-2. Existing Profiler Documentation

profiler_to_netcdf    profiler/man/profiler_to_netcdf.html    /net/sssf2/profiler/doc/profiler_to_netcdf.fm
qc_mom_ww             profiler/man/qc_mom_ww.html             /net/sssf2/profiler/doc/qc_mom_ww.fm
spc_to_mom            profiler/man/spc_to_mom.html            /net/sssf2/profiler/doc/spc_to_mom.fm
mom_to_wind_etl       profiler/man/mom_to_wind_etl.html       /net/sssf2/profiler/doc/mom_to_wind_etl.fm
Processing Example    profiler/man/procexample.html           /net/sssf2/profiler/doc/procexample.fm
Technical Note        profiler/man/profproc.html              /net/sssf2/profiler/doc/profproc.fm

Technological Risks

I have a problem first trying to keep up with all of the latest software technology and second with identifying which of those technologies would be the appropriate choice for a particular application, if any. This project is no exception. I have done one small implementation in C++. I've considered prototyping in Python. I've investigated open source CORBA implementations. I've tried to figure out whether Java would work, or whether XML would be useful for metadata. Java might be a good choice for portable user interfaces and applets, but I'm not sure about its suitability for ingesting the POP binary format or number crunching. Same issues for Python. I'm most comfortable with writing C++ and Java.

CORBA appeals to me because it keeps lots of options open. Core parts and those subsystems which require it, such as the ingestor, can be implemented in C++, while lightweight Java interfaces or other utilities can still be written in Java. Likewise, I like the idea of using the Interface Definition Language (IDL) to specify only interface and to allow implementations and clients of those interfaces to be distributed among heterogeneous architectures and written in multiple languages. There is even a tool which automatically produces Web documentation from IDL files, à la javadoc. Another often advertised CORBA advantage is the ease of adapting legacy code. With several data analysis and visualization tools already in use in ATD, tools into which users have already invested time and money, perhaps the data management can be plugged into the back end without deprecating users' familiar environments. For example, ncplot, WINDS, Zebra, AVAPS, and commercial tools like IDL, SPlus, and MATLAB may only need lightweight adaptor interfaces to benefit from data management provided through a CORBA service. Likewise field tools written to a particular data management interface can operate with a simpler server implementation in the field and a larger, more complicated server at home, without being recompiled.

Of course, there are alternative technologies to CORBA, such as DCOM. And Java and RMI would work as long as all the parts of the system could be written in Java.

The risks are in choosing the wrong technology, or in choosing technologies which are too difficult to apply (which would be related to the skills risks I suppose).

Note

I would like to try a small sample implementation using CORBA and Java. The results will be added to this document.

Political Risks

One political risk might be the reusability issue. It would be a shame to risk wider involvement and greater time commitments to the design process without realizing any benefit from reuse. The failure of such a project might also deter future attempts at collaboration.

Future trends in SSSF and ATD will also affect this project. The interview with Erik raised the question of how committed we are to data quality for external customers, or at least what level of quality we will decide or find feasible to support. As that level rises and falls, the system design may vary between insufficient and overkill.

Just as an observation, sometimes tools succeed or fail not by how useful or well-suited they are but by whether a few key people accept them. For a critical project, I suppose (unfortunately) it's useful to actually identify those people and actively seek their endorsement.

Skills Risks

With regard to skills risks, the book makes an interesting note about the value of mentoring. I think this subject has been discussed before in ATD with little outcome. One of the benefits of increased communication among programmers would be the knowledge of whom to seek out for help in a particular subject. I think mentoring relationships definitely exist, and in both directions for different disciplines, but I need to take some initiative