Run II Physics Analysis Software Functional Requirements

May 18, 1998

Eileen Berman, Iain Bertram, Pushpa Bhat, Frank Chlebana, Mark Edel, Sarah Eno, Irwin Gaines, Herb Greenlee, Paul Lebrun, Qizhong Li, Lee Lueking, Kaori Maeshima, John Marraffino, Pasha Murat, Larry Nodulman, Dane Skow, Kris Sliwa, Steve Vejcik, Avi Yagil, Andrzej Zieminski

This document describes the functional requirements for Physics Analysis Software for Run II. In general, these requirements should cover all software needed to access, analyze, and present data, in reports and publications, at the volumes that will need to be handled in Run II. The requirements are organized into several categories representing the major functions of access, analysis, and presentation, with a final category dealing with usability issues. Requirements containing the word "must" are mandatory; failure to meet these would disqualify a product from consideration. Requirements containing the word "should" are desirable; all other things being equal, a product satisfying more of the desirable requirements would be preferred over one satisfying fewer. Each section of requirements is preceded by descriptive text giving some background justifying the specific needs.

DATA ACCESS

The data access capabilities must allow data in a variety of formats to be retrieved for subsequent analysis in online, offline interactive, and offline batch environments. The rate of access must support common online and offline uses. Events must be able to be accessed both serially and randomly, and data must be accessible in chunk sizes smaller than entire events.

It is unrealistic to expect all experiments to use a common data format (or even for a single experiment to use the same format for all stages of analysis). Data formats must support different optimizations based on different access patterns. These considerations lead to the requirements on input of foreign data formats and creation of specialized output formats.

Detailed Requirements:

  1. Access rates (online): The tool must be able to be used in an online environment where data is being accessed in real time.
  2. Access rates (offline): The tool must be able to access very large (at least several TB) data sets. It should be able to combine results from accessing several different data streams.
  3. Serial vs random access: The analysis tool must be able to efficiently read a serial stream of data at at least 90% of the bandwidth provided by the storage media on which the data resides (i.e., the analysis tool should not impose any significant additional overhead for serial reads). The tool must also allow random access to individual events within a larger event stream without undue overheads. The tool must support reading data from (and writing data to) all of the various devices in a mass storage system hierarchy.
  4. Granularity of access: The tool must provide mechanisms for reading only a portion of an event without using up I/O bandwidth for the unread portions of the event. This may require the data to be reformatted into a specialized optimized format with some pre-knowledge of the granularity that the physicists will request.
  5. Foreign Input and Output Formats: The analysis tool must provide a hook for user supplied conversion routines to read foreign data formats (a sketch of such a hook follows this list). Similarly, there must be a user hook allowing foreign output formats.
  6. Specialized output formats: The tool must allow data to be read in one format and written out in another. It is highly desirable for the tool to provide certain specialized formats that optimize data access bandwidth based on expected access patterns.
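
To make the foreign-format hook of requirement 5 above more concrete, the following C++ sketch shows one way a user supplied conversion routine might plug into the tool. It is purely illustrative: the class and method names (ForeignFormatReader, EventRecord, etc.) are assumptions made for this sketch and do not refer to any existing product's API.

    // Hypothetical interface sketch only: the names below are illustrative
    // assumptions, not part of any existing analysis product.
    #include <memory>
    #include <string>
    #include <vector>

    // A minimal stand-in for the tool's internal event representation.
    struct EventRecord {
        long eventNumber = 0;
        std::vector<double> trackPt;   // one possible "chunk" of the event
    };

    // Requirement 5: the tool would expose an abstract reader interface that
    // a user supplied module implements in order to import a foreign format.
    class ForeignFormatReader {
    public:
        virtual ~ForeignFormatReader() = default;
        virtual bool open(const std::string& fileName) = 0;
        // Return the next event converted to the tool's internal
        // representation, or a null pointer at end of stream.
        virtual std::unique_ptr<EventRecord> nextEvent() = 0;
    };

    // A user supplied conversion routine for some experiment-specific format.
    class MyRawFormatReader : public ForeignFormatReader {
    public:
        bool open(const std::string& fileName) override {
            // ... open the foreign file and read its header ...
            return true;
        }
        std::unique_ptr<EventRecord> nextEvent() override {
            // ... unpack the next foreign record into an EventRecord ...
            return nullptr;  // end of stream in this stub
        }
    };

An analogous writer interface would serve requirement 6, letting the same converted events be written back out in a specialized, access-optimized format.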

DATA ANALYSIS

Data analysis consists of the related processes of selecting samples of events; performing analysis on these samples by calculating various mathematical functions from the data in the selected events; allowing interactive variation both in the selection criteria and in the calculations performed; preserving samples of events in specialized (optimized) formats for later re-analysis; and preserving the functions and selection criteria themselves.

One important tool for this analysis is a scripting language which allows the physicist to specify both the selection criteria and the mathematical operations to be applied to the data, and to control the overall analysis, plotting, and presentation environment. This scripting language must therefore combine some of the functionality of programming languages with that of command line or menu driven control interfaces.

However, the basic requirement is that the analysis tool provide a rich interactive environment that supports easy control of data access and analysis description as well as interactive development of physics algorithms, with some level of compatibility with offline code, so that algorithms developed with the analysis tool are usable offline and offline code can be incorporated in the analysis. It is felt that the most effective paradigm to meet this requirement is to require the analysis tool to support linking with external high level language (HLL) routines. The scripting language then does not need to be identical to any particular high level language (or subset of a language) as long as it allows basic data access, commands, simple evaluations, flow control and looping, and, most importantly, invocation of precompiled or dynamically linked high level language procedures. It is also important for the scripting language to support the offline object model for data. There is no requirement, however, for COMIS-like interactive functionality as long as the scripting language supports links to HLL routines.
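
As an illustration of the dynamic-linking paradigm described above, the following C++ sketch shows how an analysis tool running on a UNIX system could invoke a precompiled user routine through the standard dlopen/dlsym mechanism. The routine name user_selection and the library path are hypothetical, chosen only for the example.

    // Illustrative sketch of invoking a precompiled, dynamically linked user
    // routine from an analysis tool's scripting layer on a UNIX system.
    // The routine name "user_selection" and the library path are hypothetical.
    // Build with, e.g.: g++ main.cpp -ldl
    #include <dlfcn.h>
    #include <cstdio>

    // The compiled user routine would be built into a shared library with an
    // extern "C" entry point so that its symbol name is predictable:
    //
    //   extern "C" int user_selection(const double* eventData, int nWords);

    using SelectionFn = int (*)(const double*, int);

    int main() {
        void* handle = dlopen("./libuseranalysis.so", RTLD_NOW);
        if (!handle) {
            std::fprintf(stderr, "cannot load user library: %s\n", dlerror());
            return 1;
        }
        auto select = reinterpret_cast<SelectionFn>(dlsym(handle, "user_selection"));
        if (!select) {
            std::fprintf(stderr, "symbol not found: %s\n", dlerror());
            dlclose(handle);
            return 1;
        }
        double eventData[] = {42.0, 3.14};       // stand-in event words
        int accepted = select(eventData, 2);     // call the user routine
        std::printf("user_selection returned %d\n", accepted);
        dlclose(handle);
        return 0;
    }

The same mechanism allows the scripting layer to remain small while arbitrary offline-compatible algorithms are developed, compiled, and loaded without rebuilding the tool itself.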

It might be argued that portability and ease of use (and learning) considerations would suggest that the scripting language be identical to some existing HLL. However, it is felt that dynamic linking is a better way to support portability and offline compatibility. Even if the scripting language shares its syntax with some HLL, it will need many new commands to support data plotting and presentation that are not native to the HLL anyway. Moreover, the interactive scripting language will never be totally identical to the HLL on which it might be based, introducing new bugs and limiting portability. It was therefore concluded that there is no requirement for the scripting language to be derived from some HLL, although it is recognized that, when used carefully, such a scripting language can have certain advantages.

Detailed Requirements:

    Scripting Language:

  1. The analysis tool must include a full featured scripting language, as commonly understood, and as outlined below.
  2. The scripting language must have some understanding of events as objects, as opposed to some simpler structure, such as arrays of numbers. The analysis tool's object model should be compatible with standard object oriented programming languages, such as C++. Note that PAW's columnwise ntuple event model does not really meet this requirement.
  3. The scripting language must be able to extract data (as built-in data types or sub objects) from event objects for histogramming, printing, or other processing.
  4. The scripting language must be able to express complex mathematical expressions using event data.
  5. The scripting language should have debugging facilities.
  6. It must be possible to interface the scripting language to dynamically linked compiled high level languages, such as C, C++, or Fortran.
    User Control:

  7. The scripting language must support all control functions necessary to specify data access and selection, sequence of operations, screen layout and plotting, fitting, etc.
  8. Mathematical operations must be able to be interleaved with user stipulated sequences of control messages to the analysis package.
  9. Results of preliminary, intermediate and final stages of analysis must be available to users at relevant times and in appropriate storage formats.
  10. The scripting language must support command line recall and interactive command line editing.
    Data Selection:

  11. It must be possible to make decisions and to program selection criteria based on event data, using data extracted by any of the above methods, so that only selected events are histogrammed, output, or subjected to some kind of further processing (a sketch of such a selection follows this list).
  12. The analysis tool should be able to display selection criteria as text (on histograms or for printed output, etc.).
    Input/Output:

  13. The analysis tool should support its own object I/O format.
  14. The analysis tool should include libraries that allow its own format object files to be read or written from compiled programs.
  15. The analysis tool must be able to read or write object files in foreign formats using (user supplied) external modules.
  16. The scripting language must be able to write selected event objects to one or more output streams based on arbitrary selection criteria.
  17. The analysis tool should provide an object definition language and/or be able to define new object formats programmatically.
  18. It follows from the previous criteria that it must be possible, within the scripting language, to read events in one format, convert them, and write them out in a different format.
  19. The analysis tool should support "virtual streaming," meaning that it can tag a set of selected events and read them back without physically writing them to a separate output stream.
    Numeric and Mathematical Functionality:

  20. The analysis package must include accurate and precise numerical functionality, including double precision.
  21. Analysis capabilities must be able to be applied to data presented to the front end interface as well as to subsequent renditions of the data (such as binned histograms).
  22. Functions operating on multiple data sets (such as K-S tests of multiple histograms) must be included.
  23. Mathematical operations must include the ability to fit data, parameterize data, and calculate statistical quantities from data using accessible and supported libraries or repositories of functions or programs.
  24. Fitting procedures must allow user control of fitting algorithms.

    Offline Compatibility:

  25. The package must allow users to tailor the sequence of mathematical operations which will define an analysis on a set of data. Mathematical operations include both functional operations on data as well as fits to data. The source code used for the mathematical operations should be available to users.
  26. Users must be able to include external software in their analysis. Such software must be accommodated whether written in C++ or Fortran (or other approved high level languages) as either source code or as part of object libraries.
  27. A broad range of the functionality of the analysis package must be able to be linked into user defined C++ or Fortran (or other approved HLL) code.

    Prototyping:

  28. Control and mathematical routines must be able to be developed in ways that allow prototyping of simple versions which can later be expanded upon.
  29. Prototyped sequences should contain as much of the full interface of an arbitrarily complex version as possible. Elements of the interface less important to user operation should be hidden.
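
The following self-contained C++ sketch illustrates the data selection and output-stream requirements above: a selection criterion is applied to a stream of events, the selected events are histogrammed, and the selected sample is written to an output stream. All types, values, and file names are invented for the illustration and do not reflect any particular product's API.

    // Illustrative only: a compiled user module applying a selection
    // criterion, filling a histogram from the selected events, and writing
    // them to an output stream.
    #include <cstdio>
    #include <fstream>
    #include <vector>

    struct Event {
        long   number;
        double missingEt;   // example quantity extracted from the event object
    };

    // A very small fixed-binning histogram, standing in for the tool's own
    // histogramming facilities.
    struct Histogram1D {
        double lo, hi;
        std::vector<long> bins;
        Histogram1D(int n, double lo_, double hi_) : lo(lo_), hi(hi_), bins(n, 0) {}
        void fill(double x) {
            if (x < lo || x >= hi) return;
            int i = static_cast<int>((x - lo) / (hi - lo) * bins.size());
            if (i >= static_cast<int>(bins.size())) return;  // guard rounding
            ++bins[i];
        }
    };

    // Selection criterion: in a real tool this would be expressed in the
    // scripting language or supplied as user code.
    bool select(const Event& e) { return e.missingEt > 25.0; }

    int main() {
        std::vector<Event> input = { {1, 12.5}, {2, 40.2}, {3, 31.0}, {4, 8.7} };
        Histogram1D hMet(10, 0.0, 100.0);
        std::ofstream out("selected_events.txt");   // stand-in output stream
        long nSelected = 0;

        for (const Event& e : input) {
            if (!select(e)) continue;       // only selected events go further
            hMet.fill(e.missingEt);         // histogram the selected sample
            out << e.number << ' ' << e.missingEt << '\n';
            ++nSelected;
        }
        std::printf("selected %ld of %zu events\n", nSelected, input.size());
        return 0;
    }

In an actual analysis tool the selection, the histogram, and the output stream would all be controllable from the scripting language, with the compiled pieces linked in dynamically as described earlier.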

DATA PRESENTATION

The results of data analysis must be able to be viewed interactively and saved in standard formats for presentation to colleagues and for inclusion in informal and formal publications. The analysis software needs to provide interactive tools to modify the various features of graphical presentations (colors, labels, etc.), and once the user is satisfied with the presentation on a computer terminal, the software needs to preserve essentially that exact image.

Detailed Requirements:

  1. Interactive visualization: The analysis tool must provide a rich interactive environment for creating, controlling and displaying histograms, scatter plots, lego plots, and other graphical representations of the data. Functionality should be at least that traditionally provided by products like PAW and Histoscope, including such things as interactive control of the look of the display (colors, labels, etc.), bin size, scales, the ability to overlay fits or other distributions, arrangement on the screen, etc. The configuration of graphical objects must be able to be stored and applied later to the same or other data samples. Graphical objects must be able to be combined, compared, and otherwise processed (e.g., adding or subtracting two histograms).
  2. Presentation quality graphical output: Any of the graphical objects prepared interactively must be able to be preserved in some set of standard representations (PostScript, PDF, GIF, or JPEG) suitable for printing, inclusion in web pages, or e-mailing to collaborators. The user must not have to know in advance that a particular graph will be so preserved, but must be able to decide after having viewed (and modified as desired) the graph.
  3. Formal publication of graphical output: Any of the graphical objects produced interactively must be able to be formatted for inclusion in formal publications. It should be easy to adjust certain parameters of the display (for example, font size of labels) to meet journal publication requirements.

USABILITY

Besides the specific functions described above, the software needs to obey certain rules to ensure it can be widely and effectively used. These include areas such as portability, performance, modularity, robustness, use of standards, etc.

Detailed Requirements:

  1. Batch vs. interactive processing: Analysis tools must be capable of running both interactively and in batch mode. Scripts derived from an interactive session must be able to be passed to a batch job to reproduce the interactive analysis on a larger sample of events.
  2. Sharing data structures: At user option, data (and command) structures of various types must be capable of being made available to others, with some granularity on how widely the permission is granted (for example world-wide access, experiment-wide access, or physics group-wide access). This access must be granted to files of special types of data preserved in an analysis job, to selected samples of standard format data, to analysis macros and selection criteria, and to definitions of graphical output produced by an analysis job.
  3. Shared access by several clients: For online use, data structures (such as histograms) used for display purposes must be capable of being dynamically updated by other running processes. The data structures should be able to be shared among several jobs all having simultaneous read access to the data structure, thus allowing the plots to be viewed by several different users.
  4. Parallel processing (using distinct data streams): The analysis system must be capable of processing large numbers of events efficiently. If a single processor is not capable of providing the required throughput, the system should support simple parallel processing where different servers analyze separate event streams, with the results being automatically combined before presentation.
  5. Debugging and profiling: Good, robust and reliable debuggers are required for code development. Thus, the scripting language should have a debugger. This requirement is not satisfied simply by the scripting language being interactive and executed one line at a time; the debugger must support such functionality as conditional breakpoints. Likewise, profiling is particularly relevant when building large software systems. Seamless integration of the debugger/profiler is highly desirable.
  6. Modularity (user code): The analysis system (or framework) must be able to accommodate user-written modules, so that these modules can be called interactively. These modules may be written in the preferred compiled languages (C, C++, or Fortran) or in the scripting language, and can be executed within the "framework". This capability can be based on dynamic linking, pipes, RPC calls and shared memory access on UNIX systems, or similar access methods. The data structures created in the user code with the compiled language must be accessible while running the interactive scripts, from within the "framework" of the analysis tool. It is also desirable that all user-written methods or functions be accessible in an interactive session.
  7. Modularity (system code): The routines making up the analysis package itself must be capable of being linked into offline batch processes without requiring the entire analysis framework to be included.
  8. Access to source code: Access must be provided to any shareware or freeware software components. Some mechanism for source code access should be provided for commercial components.
  9. Robustness: Lack of robustness falls into two categories: the first being things for which the user is responsible (pilot error), and the second being missing or faulty system resources which the user had the right to expect were present and functioning but which were, in fact, missing or broken. The first class is connected with the user's interaction with the system and suggests that the user interface must pay attention to validating the user's input before acting on it and potentially doing serious damage. Errors of this sort can and should almost always be identified, reported, and perhaps even logged from within the interface. Thus the user interface should also be regarded as a sort of gatekeeper, denying access to the internals of the system unless the action request is properly formed and completely valid within the current context. The second class of exceptions tends to be related to the system's management of its resources. Simply hanging or crashing when, for instance, the event data server is unavailable is not acceptable. The analysis system should exhibit as low a level of such failures as possible.
  10. Web based documentation: The analysis system documentation (including tutorials and examples) must be available on the world wide web.
  11. Use of standards: Where there is an industry standard available, it should be adopted as part of the analysis package even if other HEP labs have not done so. Where there is no acceptable industry standard but some sister lab or major experiment has developed a tool that survives critical inspection, it could be adopted.
  12. Portability: The selected analysis software must be able to run both on desktop systems and from centrally available servers. It is desirable to be able to move the analysis task to the computers hosting the appropriate data sets. Current platforms of interest are SGI, Linux, Windows NT, Digital Unix, AIX and Solaris, with the versions of the operating systems as specified in the standard Computing Division support standards. The ideal package would support all of these platforms; at a minimum, at least one of Linux and Windows NT, and at least two of SGI IRIX, Digital Unix, IBM AIX and Sun Solaris, must be supported. Demonstrated ability to port the analysis code to new OS versions and platforms is a benefit.
  13. Scalability: The analysis software must be able to scale gracefully from analysis of a handful of input data files (<10 GB) to analyses run over several hundred (if not thousands) of input data files. Any optimizations based on data sets being resident in (real or virtual) memory must be able to be disabled and must not severely degrade tool function for datasets exceeding memory capacity. The software must be configurable to support many tens (~100) of simultaneous users on large central server machines. Machine resources (memory, CPU, network bandwidth) required by the analysis processes should be well managed and well suited to the likely configurations of central servers and desktops. It is highly desirable that there be simple facilities for running analysis jobs in parallel and then combining the individual results in a statistically correct manner (a sketch of such a combination follows this list).
  14. Performance: The analysis software must be able to do simple presentations (e.g., 2D histograms of files with an event mask) at disk speed (at least 3 MB/s input, and faster on higher performance systems). Plot manipulations and presentation changes of defined histograms must be rapid and introduce no noticeable delays for the user. Performance penalties for user supplied code (e.g., routines from reconstruction code) must not be more than a factor of 2 over native (unoptimized) compiled code run standalone.
  15. User Friendliness: Learning to use the software to the level of reading in a file of number pairs and plotting the result should not take a competent physicist more than 4 hours. Evaluators should be able to become proficient to the level of defining an input stream, performing a moderate selection/analysis procedure including user supplied code, and producing a result suitable for presentation within 2 weeks. Manuals should be lucid, complete, affordable and available. The software's presentation and interface should be common to all supported platforms, and data and macro-like recipes must be easily exchangeable between all platforms. Support for detailed questions about internal operations of the software on data, numerical methods, API formats and requirements, and output formatting (both data and plots) must be available, preferably directly to the users, but at least to a moderate (~10) number of "experts" from each experiment. The software must be configurable to remember users' preferences and customizations, and should allow for multiple levels of customization (e.g., user, working group, collaboration, Lab) for local definitions (e.g., printers) and enhancements.
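
As an illustration of the statistically correct combination of parallel results referred to in the parallel processing and scalability requirements above, the following C++ sketch merges histograms produced by independent jobs by summing bin contents and per-bin variances, so that bin errors combine in quadrature. The types and values are invented for the sketch and do not correspond to any particular product.

    // Illustrative sketch: combining histograms produced by independent
    // parallel analysis jobs.  For unweighted or weighted fills, bin contents
    // add and the per-bin variances (sums of squared weights) add, so bin
    // errors combine in quadrature.
    #include <cmath>
    #include <cstdio>
    #include <stdexcept>
    #include <vector>

    struct Histogram1D {
        std::vector<double> content;   // per-bin sum of weights
        std::vector<double> variance;  // per-bin sum of squared weights
    };

    Histogram1D combine(const std::vector<Histogram1D>& parts) {
        if (parts.empty()) throw std::invalid_argument("nothing to combine");
        Histogram1D total = parts.front();
        for (std::size_t p = 1; p < parts.size(); ++p) {
            if (parts[p].content.size() != total.content.size())
                throw std::invalid_argument("incompatible binning");
            for (std::size_t i = 0; i < total.content.size(); ++i) {
                total.content[i]  += parts[p].content[i];
                total.variance[i] += parts[p].variance[i];
            }
        }
        return total;
    }

    int main() {
        // Two jobs, each having processed a disjoint event stream; for
        // unweighted fills the per-bin variance equals the bin count.
        Histogram1D job1{{10, 20, 5}, {10, 20, 5}};
        Histogram1D job2{{12, 18, 7}, {12, 18, 7}};
        Histogram1D sum = combine({job1, job2});
        for (std::size_t i = 0; i < sum.content.size(); ++i)
            std::printf("bin %zu: %.0f +- %.1f\n",
                        i, sum.content[i], std::sqrt(sum.variance[i]));
        return 0;
    }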