The ZOOM ErrorLogger Package:

Frameworker's Basics
of Setting up and Using Error Loggers

More esoteric details may be found in Further Options for the Frameworker


How the frameworker sets up for logging

The ErrorLog objects that modules set up and that users see require that an ELadministrator exist. This encapsulates all the framework control over how logging is done; in particular, the ELadministrator holds the information as to what destinations should be used, and the way to get context information when needed.

The framework, outside the context of any ordinary module, should get an instance by invoking the ELadministrator::instance() method. It will then use methods of this to establish how to get context information, and to attach various ELdestination sinks.

So first the framework instantiates the ELadministrator:

  #include "ErrorLogger/ELadministrator.h"
  ZM_USING_NAMESPACE( zmel ) /* using zmel; */

  ELadministrator * logger = ELadministrator::instance();

Next the framework provides the ELcontextSupplier, which the logger will use when messages are logged to find out context such as run and event numbers. This is done by creating a class derived from ELcontextSupplier, and overriding the pure virtual methods such as context(). An instance of this class is passed to logger.setContextSupplier().
  MyContextObj contx;  // See below for how the derived context supplier
                       // looks.
  logger->setContextSupplier (contx);
Providing a context supplier is optional; it provides a way for the logger to get at the run and event numbers when an error is logged. If no supplier is provided, no run/event information will be associated with messages. Further details are given below in ELcontextSupplier: How the run and event numbers are indicated.

The frameworker may also declare the name of the overall process (or job, or node in a farm situation). When an error is logged, this name will go into the message. This could be used to distinguish among many cooperating nodes which share the same output for destinations.

  logger->setProcess( "PCfarm81" );

Having instantiated logger and told it how to get context information, the framework must attach one or more destinations -- sinks for the messages and statistics to be sent to. An error logger with no sinks is probably useless.

The logged information can go to each destination attached to the logger. But not every message will be acted on by every destination; each is subject to thresholds and limits, as discussed below in Basics of Thresholds and Limits, and possibly by module filtering and/or debugging verbosity level.

Three classes derived from ELdestination are provided by the ErrorLogger package:

The usual framework will use one or more ELoutput sinks, an ELstatistics, and perhaps an ELsaveBuffer. To attach a destination, the frameworker must instantiate it, and attach the destination to the ELadministrator. Attaching the destination returns an ELdestControl object, which is a handle for controlling the behavior of that destination. If no destination is attached to the ELadministrator then one is attached automatically to "cout" when the first error is logged.

Thus, instantiating a destination and attaching it to the logger leaves the framework with an ELdestControl to use to affect that destination.

  ELadministrator logger = ELadministrator::instance();
  ELdestControl logfile;
  logfile = logger->attach(ELoutput ( "filename.log" ));
Later, the ELdestControl may be used to control the behavior of this destination, as in
  logfile.setLimit("*", 20);
Sometimes it may be desirable to modify the behavior of a destination from some other part of the framework or user code, where the original ELdestControl may no longer be available. To anticipate that, the attach() method has a signature taking an extra arguement: a string identifying the ELdestination.
  ELadministrator logger = ELadministrator::instance();
  ELdestControl logfile;
  logfile = logger->attach(ELoutput ( "filename.log", "myIdString" ) );
A description of how one recovers a handle to logfile ginve the string "myIdString" appears in
Further Options for the Frameworker.

For completeness, there is a way to wipe a destination off the list of those the logger will dispatch to:

  logger->detach ( logfile );
It is recommended that ELstatistics be attached after the ordinary destinations, so the statistics table can accurately reflect whether any destination reflected to a given message.

How the Module (or Package) Sets Up errlog

The class for each module should allow its methods to log errors by containing a variable object of the type ErrorLog. This should be done by giving the base class Module, which all modules classes derive from, an instance of ErrorLog. The name of this object should be known to the various physicists coding parts of the module. We assume that this variable will be named "errlog." This should be placed in the protected section, not private, so that methods of the derived classes can make use of errlog.

The ErrorLog instance will contain a small amount of module-specific information, but most work is actually done by the ELadministrator set up by the framework. For instance, the ELadministrator will hold the list of destinations for messages sent to the log. Since there is inherently only one instance of ELadministrator, the many instances of ErrorLog will send to destinations in a coherent and unified way.

After instantiating errlog, the Module class should provide a module name, to use when methods of this class issue an error message to the log. Thus in the Module base class, the lines setting up the ErrorLog will look something like:

  class Module {
  public:
    virtual string getModuleName(); 
    Module(string name) {
      errlog.setModule(name);
    };
  // ...
  protected:
  //...
    Errorlog errlog;
  }
(An additional routine, setPackage(name), is identical to setModule(name).)

If this scheme is used, the classes derived from Module, which contain the physicists' code, need do nothing further to be able to issue messages to errlog.

A simpler scheme would be to have a glogbal ErrorLog named errlog, and then everybody could log through that same errlog. In that cse, however, all the messages would be assigned the same module name (unless users were to explicitly call errlog.setModule() before logging a message).


ELcontextSupplier: How the run and event numbers are indicated

The run and event are printed in the text of every error message. However, the framework does not have to call some setEvent() method every time a new event is started. In order to avoid overhead for every event (when errors might be logged in only a tiny fraction of those events), this information is not pre-supplied by the framework.

Instead, when an error is being logged, the logger asks for the run and event names at that point. It asks by invoking the context() function of the ELcontextSupplier object set up for the logger.

The frameworker should create a class derived from ELcontextSupplier and pass an instance of that class to the constructor of the ELadministrator, as shown above ("How the frameworker sets up for logging").

ELcontextSupplier defines the following simple interface:

  class ELcontextSupplier {
  public:
    virtual ELstring context () = 0; 
    virtual ELstring summaryContext () = 0; 
    virtual ELstring fullContext () = 0; 
    void editErrorObj(ErrorObj & msg);
  };
As seen from this, the class derived from ELcontextSupplier must define the three virtual methods returning ELstring, though some may be identical to others. Although destinations are free to call for and use whichever context string they wish, the intent of the three forms is as follows: The editErrorObj() routine will be called when the message is started. It provides a hook to modify the module or subroutine information, the id, or any part of the message other than the actual text items (which would not have been established yet). This method is optional; the editErrorObj() method of the base ELcontextSupplierI class is an inline method doing nothing.

The context supplier is spcified via the setContextSupplier() method of ELadministrator. If this method is not invoked, no context information will be associated with error messsages.


Basics of Thresholds and Limits

Some nomenclature: Thresholds and limits both apply to particular destinations. One can set a limit or a threshold for each individual destination. (using its ELdestControl). To do this, you use methods of the associated ELdestControd, which we will call "dest" for these examples. The effects are as follows:
dest.setThreshold (severity);
Supress logging or acting on messages below this severity, for this destination.
dest.setLimit (id, n);
dest.setLimit (severity, n);
dest.setLimit ("*", n);
For this destination, don't log past n instances of any given exception id matching the specified type.
In case one has established two or more applicable limits, the limit used is the most specific applicable case: Specified ID before specific severity level, and both before wildcard "*".

As implied by the above chart, each ELdestination owns a threshold level (if no threshold is explicitly set, the threshold is ELzeroSeverity). Each ELdestination also owns a general limits table (indexed by severity level) and a limits table (indexed by message id). Each limits table entry contains as data the limit and the count.

ELstatistics is a particular destination. That means that filtering by severity is set up via setThreshold. In the case of an ELstatistics destination, however, setLimit has no effect: Once an error message id gets into that table, there is absolutely no cost to incrementing its count, so a limit would be useless.

The Sample Framework illustration shows how limits and thresholds are set up.

Although a limit conceptually may or may not have been set, and if no limit whatsoever applies, no messages will be throttled out, the threshold severity level always has a value. By default, in each individual destination, this threshold starts as ELzeroSeverity.

For convenience, the ELadministrator also has setThresholds() and setLimits() methods. The meaning of those is to invoke the corresponding method for every destination currently attached to the logger. (That might change some limits previously established.)

Finally, two features slightly soften the concept of throttling via a limit. The first is that a count is past its limit, action does not cease completely: If the applicable limit for a given error would be L, then if there is an excess E = count-L, instances with E/L = 2**N for any non-negative integer N, will be logged. Thus if the limit for a specific message is 5, the loggger would contain numbers 1, 2, 3, 4, 5, 10, 15, 25, 45, 85, 165, and so forth.

The second concept applies when a large bunch of errors of some type is followed by a long time span with no such errors. At that point the count toward the limit may automatically be reset, so that a new cluster of these errors can again be logged. This is controled by

  dest.setTimespan ( id, seconds );
where the id works the same way and uses the same wildcards as for limits.


Filtering by Module or Multiple Modules

You can set up a destination (ELouput, ELcollected, ELerrorList or ELstatistics) to react only to messages coming from one module:
  dest.filterModule ( "Calorimetry" );
The destination controlled by dest will now only react to error messages logged from a module whose name has been set to setModule("Calorimetry"). (Note that the comparison is case-sensitive.)

Other filtering behaviors, that is, limits and thresholds, will still apply to this destination, acting on those error messages which survive the module test.

Alternatively, you can set up a destination to exclude all messages coming from one module:

  dest2.excludeModule ( "MuonIdent" );
The destination controlled by dest2 will now ignore all error messages logged from a module whose name has been set to setModule("MuonIdent") (again the comparison is case-sensitive.)

These routines are a special case of a more geneneral capability:

The model is that there is a `respondToModule' list and an `ignoreModule' list. At any given time, either the `respondToModule' list or the `ignoreModule'`ignoreModule' list (or neither) may be active, but not both.

Then there are two ways to call each of two methods to influence these:

  dest.ignoreModule ("*");
  dest.respondToModule ("name");
  dest.respondToModule ("*");
  dest.ignoreModule ("name");
The filter and exclude routines are expressed in terms of ignoring and responding to modules:

dest.filterModule("a") has the same effect as

  dest.ignoreModule("*"); 
  dest.respondToModule("a"); 
dest.excludeModule("a") has the same effect as
  dest.respondToModule("*"); 
  dest.ignoreModule("a"); 


Rapid Discarding Below a Severity

The frameworker can short-ciruit the formatting, timestamping, and "shopping" the message to each destination that occurs when a low-severity error message is sent:
  errlog.setDiscardThreshold (sev);

This threshold is sent for the particular instance of ErrorLog. It might normally be set to the lowest threshold of any destination; but in special circumstances one could specify that any message sent to this Errorlog with severity less than some higher threshold is to be discarded.

The meaning is that any messages sent through the usual mechanism of

   errlog ( severity, id ) << items << endmsg;
which have severity less than sev, will be discarded with the absolute minimum possible work. The message will not be shopped for each destinations to react to, and each operator<< will simply return, doing no work. The net effect is a speedup of null-message processing by a factor of about 30.

The discardThreshold works like other thresholds in that if a message severity is greater or equal to the threshold, it can be reacted to. The discardThreshold takes precedence over individual destinations' thresholds.

An example of usage, and demonstration of the differnce in speed, is in the file testDiscard.cc.


Establishing Debug Verbosity Levels

The ErrorLogger package supports a separate syntax for issuing information intended for debugging (see "Issuing debuggging information messages" ).

The idea is that this output can easily be disabled at an overall level, set on a per-ErrorLog basis by the framework. The fundamental notion is that of a debug "verbosity" level, which is simply an integer. If a message is given a level which is higher than the cuttoff chosen by the framework for that logger, then that message will be completely (and efficiently) ignored.

  errlog.setDebugVerbosity(3);
The default, if no debug verbosity level is set for an ErrorLog, is a level of 0: All debug messages assigned any positive verbosity level will be ignored by default.

The cuttoff on debugging verbosity level is controllable for each individual ErrorLog. This allows the framework to accomodate cases where verbosity needs to be high for one module but not for others.

By default, every message issued through this debug message mechanism is assigned a severity of ELinfo and a message id of "DEBUG". The framework can alter this for a given ErrorLog, for example:

  errlog.setDebugMessages ( ELwarning, "differentID" );

The mechanism of debug verbosity levels acts very much like that of discard thresholds for the ErrorLog, in that:

One distinction is that since the level can be any integer, the granularity availble is not restricted by the limited number of severity levels.


Controlling abort and exit() behavior

The higher severity levels are used by physicists to indicate that they consider this error so severe as to warrant terminating the job. However, ultimate control of this rests in the hands of the framework. The control is associated with the logger (the ELadministrator) rather than any specific destination.
  logger->setAbortThreshold (ELfatal);
By default, the abort threshold is ELabort.

If the physicist issues a log message with severity as at or above the abort threshold, upon completion of the log message and dispatch of that to the various destinations, the logger will terminate the job by invoking abort(). This will cause a core file (potentially useful for debugging) to be produced. No termination summary is given.

A plausible philosophy is to set the abort threshold at ELhighestSeverity, so that NO user message aborts the job. In that case, we advise periodically checking for the presence of severe errors (see "Checking on errors").

The abort() routine causes a core dump, which normally will be useful to trace down the cause of the error. However, in some cases the source of errors may be known, and production runs may be better off not consuming the disk space taken up by core files. The framework has the option of causing errors of some severity levels to invoke exit (severityLevel) rather than abort. If exit() is invoked, a termination summary is given. Assuming the framework had established an atexit() handler, this would be called as the job exits. The exit() call will not cause a core file to be produced.

  logger->setExitThreshold (ELfatal);
By default, the exit threshold is ELhighestSeverity, that is, unless an exit threshold is specificallyestablished, the automatic exit() behavior will not occur.

The framework can provide flexibility, so that some errors will still cause an abort() while others will cause an exit(). The rules used to decide whether to abort or exit, when encountering an error message of severity s, are as follows:

A typical setup would be:

  logger->setAboutThreshold (ELabort); // not really needed; this is default
  logger->setExitThreshold (ELfatal);
With this setup, messages issued at ELfatal severity will exit(), while messages issued at ELabort severity abort() and core dump.


Checking on errors

Since severity levels above ELnext indicate advice that the flow of processing ought to be modified or ceased, it is advisable (in all but the most time-critical applications) that the framework check for the presence of such errors, between invoking modules or between events. To make this inexpensive, a routine checkSeverity() is provided.
	ErrorSeverity highest = logger->checkSeverity();
	if ( highest >= ELnext ) { do whatever };
This routine provides the severity level of the highest error declared since the last checkSeverity(), or ELzeroSeverity if none have been declared since the last check.


Obtaining error statistics summaries

Assuming the frameworker has attached an instance of the provided ELstatistics destination to the logger, statistics summaries may be obtained. ELstatistics has no visible reaction to being sent an error message, but places information into a table, which can be formatted and output. For the purposes of the illustrations below, we will assume that the logger has attached an ELstatistics destination, with an ELdestControl called "stats."

The summary of error statistics can be obtained as a string. It can also be sent to a destination device (as a series of summary lines); ELoutput places these lines into its associated stream to output the summary.

The summary may be obtained in several ways:

  stats->summary(dest); // Sends the summary to this destination.
  stats->summary(os);   // Sends the whole summary string as a char*, 
                        // to some arbitrary ostream os.
  stats->summary(s&);   // Sends the whole summary string to ELstring s. 
Each of the above forms also has an optional last argument, containing a title to insert into the first line of the summary.
  stats->summary(dest, "summary title");
  stats->summary(os,   "summary title");
  stats->summary(s&,   "summary title");
There is an additonal means of having the summary sent to an ostream: If the constructor of ELstatistics is supplied an ostream argument, then when the job completes the destructor for that ELstatistics will check to see if any information has changed since the last summary request. If so, one final summary will automatically be sent to that ostream. (This is safe assuming the user does not explicitly delete the ostream before the statistics destination goes away.)

To suppress the termination summary altogether:

  stats.noTerminationSummary();

The format of the summary information is that of three parts:

To illustrate how a summary would look:


%ERLOG  ===============  SUMMARY  ===  summary title  ==========================

Process job123a
 type  message id         sev   module         subroutine        count    total
 ---- -------------------- - ----------------- ----------------  -----    -----
    1 Bank Error           E central tracker   refine_candidate     97       97
    2 Bank Error           E central tracker   prepare_sigma         5        5
    3 Bank Error           E muon tracker      prepare_sigma         5        5
    4 Energy overflow      S CALhadron         sanityChecks          1        1
    5 Too many tracks      w track matching    start_seeds          11*      11
    6 Too many stray trax  w CTKtrack          countTracks          32       32
    7 Too many stray trax  E CTKtrack          checkTracks           4        4

* Some occurences off this message were suppressed in all logs, due to limits.

 type   message id   Examples:  run/evt         run/evt          run/evt
 ---- -------------------       -------         -------          -------
    1 Bank Error             14231/54101      14231/54103      14239/59350     
    2 Bank Error             14231/54215      14232/55506      14238/58190     
    3 Bank Error             14231/54200      14232/55606      14239/59195     
    4 Energy overflow        14235/60305    
    5 Too many tracks        14231/54101      14231/54506      14238/59190     
    6 Too many stray trax    14231/54164      14231/54171      14239/59195     
    7 Too many stray trax    14231/54100      14231/54191      14239/60012

 Severity	Number of occurences	  Total
 --------	--------------------	---------
    ..			275984		  512902
  SUCCESS	         21000             42000
  INFO			  2100		    4200
  WARNING		   325		     604
  ERROR			     4		       6
  ERROR!		     1		       1
  SEVERE		     1		       1


Clearing statistics and/or limits

The method to clear the counts in the statistics being kept in the destination whose ELdestControl is "stats" is
  stats.clearSummary();
This clears both the individual statistics (kept by message ID) and the counts for the various severity levels. It would normally be used after having invoked stats.summary() in some form to send the information somewhere.

clearSummary() does not zero the aggregate counts for individual error IDs or for severity levels. It also does not affect any limits set, or wipe the knowledge of the message IDs from the statistics tables. (Thus there would remain a bunch of errors with counts of zero; when statistics are output these do not appear.)

There is a more sweeping method:

  stats.wipe();
This clears everything -- counts and aggregate counts for individual ID's and for severity levels. The statistics is wiped clean, which may be relevant if memory management issues have imposed a limit on the number of message ID entries in the statistics table.

The same routine also wipes out any information in the limits table. This includes values which have been supplied by setLimit() (or setTimespan()), and counts of the individual instances of each message ID. For the ELstatistics destination this is moot (since limits do not apply) but for ELoutput or ELsaveBuffer this may be useful when a maximum number of message ID entries in the limits table has been imposed.

  dest.wipe();
  logger.wipe();     // Applies wipe() to all attached destinations
This clears everything -- counts and aggregate counts for severity levels and for individual ID's, as well as any limits established. (This includes the limits for "*" all messages.) The table is wiped clean, which may be relevant if memory management issues imposes a limit on the number of message ID entries in the statistics table.

Finally, there is a way to zero the counts going toward the limits all IDs in a destination. This does not affact ELstatistics (which does not use limits) and will not compactify the limits table. It also does not affect aggregate counts.

  dest.zero();

ErrorLogger Package Page


Mark Fischler
Last modified: Thu Mar 15 2001