Frameworker's Basics
of Setting up and Using Error Loggers
More esoteric details may be found in Further Options for the Frameworker
The framework, outside the context of any ordinary module, should get an instance by invoking the ELadministrator::instance() method. It will then use methods of this to establish how to get context information, and to attach various ELdestination sinks.
So first the framework instantiates the ELadministrator:
#include "ErrorLogger/ELadministrator.h"
ZM_USING_NAMESPACE( zmel ) /* using zmel; */
ELadministrator * logger = ELadministrator::instance();
Next the framework provides the ELcontextSupplier, which the logger
will use when messages are logged to find out context such as run and event
numbers.
This is done by
creating a class derived from ELcontextSupplier,
and overriding the pure
virtual methods such as context().
An instance of this class is passed to
logger.setContextSupplier().
MyContextObj contx; // See below for how the derived context supplier
// looks.
logger->setContextSupplier (contx);
Providing a context supplier is optional; it provides a way for the logger
to get at the run and event numbers when an error is logged. If no supplier
is provided, no run/event information will be associated with messages.
Further details are given below in
ELcontextSupplier: How the run and event numbers are indicated.
The frameworker may also declare the name of the overall process (or job,
or node in a farm situation). When an error is logged, this name will go
into the message. This could be used to distinguish among many cooperating
nodes which share the same output for destinations.
logger->setProcess( "PCfarm81" );
Having instantiated logger and told it how to get context information, the framework must attach one or more destinations -- sinks for the messages and statistics to be sent to. An error logger with no sinks is probably useless.
The logged information can go to each destination attached to the logger. But not every message will be acted on by every destination; each is subject to thresholds and limits, as discussed below in Basics of Thresholds and Limits, and possibly by module filtering and/or debugging verbosity level.
Three classes derived from ELdestination are provided by the ErrorLogger package:
ELoutput xx(ostream &); // Log to An ostream (e.g. cerr)
ELoutput xx("xfile"); // Log to the file named.
ELstatistics stats (cerr);
Thus, instantiating a destination and attaching it to the logger leaves
the framework with an ELdestControl to use to affect that destination.
ELadministrator logger = ELadministrator::instance();
ELdestControl logfile;
logfile = logger->attach(ELoutput ( "filename.log" ));
Later, the ELdestControl may be used to control the behavior of this
destination, as in
logfile.setLimit("*", 20);
Sometimes it may be desirable to modify the behavior of a destination
from some other part of the framework or user code, where the original
ELdestControl may no longer be available. To anticipate that,
the attach() method has a signature taking an extra arguement:
a string identifying the ELdestination.
ELadministrator logger = ELadministrator::instance();
ELdestControl logfile;
logfile = logger->attach(ELoutput ( "filename.log", "myIdString" ) );
A description of how one recovers a handle to logfile ginve the
string "myIdString" appears in
Further Options for the Frameworker.
For completeness, there is a way to wipe a destination off the list
of those the logger will dispatch to:
logger->detach ( logfile );
It is recommended that ELstatistics be attached after
the ordinary destinations, so the statistics table can accurately
reflect whether any destination reflected to a given message.
The ErrorLog instance will contain a small amount of module-specific information, but most work is actually done by the ELadministrator set up by the framework. For instance, the ELadministrator will hold the list of destinations for messages sent to the log. Since there is inherently only one instance of ELadministrator, the many instances of ErrorLog will send to destinations in a coherent and unified way.
After instantiating errlog, the Module class should provide a module name,
to use when methods of this class issue an error message to the log.
Thus in the Module base class, the lines setting up the ErrorLog will
look something like:
class Module {
public:
virtual string getModuleName();
Module(string name) {
errlog.setModule(name);
};
// ...
protected:
//...
Errorlog errlog;
}
(An additional routine, setPackage(name), is identical to setModule(name).)
If this scheme is used, the classes derived from Module, which contain the physicists' code, need do nothing further to be able to issue messages to errlog.
A simpler scheme would be to have a glogbal ErrorLog named errlog, and then everybody could log through that same errlog. In that cse, however, all the messages would be assigned the same module name (unless users were to explicitly call errlog.setModule() before logging a message).
Instead, when an error is being logged, the logger asks for the run and event names at that point. It asks by invoking the context() function of the ELcontextSupplier object set up for the logger.
The frameworker should create a class derived from ELcontextSupplier and pass an instance of that class to the constructor of the ELadministrator, as shown above ("How the frameworker sets up for logging").
ELcontextSupplier defines the following simple interface:
class ELcontextSupplier {
public:
virtual ELstring context () = 0;
virtual ELstring summaryContext () = 0;
virtual ELstring fullContext () = 0;
void editErrorObj(ErrorObj & msg);
};
As seen from this, the class derived from ELcontextSupplier must
define the three virtual methods returning ELstring, though some may be
identical to others.
Although destinations are free to call for and use whichever context string
they wish, the intent of the three forms is as follows:
The context supplier is spcified via the setContextSupplier() method of ELadministrator. If this method is not invoked, no context information will be associated with error messsages.
As implied by the above chart, each ELdestination owns a threshold level (if no threshold is explicitly set, the threshold is ELzeroSeverity). Each ELdestination also owns a general limits table (indexed by severity level) and a limits table (indexed by message id). Each limits table entry contains as data the limit and the count.
ELstatistics is a particular destination. That means that filtering by severity is set up via setThreshold. In the case of an ELstatistics destination, however, setLimit has no effect: Once an error message id gets into that table, there is absolutely no cost to incrementing its count, so a limit would be useless.
The Sample Framework illustration shows how limits and thresholds are set up.
Although a limit conceptually may or may not have been set, and if no limit whatsoever applies, no messages will be throttled out, the threshold severity level always has a value. By default, in each individual destination, this threshold starts as ELzeroSeverity.
For convenience, the ELadministrator also has setThresholds() and setLimits() methods. The meaning of those is to invoke the corresponding method for every destination currently attached to the logger. (That might change some limits previously established.)
Finally, two features slightly soften the concept of throttling via a limit. The first is that a count is past its limit, action does not cease completely: If the applicable limit for a given error would be L, then if there is an excess E = count-L, instances with E/L = 2**N for any non-negative integer N, will be logged. Thus if the limit for a specific message is 5, the loggger would contain numbers 1, 2, 3, 4, 5, 10, 15, 25, 45, 85, 165, and so forth.
The second concept applies when a large bunch of errors of some type
is followed by a long time span with no such errors. At that point the count
toward the limit may automatically be reset, so that a new cluster of these
errors can again be logged. This is controled by
dest.setTimespan ( id, seconds );
where the id works the same way and uses the same wildcards as for limits.
ELouput,
ELcollected, ELerrorList
or ELstatistics) to react only to messages coming from one module:
dest.filterModule ( "Calorimetry" );The destination controlled by
dest will now only react to
error messages logged from a module whose name has been set to
setModule("Calorimetry"). (Note that the comparison is
case-sensitive.)
Other filtering behaviors, that is, limits and thresholds, will still apply to this destination, acting on those error messages which survive the module test.
Alternatively, you can set up a destination to exclude
all messages coming from one module:
dest2.excludeModule ( "MuonIdent" );
The destination controlled by dest2 will now ignore all
error messages logged from a module whose name has been set to
setModule("MuonIdent") (again the comparison is
case-sensitive.)
These routines are a special case of a more geneneral capability:
The model is that there is a `respondToModule' list and an `ignoreModule' list. At any given time, either the `respondToModule' list or the `ignoreModule'`ignoreModule' list (or neither) may be active, but not both.
Then there are two ways to call each of two methods to influence these:
dest.ignoreModule ("*");
dest.respondToModule ("name");
dest.respondToModule ("*");
dest.ignoreModule ("name");
dest.filterModule("a") has the same effect as
dest.ignoreModule("*");
dest.respondToModule("a");
dest.excludeModule("a") has the same effect as
dest.respondToModule("*");
dest.ignoreModule("a");
errlog.setDiscardThreshold (sev);
This threshold is sent for the particular instance of ErrorLog. It might normally be set to the lowest threshold of any destination; but in special circumstances one could specify that any message sent to this Errorlog with severity less than some higher threshold is to be discarded.
The meaning is that any messages sent through the usual mechanism of
errlog ( severity, id ) << items << endmsg;which have severity less than sev, will be discarded with the absolute minimum possible work. The message will not be shopped for each destinations to react to, and each operator<< will simply return, doing no work. The net effect is a speedup of null-message processing by a factor of about 30.
The discardThreshold works like other thresholds in that if a message severity is greater or equal to the threshold, it can be reacted to. The discardThreshold takes precedence over individual destinations' thresholds.
An example of usage, and demonstration of the differnce in speed, is in the file testDiscard.cc.
The idea is that this output can easily be disabled at an overall
level, set on a per-ErrorLog basis by the framework.
The fundamental notion is that of a debug "verbosity" level, which is simply
an integer.
If a message is given a level which is higher than the cuttoff chosen by the
framework for that logger, then that message will be completely (and
efficiently) ignored.
errlog.setDebugVerbosity(3);
The default, if no debug verbosity level is set for an ErrorLog,
is a level of 0:
All debug messages assigned any positive verbosity level will be ignored by
default.
The cuttoff on debugging verbosity level is controllable for each individual ErrorLog. This allows the framework to accomodate cases where verbosity needs to be high for one module but not for others.
By default, every message issued through this debug message mechanism is
assigned a severity of ELinfo and a message id of "DEBUG".
The framework can alter this for a given ErrorLog, for example:
errlog.setDebugMessages ( ELwarning, "differentID" );
The mechanism of debug verbosity levels acts very much like that of discard thresholds for the ErrorLog, in that:
logger->setAbortThreshold (ELfatal);By default, the abort threshold is ELabort.
If the physicist issues a log message with severity as at or above the abort threshold, upon completion of the log message and dispatch of that to the various destinations, the logger will terminate the job by invoking abort(). This will cause a core file (potentially useful for debugging) to be produced. No termination summary is given.
A plausible philosophy is to set the abort threshold at ELhighestSeverity, so that NO user message aborts the job. In that case, we advise periodically checking for the presence of severe errors (see "Checking on errors").
The abort() routine causes a core dump, which normally will be useful
to trace down the cause of the error. However, in some cases the source
of errors may be known, and production runs may be better off not consuming
the disk space taken up by core files. The framework has the option of
causing errors of some severity levels to invoke exit (severityLevel)
rather than abort. If exit() is invoked, a termination summary is given.
Assuming the framework had established an atexit()
handler, this would be called as the job exits.
The exit() call will not cause a core file to be produced.
logger->setExitThreshold (ELfatal);
By default, the exit threshold is ELhighestSeverity, that is,
unless an exit threshold is specificallyestablished, the automatic exit()
behavior will not occur.
The framework can provide flexibility, so that some errors will still cause an abort() while others will cause an exit(). The rules used to decide whether to abort or exit, when encountering an error message of severity s, are as follows:
A typical setup would be:
logger->setAboutThreshold (ELabort); // not really needed; this is default
logger->setExitThreshold (ELfatal);
With this setup, messages issued at ELfatal severity will
exit(), while messages issued at ELabort severity abort() and core dump.
ErrorSeverity highest = logger->checkSeverity();
if ( highest >= ELnext ) { do whatever };
This routine provides the severity level of the highest error declared
since the last checkSeverity(), or ELzeroSeverity if none have been
declared since the last check.
The summary of error statistics can be obtained as a string. It can also be sent to a destination device (as a series of summary lines); ELoutput places these lines into its associated stream to output the summary.
The summary may be obtained in several ways:
stats->summary(dest); // Sends the summary to this destination.
stats->summary(os); // Sends the whole summary string as a char*,
// to some arbitrary ostream os.
stats->summary(s&); // Sends the whole summary string to ELstring s.
Each of the above forms also has an optional last argument, containing a
title to insert into the first line of the summary.
stats->summary(dest, "summary title");
stats->summary(os, "summary title");
stats->summary(s&, "summary title");
There is an additonal means of having the summary sent to an ostream: If
the constructor of ELstatistics is supplied an ostream argument, then when
the job completes the destructor for that ELstatistics will check to see
if any information has changed since the last summary request. If so, one
final summary will automatically be sent to that ostream. (This is safe
assuming the user does not explicitly delete the ostream before the
statistics destination goes away.)
To suppress the termination summary altogether:
stats.noTerminationSummary();
The format of the summary information is that of three parts:
To illustrate how a summary would look:
%ERLOG =============== SUMMARY === summary title ==========================
Process job123a
type message id sev module subroutine count total
---- -------------------- - ----------------- ---------------- ----- -----
1 Bank Error E central tracker refine_candidate 97 97
2 Bank Error E central tracker prepare_sigma 5 5
3 Bank Error E muon tracker prepare_sigma 5 5
4 Energy overflow S CALhadron sanityChecks 1 1
5 Too many tracks w track matching start_seeds 11* 11
6 Too many stray trax w CTKtrack countTracks 32 32
7 Too many stray trax E CTKtrack checkTracks 4 4
* Some occurences off this message were suppressed in all logs, due to limits.
type message id Examples: run/evt run/evt run/evt
---- ------------------- ------- ------- -------
1 Bank Error 14231/54101 14231/54103 14239/59350
2 Bank Error 14231/54215 14232/55506 14238/58190
3 Bank Error 14231/54200 14232/55606 14239/59195
4 Energy overflow 14235/60305
5 Too many tracks 14231/54101 14231/54506 14238/59190
6 Too many stray trax 14231/54164 14231/54171 14239/59195
7 Too many stray trax 14231/54100 14231/54191 14239/60012
Severity Number of occurences Total
-------- -------------------- ---------
.. 275984 512902
SUCCESS 21000 42000
INFO 2100 4200
WARNING 325 604
ERROR 4 6
ERROR! 1 1
SEVERE 1 1
stats.clearSummary();This clears both the individual statistics (kept by message ID) and the counts for the various severity levels. It would normally be used after having invoked stats.summary() in some form to send the information somewhere.
clearSummary() does not zero the aggregate counts for individual error IDs or for severity levels. It also does not affect any limits set, or wipe the knowledge of the message IDs from the statistics tables. (Thus there would remain a bunch of errors with counts of zero; when statistics are output these do not appear.)
There is a more sweeping method:
stats.wipe();
This clears everything -- counts and aggregate counts for individual ID's
and for severity levels. The statistics is wiped clean, which may be
relevant if memory management issues have imposed a limit on the number of
message ID entries in the statistics table.
The same routine also wipes out any information in the limits table. This
includes values which have been supplied by setLimit() (or setTimespan()),
and counts of the individual instances of each message ID. For the
ELstatistics destination this is moot (since limits do not apply) but for
ELoutput or ELsaveBuffer this may be useful when a maximum number of
message ID entries in the limits table has been imposed.
dest.wipe();
logger.wipe(); // Applies wipe() to all attached destinations
This clears everything -- counts and aggregate counts for severity levels
and for individual ID's, as well as any limits established. (This includes
the limits for "*" all messages.) The table is wiped clean, which may be
relevant if memory management issues imposes a limit on the number of
message ID entries in the statistics table.
Finally, there is a way to zero the counts going toward the limits all
IDs in a destination. This does not affact ELstatistics (which does not
use limits) and will not compactify the limits table.
It also does not affect aggregate counts.
dest.zero();