Recovering Root Files From Crashed HepTuple Jobs

John Marraffino
9 November 2001

From time to time, a job using the HepRootFileManager part of HepTuple will crash for some reason or other, leaving the output file in an unrecoverable state. While it is not possible to recover from every conceivable abnormal job end, some things can be done to allow at least partial recovery of the output file even after a resounding crash. Accomplishing that requires some collaboration and judgment on the part of the user. This note explains what can and, perhaps more importantly, cannot be done, and what the user needs to know and do to make it all happen.

There is a member function write() in the HepFileManager class that writes all of HepTuple's in-memory objects out to the file named in the file manager constructor call. Earlier versions of the write() function did not write out quite enough auxiliary information for a successful recovery, especially for objects in subdirectories. As of the date of this note, that omission has been corrected. With this change, the strategy for recovering Root output files is as follows.

First, select some quantity that measures output volume and that you can test from within your running job: elapsed time, output file size, or whatever you like. For my tests, I used the number of "events" processed. Depending on your confidence in the robustness of your job, choose an interval in that quantity. Every time your criterion is satisfied, issue a write() and then go on about your business. Should the job crash, everything up to the most recent write() should be recoverable, where "recoverable" is used in a sense that requires some explanation of what it means and what limitations apply.

Understand that, after a crash, the output file you have been writing is wounded, at least to the extent that some control information is incomplete, incorrect, or missing. When Root opens such a file, it attempts to reconstruct that information in memory; the file itself is not altered. This has several implications.

We also point out that simply inserting a periodic write() and nothing else provokes side effects that may or may not be acceptable. Recall that, under default conditions, each write() produces a new cycle on the file. Allowing many cycles to accumulate is almost surely wasteful of disk space, since every histogram-like object will appear once per cycle, differing only in the number of entries. Because these are cumulative, only the last, i.e., highest, cycle is really interesting; the others are mostly baggage.

Some HepFileManager class constructors accept an argument you may use to specify how the output file is to be treated. One of these is HepFileManager::HEP_REFRESH, which tells HepTuple to overwrite the highest existing cycle rather than create a new one. Putting this together with a periodic write(), I eventually produced the following test. Most of the "boilerplate" has been removed to keep the relevant pieces from getting lost in a thicket of details.

			:
			:
//  Instantiate a Root manager and specify REFRESH mode.

  string topdir = "crashola.root";
  fMan = new HepRootFileManager( topdir, HepFileManager::HEP_REFRESH );
			:
			:
//  Make some "events". Generate pseudo tracks using random numbers.

  for( Int4 j=1; j<=nevents; ++j ) {
			:
//  Do all the grunt work to define values for the physics variables.
			:
    tupl.capture("TRACKS::Ntracks",numTracks);
    tupl.capture("TRACKS::Pt" ,&(PtArray[0]),  numTracks);
    tupl.capture("TRACKS::Eta",&(EtaArray[0]), numTracks);
    tupl.capture("TRACKS::Phi",&(PhiArray[0]), numTracks);
    tupl.storeCapturedData();
    tupl.clearData();
			:
//  Update the output file every 5000 events.
			:
    if( j%5000 == 0 ) fMan->write();

//  Force the job to abort between file updates.

    if( j == 11000 ) abort();
  }
  fMan->write();
  delete fMan;
  return 0;
}

As intended, the job wrote an output file after 5000 events, updated the file (by overwrite!) after 10000 events, and crashed after 11000 events. I then started a Root browser. When attempting to open the file, TFile complained that the file had not been properly closed and performed a recovery. On inspection, I found that both the histograms and the ntuple claimed 10000 entries, as expected: less than had been generated, but a good deal more than zero.

The important point here is the periodic call to write(). Using HEP_REFRESH as well is a nicety worth considering.