Writing ZOOM Ntuples
A Guided Tour

Updated
12 Nov. 2001

Several people have reported problems while attempting to write ntuple files via HepTuple. This note is an attempt to remove some of the confusion via a heavily annotated example. In particular, we are using large fragments from the file hepToyz.cc as found in the HepTuple/examples directory of the ZOOM distribution. As the name suggests, that program is a toy but, at the same time, attempts to do several of the things a real analysis program might do.

In the spirit of a simple Monte Carlo program, hepToyz.cc chooses a number of tracks (limited to the range between 2 and 12) out of a flat random distribution and, for each track, generates values for Pt, Eta and Phi from other distributions with reasonable shapes. None of that is of any importance here except that it serves as a source of semi-realistic "data" for filling histograms and ntuples. The most important feature of these tracks is that their number varies from event to event, as would be the case with real data.

After clearing out a lot of stuff that is irrelevant for our purposes here, the front end of hepToyz.cc looks like the following fragment.


#define MAXTRACKS 12

//  Declare the pointer to the file manager at this level to
//  make it available at file scope.

HepFileManager *tManager;

Int4 main(Int4 argc, char** argv) {
  Int4 nevents=2500;
  bool first;
  Int4   numTracks;
		: 
		: 
  Float8 Pt, Eta, Phi;
  Float8 Px, Py, Pz;
  Float8 PtArray[MAXTRACKS];
  Float8 EtaArray[MAXTRACKS];
  Float8 PhiArray[MAXTRACKS];
  Float8 PxArray[MAXTRACKS];
  Float8 PyArray[MAXTRACKS];
  Float8 PzArray[MAXTRACKS];

  if( first )  tManager->mkdir("junk");
  tManager->cd("junk");

//  Instantiate an ntuple, declare it to be columnwise (this is default
//  anyhow) and tell the manager where to find the data items when we tell
//  it to capture. This is a handy way to do this.  We establish a reference
//  to the data item here and give the item a value when the program logic
//  warrants it. After all items have been assigned a value, we tell the
//  ntuple machinery to capture everything. This works because the location
//  of the data doesn't change - just the value sitting in that location.
//  There's a lot of action here, so we walk through it carefully.

//  All the uproar with " p = first
//	 		    ? tManager-> ...
//			    : tManager-> ... "
//  is a trick to switch between a new run and a continuation run on the
//  basis of a single switch variable and still keep everything in scope.
//  A simple
//            if(first) {
//              HepHist1D & ppt = tManager-> hist1D(...
//                     :
//            } else {
//              HepHist1D & ppt = tManager-> retrieveHist1D(...
//                     :
//            }
//  doesn't do it because the { }'s define a scope for the declarations
//  inside, but it's the wrong scope. Consequently, we resort to this
//  monstrosity. Note also that we define tupl as a reference rather than
//  a pointer to an ntuple object. Choosing to use a pointer would require
//  changes to the syntax for the rest of the program.

  HepNtuple & tupl = first
                   ? tManager->ntuple("Kinematics_CWN")
                   : tManager->retrieveNtuple("Kinematics_CWN");

  tupl.setColumnWise();			// These are both default but I like
  tupl.setDiskResident();		// to say it specifically anyhow.

//  N. B. In addition to defining the ntuple block names and column names,
//        these declaration also indirectly declare the data types of the
//        corresponding items, either through the type of the data item,
//        the type of the default value or the type of the referenced item.
//        This is fairly subtle and needs some care.

  if( first ) {
//  Define a block named TRACKS and a column named Ntracks.  In addition,
//  declare this column to contain an index variable and specify its range.

    tupl.columnAt("TRACKS::Ntracks",&numTracks,(Int4)2); // Supply default value
    tupl.span("TRACKS::Ntracks",0,MAXTRACKS);	         // Specify a nametag

//  Declare a column array named Pt and say that it is a variable length
//  array indexed by Ntracks and has maximum dimension MAXTRACKS.  Repeat
//  this for each column within the Tracks block. Note that we specifically
//  cast the default value as a Float8.  This determines the type of the data.

//  There are a number of variations on how to declare indexed column arrays.
//  We use several of them here to demonstrate some of the possibilities.

    tupl.columnArray( "TRACKS::Pt", 1, (Float8)99.9 );
    tupl.dimension( "TRACKS::Pt", MAXTRACKS );	// Specify the nametag
    tupl.index( "TRACKS::Pt", "Ntracks" );

//  Do the same things for the other five column arrays.  Note that we need
//  to declare default values, dimensions and index variables for each column!

//  Note that once a nametag (e. g., TRACKS::Pt) has been declared, it does not
//  need to be repeated. It remains the default value till explicitly changed.

    tupl.columnArray( "TRACKS::Eta", 1, (Float8)99.9 );
    tupl.dimension( MAXTRACKS );		// Default nametag 
    tupl.index( "TRACKS::Eta", "Ntracks" );	// Specify nametag

    tupl.columnArray( "TRACKS::Phi", 1, (Float8)99.9 );
    tupl.dimension( "TRACKS::Phi", MAXTRACKS );	// Specify nametag
    tupl.index( "TRACKS::Phi", "Ntracks" );	// Specify nametag

    tupl.columnArray( "TRACKS::Px", 1, (Float8)99.9 );
    tupl.dimension( MAXTRACKS );		// Default nametag
    tupl.index( "Ntracks" );			// Default nametag 

// Two-fold chain since each method returns a reference

    tupl.columnArray( "TRACKS::Py", 1, (Float8)99.9 )
        .dimension( MAXTRACKS );
    tupl.index( "TRACKS::Py", "Ntracks" );

// Three-fold chain since each method returns a reference

    tupl.columnArray( "TRACKS::Pz", 1, (Float8)99.9 )
        .dimension( MAXTRACKS )
        .index( "Ntracks" );

  } else {
		:
	Code to deal with a RE-instantiation of everything
		:
  }

//  Finally, print out the block descriptor string that this produced.
//  This is a handy way to see if HepTuple really understood what we are
//  trying to do.  Note that, for all managers, the output descriptor is
//  in HBook format.

  cerr << "\nPrint the Format string for Block TRACKS." << endl;
  cerr << tupl.blockFormat( "TRACKS" ) << endl;

//  Release everything so we can define values for all quantities.

  tupl.clearDataBlock( "TRACKS" );
		:
		:

Up to this point, all we've accomplished is a detailed declaration of the ntuple columns and all of their characteristics. At this point, it might be instructive to show the block format string that got constructed for us. Other than being broken onto two lines so we could see all of it, that string is


    Ntracks:I*4::[0,12],Eta(Ntracks):R*8,Phi(Ntracks):R*8,Pt(Ntracks):R*8,
    Px(Ntracks):R*8,Py(Ntracks):R*8,Pz(Ntracks):R*8

and, had we had the wit to construct this on our own, we could have used it in a call to block(...) and done the whole job in one stroke except for defining default values.

To actually put something into the ntuple goes like the following fragment.


//  Make some "events". Generate pseudo tracks using random numbers.

  for( Int4 j=0; j<nevents; ++j ) {

//  Generate some pseudo tracks.

//  For the current "event" choose a number of tracks between 2 and MAXTRACKS-2.
//  We're going to throw in the last two tracks later.

    numTracks = (Int4)Rflt.fireInt( 2, MAXTRACKS-1 );

//  For each track, generate a pt and an eta. Transform those into (px,py,pz) 

    for( Int4 k=0; k<numTracks; ++k )  {
		:
    Do things to assign values to Pt, Eta, Phi, Px, Py, and Pz
		:

//  Catch the data for the ntuple arrays, store it away 

      PtArray[k] = Pt;
      EtaArray[k] = Eta;
      PhiArray[k] = Phi;
      PxArray[k] = Px;
      PyArray[k] = Py;
      PzArray[k] = Pz;
		:
		:
    }

//  Finished making tracks for this event. Make some histogram and ntuple
//  entries. Capture the columns one by one before they go out of scope,
//  tell HepTuple to push everything out to the file, clear everything and
//  go do it all again.  Note that we explicitly give the length (for the
//  current event) for the variable-length arrays.

    tupl.capture( "TRACKS::Ntracks", numTracks );
    tupl.capture( "TRACKS::Pt",  &(PtArray[0]), numTracks );
    tupl.capture( "TRACKS::Eta", &(EtaArray[0]), numTracks );
    tupl.capture( "TRACKS::Phi", &(PhiArray[0]), numTracks );
    tupl.capture( "TRACKS::Px",  &(PxArray[0]), numTracks );
    tupl.capture( "TRACKS::Py",  &(PyArray[0]), numTracks );
    tupl.capture( "TRACKS::Pz",  &(PzArray[0]), numTracks );
    tupl.storeCapturedData();
    tupl.clearData();
		:
		:
  }

A minor generalization on this, useful in the case where we want to lay on a cut of some sort and don't know a priori how many elements of the arrays we really want to store, would look like the following. We require that Pt be greater than 10 GeV/c.


//  Make some "events". Generate pseudo tracks using random numbers.

  for( Int4 j=0; j<nevents; ++j ) {

//  Generate some pseudo tracks.

//  For the current "event" choose a number of tracks between 2 and MAXTRACKS-2.
//  We're going to throw in the last two tracks later.

    numTracks = (Int4)Rflt.fireInt(2,MAXTRACKS-1);

//  For each track, generate a pt and an eta. Transform those into (px,py,pz) 

    Int4 keep = 0;
    Float8 Ptmin = 10.0;
    for( Int4 k=0; k<numTracks; ++k )  {
		:
    Do things to assign values to Pt, Eta Phi Px, Py, and Pz
		:

//  Lay on the Pt cut and store the survivors.

      if( Pt > Ptmin ) {
        PtArray[keep] = Pt;
        EtaArray[keep] = Eta;
        PhiArray[keep] = Phi;
        PxArray[keep] = Px;
        PyArray[keep] = Py;
        PzArray[keep] = Pz;
        ++keep;
      }
    }
		:
		:
//  Finished making tracks for this event. If anything survived the cut,
//  make some ntuple entries. Capture the columns one by one before they
//  go out of scope, tell HepTuple to push everything out to the file,
//  clear everything and go do it all again.

      if( keep > 0 ) {
        tupl.capture("TRACKS::Ntracks", keep );
        tupl.capture("TRACKS::Pt",  &(PtArray[0]),  keep );
        tupl.capture("TRACKS::Eta", &(EtaArray[0]), keep );
        tupl.capture("TRACKS::Phi", &(PhiArray[0]), keep );
        tupl.capture("TRACKS::Px",  &(PxArray[0]),  keep );
        tupl.capture("TRACKS::Py",  &(PyArray[0]),  keep );
        tupl.capture("TRACKS::Pz",  &(PzArray[0]),  keep );
        tupl.storeCapturedData();
        tupl.clearData();
  }
			:
			:
  }

Together, these two code fragments show how to establish an ntuple, define all of the characteristics of the columns and, finally, how to go about filling those columns and storing the data for later analysis. We emphasize that this is only a subset of the things one may do, but a particularly useful subset. Having digested this much, we can go on to some additional observations, generalizations and two warnings.

First, note our use of columnArray, especially the second argument. The header file says that the second argument is "ncolumns" - the number of columns to reserve for this array. Why then, did we get away with putting a "1" there? The answer is that a subsequent call to dimension supersedes whatever we tell columnArray about the size of the array. In the special case of a 1-dimensional array, we could just as well have written, for instance


    tupl.columnArray( "TRACKS::Pt", MAXTRACKS, (Float8)99.9 );
    tupl.index( "TRACKS::Pt", "Ntracks" );

and all would have worked nicely. What then is the point of the dimension call? A call to dimension is required if the array we want to declare has more than one dimension. It is my particular taste to always use the general form for these declarations. You, of course, are not bound by my eccentricities and may freely use the "short form" as you choose - but only for 1-dimensional column arrays.

Having introduced the notion of arrays of more than one dimension, there is one other observation that needs to be made and a (temporary, we hope) warning. First the observation.

For discussion, we assume that it is desired to use a 2-dimensional array of columns. That array's size may be either variable, as above, or fixed. The following discussion applies in both cases. Bilingual users already know that C/C++ and Fortran have opposite conventions about array storage with C/C++ using row-major storage and Fortran using column-major storage. Since we have attempted to support both HBook and Root as file managers, we had to deal with arrays in both storage modes and that led to a compromise. Since you write your code in C++, you should declare your arrays in the "natural" way for that language - i. e., with the last index varying most rapidly. Thus, if we had chosen to combine the Px, Py and Pz arrays of the example code above into a single, 2-dimensional array, PxPyPz, and Pt, Eta and Phi into a single 2-dimensional array, PtEtaPhi, we would declare those arrays as


  Float8   PxPyPz[MAXTRACKS][3];
  Float8 PtEtaPhi[MAXTRACKS][3];

Then, independent of the choice of manager, the declaration of column characteristics would look like

  tupl.columnArray( "TRACKS::Ptep", dummy, (Float8)99.9 );
  tupl.dimension( 3, MAXTRACKS );	// Note order of dimensions!
  tupl.index( "Ntracks" );

  tupl.columnArray( "TRACKS::Pxyz", dummy, (Float8)99.9 );
  tupl.dimension(  3, MAXTRACKS );	// Note order of dimensions!
  tupl.index( "Ntracks" );

and we see that the compromise is that in the call to dimension, the arguments must be presented in the order opposite to how the arrays were declared, i. e., in Fortran order.

Finally, we need to issue two warnings. The first applies to people who want to go beyond 2-dimensional arrays. Everything in the previous paragraph still applies and the dimensions must be presented in the order opposite to those in the array declaration. That does not change. The warning concerns all versions of Root up through v2_22_09. There is a known bug in TBranch that causes it to mishandle variable-length arrays with more than 2 dimensions. Fixed-length arrays are immune to this bug. Everything appears to run but a large part of the array is actually filled with 0.0. The good news is that this bug is scheduled to be fixed in the next release.

[Update. The promised fix was indeed made and Root versions after v2_22_09 no longer suffer from this problem. This warning is now of only historical interest and will vanish after some time passes.]

The second warning applies to users of the HBook manager. For various architectural reason, the following limitations are imposed by HBook:

There has already been at least one case of a user exceeding that doubles limit and getting a set of very murky error messages and a terminated run for their trouble. To avoid sharing that fate, you should be quite judicious in your use of doubles if you intend to use the HBook manager. Another obvious strategy is to simply abandon the HBook manager and use the Root manager instead.

John Marraffino
marafino@fnal.gov