NSF Grant Triggers Wide Computing Possibilities from BTeV
by Mike Perricone
But with a $4.98 million National Science Foundation grant in the area of Information Technology Research, Fermilab's B-physics at the Tevatron experiment (BTeV) just might help solve the puzzle of "Why don't things always work as well as we'd like?"
That question represents the theme for educational outreach components of the effort to build a fault tolerance system into the BTeV trigger and data acquisition (DAQ) project. BTeV's goal is to assemble as many as 10,000 parallel computers and make them work together dependably and consistently in the triggering and DAQ system, even though the system incorporates different kinds of computers with different tasks. The BTeV trigger will be challenged to reconstruct 15 million particle events per second, and to use that reconstruction data in deciding which events to keep for further analysis. It will be further challenged to perform the reconstructions around the clock, while spotting and correcting any problems that arise.
The idea of self-awareness or introspection in computers is not new. But the idea of achieving "fault tolerance" or self-correction at this level of complexity is new and intriguing.
"People have written fault tolerant systems for smaller numbers of computers, on the order of hundreds," said BTeV cospokesperson Joel Butler. "But when you get to the ambitious level we're working at, with perhaps 10,000 computers, those ideas do not scale up. You can't just change the number of processors and have it all work out. In this very self-aware computing system, the software will be expected to solve problems from the level of the smallest processor in the system all the way up to the level of whether the whole thing is really behaving as expected."
Imagine the possibilities.
"This is a very hot topic in electrical engineering and computer science right now, this concept of evolvability and fault tolerance," said BTeV collaborator Paul Sheldon of Vanderbilt University, the project's principal investigator. "With thousands of components, you'll always have something going wrong, somewhere. You want a system to be able to adapt to a fault because if it doesn't, you'll crash or miss something critically important. It [fault tolerance] is going to be useful in complex systems such as weather monitoring. Or in vehicle navigation where you can literally crash. Think of the country's air traffic control system, which is really old and can't be easily upgraded. That's why we try these things [in science] first, to make them work without the agony. Technology like this eventually percolates down, and hopefully it will someday make your own computer crash less often."
Both the technology and the thinking will also percolate beyond Fermilab, beyond the four collaborating universities (Vanderbilt, the University of Illinois, Pittsburgh and Syracuse), and beyond the graduate students who will be working on the project.
Adapting the QuarkNet model established by Fermilab's Education Department, the BTeV trigger computing project aims to involve high school teachers. The QuarkNet method trains high school teachers to train other teachers, and connects students through the Web to ongoing particle physics experiments. The BTeV educational adaptation would include exercises in the concepts of exception handling and fault tolerance: in other words, how to work around glitches in day-to-day applications without an entire structure coming apart. What happens if it rains on graduation day? How does a baseball league schedule work, especially when games are canceled? How is the production of a play affected when understudies must perform?
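One of these everyday cases, the baseball schedule, maps directly onto exception handling. What follows is a minimal sketch in Python of the kind of exercise described, not material from the BTeV or QuarkNet curriculum. The names `Rainout`, `play_game` and `run_schedule`, the team names and the makeup dates are all invented for illustration.

```python
class Rainout(Exception):
    """Raised when a game cannot be played on its scheduled date."""


def play_game(game, date, rainy_dates):
    """Play a game, or raise Rainout if it rains on that date."""
    if date in rainy_dates:
        raise Rainout(f"{game} rained out on {date}")
    return f"{game} played on {date}"


def run_schedule(schedule, makeup_dates, rainy_dates):
    """Play every game, moving rained-out games to spare makeup dates.

    A canceled game is handled where it happens, so the rest of the
    schedule carries on instead of the whole season coming apart.
    """
    results = []
    spare = list(makeup_dates)
    for game, date in schedule:
        try:
            results.append(play_game(game, date, rainy_dates))
        except Rainout:
            if not spare:
                results.append(f"{game} canceled (no makeup date left)")
                continue
            # Assumption: makeup dates are dry, to keep the sketch short.
            results.append(f"{game} played on makeup date {spare.pop(0)}")
    return results


# Hypothetical two-game schedule with rain on the second date.
schedule = [("Team A vs. Team B", "May 1"), ("Team C vs. Team D", "May 2")]
results = run_schedule(schedule, ["May 9"], rainy_dates={"May 2"})
for line in results:
    print(line)
```

The point of the exercise is the `try`/`except` around a single game: the glitch is caught and worked around at the smallest level that can deal with it.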
Underlying the computing connections is a basic tenet of science: the need for a methodical way of thinking, of exploring the consequences when things go wrong, of devising plans to correct or work around those consequences.
"It's an important part of scientific literacy for the general public," said Marge Bardeen, head of Fermilab's Education Department. "Having been through an experience of how science works, they would gain a better understanding of basic research and see its value. They would gain a better understanding of how to make careful and responsible decisions about science, about funding and other issues. Also, we don't often teach science as an experimental, research-based, kid-centered discipline. We don't often teach science the way science is done. First, how do we help teachers understand how scientists work; and second, how do we figure out how to do that in a classroom? That's what QuarkNet tries to do, and that's what the BTeV group will try to do."
The BTeV trigger system distinguishes itself by essentially merging with the experiment, assuming the role of part of the apparatus. The trigger system will reconstruct every bunch crossing of the Tevatron; bunch crossings occur 7.6 million times per second, or 132 nanoseconds apart. The data system will attempt to find all the tracks and interaction vertices, looking for evidence of a decay downstream of the interaction vertex that could come from a b-particle. It then decides which events to keep and which to discard.
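The two figures quoted for the bunch crossings are the same quantity expressed two ways, and the keep-or-discard step can be pictured as a simple filter. The Python sketch below is purely illustrative. The `Crossing` class, the `detached_vertex_mm` field and the 0.5 mm threshold are hypothetical names and values chosen for the example; they are not parameters of the actual BTeV trigger.

```python
from dataclasses import dataclass

CROSSING_INTERVAL_NS = 132  # spacing between bunch crossings, from the article


def crossings_per_second(interval_ns):
    """Convert a crossing spacing in nanoseconds to a rate per second."""
    return 1e9 / interval_ns


@dataclass
class Crossing:
    """Hypothetical summary of one reconstructed bunch crossing."""
    crossing_id: int
    # Largest distance (mm) from the interaction vertex to a decay vertex
    # found downstream of it; 0.0 if no such vertex was found.
    detached_vertex_mm: float


def keep(crossing, threshold_mm=0.5):
    """Keep the crossing if a decay vertex sits far enough downstream.

    The threshold is an assumed value for illustration only.
    """
    return crossing.detached_vertex_mm >= threshold_mm


rate = crossings_per_second(CROSSING_INTERVAL_NS)
print(f"{rate / 1e6:.1f} million crossings per second")

sample = [Crossing(1, 0.0), Crossing(2, 1.8), Crossing(3, 0.2)]
kept = [c.crossing_id for c in sample if keep(c)]
print("kept crossings:", kept)
```

A spacing of 132 nanoseconds works out to roughly 7.6 million crossings per second, which is why a decision like `keep` has to be made quickly and reliably for every crossing.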
"The trigger must work reliably and quickly, over a long period of time," said Fermilab physicist Erik Gottschalk, who has worked on designing the trigger system. "This process is not being done off-line. It's integral to the experiment itself instead of being a step removed, as it would if it were being handled off-line. If it fails, it affects the data. Everything counts on the trigger."
And that trigger will count on the fault-tolerance software developed with the help of the NSF grant, approximately $1 million per year for five years, effective as of October 1. BTeV applied for the grant after a Fermilab technical review of the experiment proposal suggested strengthening the fault-tolerance aspects of the system. Collaborators reached out to people at their own institutions who were conducting this kind of research: the Institute for Software Integrated Systems at Vanderbilt, the Coordinated Science Laboratory at the University of Illinois, the (research group) at Syracuse and the (research group) at Pittsburgh. Together, the experiment and university collaborators wrote a proposal that survived competition with thousands of other entries, emerging with a share of the $156 million that NSF has targeted "to preserve America's position as the world leader of computer science and its applications."
NSF is especially interested in possible applications, scientific and commercial. The BTeV proposal points to a wide range of uses including medicine (data acquisition in Positron Emission Tomography), astrophysics (the Pierre Auger Cosmic Ray Observatory and its 1,600 detector stations), vehicle navigation, weather monitoring and disaster warning systems, widely available Internet services, and others yet to be described. In fact, the collaboration intends to hold a series of workshops, inviting representatives from these areas of technology, to discuss these connections and expand the list.
Sheldon, as principal investigator of the project, coordinates the apportioning of resources. He points out that the funds are directed specifically to computer scientists.
"No physicists are actually being funded by this grant," he said. "The whole point was to bring in people from other disciplines."
Butler, whose experience dates back to early fixed-target experiments at Fermilab, is enthusiastic about expanding the formal connections between high-energy physics and computer science among several institutions.
"You would think it's the most natural of collaborations," Butler said, "high-energy physics with its complicated computer needs, and university computer scientists with their resources. But there really haven't been that many examples. It's exciting that NSF has opened up this possibility."
On the Web:
National Science Foundation http://www.nsf.gov
The BTeV Trigger Movie http://www-ppd.fnal.gov/btev-trigger-w/presentations/Animated_trigger
Last modified 11/9/2001 by C. Hebert