Fermi National Laboratory

Volume 24  |  Friday, March 16, 2001  |  Number 4

The Odds of Discovery

by Kurt Riesselmann

How do scientists know when their experimental results add up to a discovery? If they are 99 percent sure, is that sure enough?

In day-to-day experience, a chance of 99 percent often seems like a sure thing. Many people would bet a month's salary on an event with such a probability. But it's easy to imagine cases in which a one-in-a-hundred chance seems far too high.

Would you cross the bridge over a deep canyon if there were a one percent chance it would collapse?

Scientists analyzing their experimental data are also cautious. For them, claiming a result with 99 percent certainty leaves plenty of room for Mother Nature to prove them wrong--with career-wrecking consequences. Accordingly, scientists have developed a careful language to describe a promising result. Their keywords are "hint," "indication" and "evidence," all of which fall short of the actual claim of a "discovery."

The criterion used to decide which word to pick is hidden behind a simple Greek letter: σ. Pronounced "sigma," this symbol is the unit that describes how reliable a result is.

[Photo: Bill Carithers, 1994 CDF cospokesperson]

"Most people in the field would agree that if the significance of data is less than 3 sigma then the result might be just a fluctuation," said Bill Carithers, physicist at Lawrence Berkeley National Laboratory. "If the significance is greater than 5 sigma it is a discovery. In between, there are various shades of gray."

Scientists refer to sigma as the "standard deviation." It is the decisive parameter of the Gaussian curve, a mathematical function that describes the distribution of data from many simple experiments. Citing a certain number of sigma directly translates into a probability. Three sigma, for example, is equivalent to a 99.73 percent chance that a future experiment will yield a compatible result. Scientists, though, wouldn't trust such a result to withstand future scientific scrutiny.
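
The translation from a number of sigma to a probability can be checked in a few lines. The sketch below is only an illustration, assuming an ideal Gaussian curve and counting deviations on both sides of the mean; the function names are made up for this example.

```python
import math


def fraction_within(n_sigma: float) -> float:
    """Fraction of an ideal Gaussian lying within +/- n_sigma of the mean."""
    return math.erf(n_sigma / math.sqrt(2.0))


def fraction_outside(n_sigma: float) -> float:
    """Fraction lying farther than n_sigma from the mean, on either side."""
    # erfc keeps full precision for the very small tails at large n_sigma.
    return math.erfc(n_sigma / math.sqrt(2.0))


for n in (1, 2, 3, 5):
    print(f"{n} sigma: {100 * fraction_within(n):.5f}% inside, "
          f"{100 * fraction_outside(n):.5f}% outside")
```

Real measurements rarely follow the ideal curve exactly, so these percentages are best read as a common yardstick rather than a guarantee.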

"We have published hundreds of papers with thousands of numbers," said Carithers, a member and former spokesman of the CDF collaboration at Fermilab. "You expect some of the results to be outside the 3 sigma range."

Physicists think that only a 5-sigma result, indicating a 99.99994 percent chance that the result can be reproduced, is trustworthy and can survive the test of time.
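
Carithers' point about thousands of published numbers can be made concrete. The sketch below assumes, purely for illustration, 1,000 independent measurements that each follow an ideal Gaussian curve; neither the count nor the independence comes from CDF.

```python
import math


def tail_probability(n_sigma: float) -> float:
    """Chance that one Gaussian measurement lands outside +/- n_sigma."""
    return math.erfc(n_sigma / math.sqrt(2.0))


def expected_outliers(n_measurements: int, n_sigma: float) -> float:
    """Average number of measurements expected outside +/- n_sigma."""
    return n_measurements * tail_probability(n_sigma)


def chance_of_at_least_one(n_measurements: int, n_sigma: float) -> float:
    """Chance that at least one of the independent measurements is an outlier."""
    return 1.0 - (1.0 - tail_probability(n_sigma)) ** n_measurements


N_MEASUREMENTS = 1000  # hypothetical count, chosen only for illustration

for n_sigma in (3, 5):
    print(f"{n_sigma} sigma: expect {expected_outliers(N_MEASUREMENTS, n_sigma):.4f} "
          f"outliers, chance of at least one = "
          f"{chance_of_at_least_one(N_MEASUREMENTS, n_sigma):.4f}")
```

Under these assumptions a handful of 3-sigma flukes is the norm, while a 5-sigma fluke stays rare even across a large body of results.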

In 1994, Carithers and about 400 CDF colleagues faced the dilemma of evaluating the significance of the first top quark data. The collaboration decided to publish a paper proclaiming "evidence."

"There was a long pause in data taking during Run I," Carithers said. "We had analyzed the first data, and we thought it was important to tell our colleagues what we had. If we had been able to immediately continue collecting more data, then we might have held back on publishing the results."

It took another year before new data boosted the significance of their original result and allowed for claiming the top quark discovery, simultaneously with the DZero collaboration at Fermilab.

"Our discovery paper [in 1995] was based on 4.8 sigma," Carithers recalled. "But there were corroborating pieces of evidence. We had the feeling that the case was actually stronger."

Identifying a new particle and determining the significance of its signal is quite different from rolling a pair of dice and calculating the probability of the score. Particle physicists have to study background events, which are created by other particles and leave similar signals. Separating a desirable signal from background look-alikes amounts to identifying the face of a specific person in a blurred photo of a large crowd.

Scientists need to know characteristic details of both signal and background events to filter the data and obtain a sharper image. If plenty of "photos" and good filtering techniques exist, physicists can reconstruct the "image" of a new particle.

Simulations are important in order to judge how much an anticipated signal could differ from the expected background noise.

"We use both data and simulation to understand the background and the reliability of our results," explained John Conway, CDF physicist at Rutgers University. "To calculate the significance of a discovery, we actually simulate a large ensemble of pseudo experiments. For each pseudo experiment we generate a certain number of events that we would have seen in the detector."

Carrying out the simulations involves plenty of challenges and opportunities for error.
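
Stripped of those challenges, the ensemble idea Conway describes can be sketched in its barest form. The example below is not the CDF procedure: it assumes a simple counting experiment in which the background alone produces a Poisson-distributed number of events, and the background mean and observed count are invented numbers.

```python
import math
import random
from statistics import NormalDist


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw one Poisson-distributed event count (Knuth's multiplication method)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def pseudo_experiment_p_value(background_mean: float, observed: int,
                              n_trials: int, seed: int = 1) -> float:
    """Fraction of background-only pseudo experiments with >= observed events."""
    rng = random.Random(seed)
    hits = sum(sample_poisson(background_mean, rng) >= observed
               for _ in range(n_trials))
    return hits / n_trials


def p_value_to_sigma(p_value: float) -> float:
    """One-sided Gaussian significance for a p-value strictly between 0 and 1."""
    return NormalDist().inv_cdf(1.0 - p_value)


BACKGROUND_MEAN = 4.0  # hypothetical expected background events
OBSERVED = 10          # hypothetical number of events seen in the detector

p_value = pseudo_experiment_p_value(BACKGROUND_MEAN, OBSERVED, n_trials=100_000)
significance = p_value_to_sigma(p_value)
print(f"p-value = {p_value:.5f}, about {significance:.2f} sigma")
```

Nothing about the detector is modeled here. Reaching 5 sigma this way would also take tens of millions of pseudo experiments, since only about one background-only trial in a few million fluctuates that far.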

"You have to understand exactly how your detector is working to determine the background with particular uncertainties," Carithers pointed out. "Your detector can miss a track or can manufacture a signal out of noise. These uncertainties enter every single event."

In a recently completed report on the discovery potential of the Tevatron, Conway and several colleagues studied how many proton-antiproton collisions it would take to produce a significant number of Higgs bosons, postulated force carriers that could explain why some but not all particles have mass.

Identifying enough Higgs events among the wealth of particle signatures produced in the collisions of Collider Run II will take several years (see graphic). If the Higgs is too heavy to be produced at the Tevatron, physicists could report the first exclusion limits in less than three years. Those limits rely on a 95 percent confidence level, a lower standard than the one used for discoveries.

"We don't worry as much about falsely excluding the Higgs boson," Conway said. "It is much more important to avoid a false positive."

Of course Conway and his colleagues are much more excited about the chance of finding a true positive signal of the Higgs at the Tevatron.

However, you can bet they won't take a chance on claiming its discovery too early.

Would you bet on this game?

The Gaussian curve provides a measure to judge how the outcome of a game or a scientific measurement compares to the "true" or "ideally expected" result.

Assume you throw a coin 100 times and you observe 34 heads. Although a single game has no statistical significance, you start to worry. Is the game rigged? Is your assumption that it is a fair coin wrong?

The Gaussian curve helps you to determine how likely your 34-heads score is. The coin experiment has a most likely outcome of 50 heads (called the mean value), which can be determined by numerical simulation or by repeating the game with a fair coin many times. The Gaussian curve then introduces a quantity called the standard deviation, denoted by the Greek symbol σ (sigma). Through repetition or simulation, scientists are able to determine that σ equals 5 in the 100-coin-throw experiment. This fixes the statistical significance of the outcome of the experiment (see table). It states, for example, that about 68.3% of all 100-coin-throw experiments lead to 45 to 55 heads, called the 1σ range (50 ± σ, with σ = 5).
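
Readers with a computer can repeat the game themselves. The sketch below plays the 100-throw game 20,000 times with a simulated fair coin and measures the mean and the standard deviation; the number of games and the random seed are arbitrary choices for this illustration.

```python
import math
import random
import statistics


def simulate_heads_counts(n_games: int, n_tosses: int = 100,
                          seed: int = 2001) -> list:
    """Play n_games games of n_tosses fair coin throws; return heads per game."""
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(n_tosses))
            for _ in range(n_games)]


counts = simulate_heads_counts(20_000)
mean_heads = statistics.mean(counts)
sigma_heads = statistics.pstdev(counts)

# Textbook value for a fair coin: sqrt(n * p * (1 - p)) with n = 100, p = 0.5.
sigma_expected = math.sqrt(100 * 0.5 * 0.5)

# Share of games with 45 to 55 heads, counting both endpoints.
share_45_to_55 = sum(45 <= c <= 55 for c in counts) / len(counts)

print(f"mean = {mean_heads:.2f}, sigma = {sigma_heads:.2f} "
      f"(expected {sigma_expected:.1f}), share with 45-55 heads = {share_45_to_55:.3f}")
```

Because heads come in whole numbers and both 45 and 55 are counted, the simulated share of games in that range lands a few points above the smooth Gaussian figure. This is one small sign that the Gaussian curve is an approximation for the coin game.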

Your result of 34 heads is just outside the 3σ range (50 ± 3σ). The Gaussian curve predicts that less than 0.3% of all games with a fair coin have a result outside the 50 ± 15 range. By taking a close look at your coin (the experimental apparatus) and repeating the experiment many times you can reveal whether your result of 34 is a statistical fluke or the coin is flawed.
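
For a game this small, the odds of a 34-heads score can also be counted exactly instead of read off the Gaussian curve. The sketch below assumes a perfectly fair coin and asks how often a game ends at least as far from 50 heads as the observed 34, in either direction; the function name is invented for this example.

```python
import math


def prob_at_least_as_extreme(n_tosses: int, heads_observed: int) -> float:
    """Exact chance that a fair coin lands at least as far from n/2 as observed."""
    distance = abs(heads_observed - n_tosses / 2)
    favorable = sum(math.comb(n_tosses, k)
                    for k in range(n_tosses + 1)
                    if abs(k - n_tosses / 2) >= distance)
    return favorable / 2 ** n_tosses


N_TOSSES = 100
HEADS = 34

sigma = math.sqrt(N_TOSSES * 0.5 * 0.5)       # 5 for 100 throws
n_sigma = (HEADS - N_TOSSES / 2) / sigma      # distance from the mean in sigma
exact = prob_at_least_as_extreme(N_TOSSES, HEADS)
gaussian = math.erfc(abs(n_sigma) / math.sqrt(2.0))

print(f"{HEADS} heads is {n_sigma:+.1f} sigma from the mean")
print(f"exact chance: {100 * exact:.3f}%, Gaussian estimate: {100 * gaussian:.3f}%")
```

The two numbers differ somewhat, because the Gaussian curve is smooth while heads come in whole numbers, but both put the result well under one percent.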

The Gaussian curve, also called a bell-shaped or normal distribution, can be tall and thin or flat and wide: only its relative height is important. It often fits the data of unbiased experiments that allow for a symmetric outcome of the measurements (equal chance of recording a result larger or smaller than the mean value) and no constraints on the experimental value (numbers from minus infinity to plus infinity). Most experiments, including the coin game, do not satisfy these requirements; but the Gaussian distribution still yields a satisfactory description. However, in experiments looking for rare events, such as the Higgs boson, scientists must find better mathematical distributions to describe their data. After finishing their analysis, scientists usually return to the well-known Gaussian standard deviation σ to indicate the significance of their results to people not familiar with the details of their work.
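
A small comparison shows why the Gaussian curve can mislead when events are rare. The sketch below assumes the Poisson distribution, a standard choice for counting rare events that this article does not name, and uses invented numbers: an expected background of 3 events and an observation of 10.

```python
import math


def poisson_tail(mean: float, observed: int) -> float:
    """Exact chance of seeing at least `observed` events for a Poisson mean."""
    below = sum(math.exp(-mean) * mean ** k / math.factorial(k)
                for k in range(observed))
    return 1.0 - below


def gaussian_tail(mean: float, observed: int) -> float:
    """Same chance if the count were Gaussian with sigma = sqrt(mean)."""
    n_sigma = (observed - mean) / math.sqrt(mean)
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


BACKGROUND_MEAN = 3.0  # hypothetical expected background
OBSERVED = 10          # hypothetical observed count

print(f"Poisson:  {poisson_tail(BACKGROUND_MEAN, OBSERVED):.6f}")
print(f"Gaussian: {gaussian_tail(BACKGROUND_MEAN, OBSERVED):.6f}")
```

With so few expected events, the Gaussian shortcut makes the fluctuation look far less likely than the Poisson count says it is, which would overstate the significance of the result.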

last modified 3/16/2001 by C. Hebert