Friends of Friends
Fermilab astrophysicists apply galaxy clustering analysis methods to genome research
by Mike Perricone
Is there really a Big Dipper?
That most famous of celestial signposts in the constellation Ursa Major, the Big Dipper is essentially a random distribution of stars that our eyes assemble into a pattern because that's what our eyes are designed to do. We see a string of lights across a dark background, and we draw a picture in the night sky.
In three dimensions, we would have a different image: seven stars at divergent distances. The light reaching our eyes at any moment has departed these seven sources at scattered times. While the middle five stars (Merak, Phecda, Megrez, Alioth and Mizar) are actually part of a cluster averaging 80.6 light-years away, the tip of the handle (Alkaid) is 100 light-years distant. And the pointer stars, showing the way to northern beacon Polaris, differ in distance by 45 light-years: Merak is 79 light-years from Earth, and Dubhe 124 light-years. The numbers tell us our eyes have been deceived.
Separating apparent patterns from significant patterns is a specialty of the Sloan Digital Sky Survey, mapping one-fourth of the northern sky in three dimensions, working on a cosmic scale well beyond that of individual stars to sort out details of large-scale structure.
"We think we have a good idea of what causes galaxies to cluster together," said Fermilab theoretical astrophysicist Josh Frieman. "The universe may have started off in an inflationary stage, which produced quantum vacuum fluctuations, which were amplified by gravity to form galaxies which cluster together."
Clustering is the key concept that began a process linking the statistical tools of astronomy and astrophysics to another field of research conducted at a far different scale. The process culminated in applying two tools commonly used in astrophysics, the "friends of friends" algorithm and the autocorrelation function, to the microscopic realm of genome research.
At Rush University in Chicago, researcher Gabor Firneisz and his colleagues were investigating structures within the recently completed map of the mouse genome. Specifically, they were searching for gene clusters related to autoimmune arthritis. But if they spotted a pattern, would the numbers back them up? Or would mathematics tell them that their eyes had been deceived? And what tool could they use for the test?
"I got a telephone call one day, out of the blue," Frieman said. "[Firneisz] had been able to map locations within the genome of genes associated with arthritis, and he had noticed that these particular genes didn't seem to be randomly distributed throughout the genome. They seemed to be clustered. He wanted a way to quantify that. He knew that in astrophysics, that's one thing we do: we quantify the clustering of galaxies. He found our web page, and called me. We talked for a while, he educated us in genetics, and we educated him in statistical astronomy."
The result of the mutual education was a distinctive collaboration: a small group (six members), combining two distant disciplines, genetics and astrophysics. That collaboration produced a paper, "A Novel Method to Identify and Quantify Disease-Related Gene Clusters," which has been submitted to the journal Bioinformatics, published by Oxford University Press. The journal focuses on new developments in genome bioinformatics (information from molecular biology and genome research) and computational biology.
"We're obviously not the first people to apply statistical techniques to the genome," Frieman said. "But as far as we can tell, we were the first to apply this kind of analysis to this issue of gene clustering."
Frieman's colleague, postdoctoral researcher Idit Zehavi of Fermilab and the University of Chicago, took on the major responsibility for calculating the autocorrelation function and applying the friends-of-friends algorithm to the Rush researchers' genome data. The latter technique is based in an area of statistical physics called percolation theory, and uses the same mathematical principles.
The great biological discovery of 50 years ago showed that DNA (deoxyribonucleic acid), the stuff of life, consists of two long strands twisted into a double helix. The strands are connected by bonds between the organic compounds adenine (A), cytosine (C), guanine (G) and thymine (T), which form these links by creating two base pairs: adenine with thymine (AT), and cytosine with guanine (CG). Relative distances along the helix are measured by counting the number of base pairs, or megabase pairs (Mbp), between locations. The location in megabase pairs forms the basis of the statistical analysis of the genome, just as megaparsecs are used in statistical analyses of the universe.
"The variable that you play with is the linking length," said Zehavi. "You assume a length that has a physical meaning, or in this case, a biological meaning. We used scales from the correlation function, showing where clustering occurs. You connect one gene to anything within that linking length. Then you take that second gene and connect it with anything within the linking length, and then again, and again. Those connected belong to the same group, and are called 'friends of friends.' You are close to me, so you are a friend. The next person is farther from me, but is close enough to you to be another friend. Friends within this linking distance stay within the group, so they are friends of friends."
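The linking procedure Zehavi describes can be sketched in a few lines of Python. This is only an illustration, not the collaboration's code: the function name, the gene positions and the 0.6 Mbp linking length are all hypothetical. It relies on the fact that along a single line, two points end up in the same friends-of-friends group exactly when every gap between neighboring points that separates them is no larger than the linking length.

```python
def friends_of_friends(positions_mbp, linking_length):
    """Group 1-D positions: neighbors closer than the linking length share a group."""
    groups = []
    for pos in sorted(positions_mbp):
        # If this point is within the linking length of the previous point,
        # it is a friend (or a friend of a friend) of that point's group.
        if groups and pos - groups[-1][-1] <= linking_length:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


# Hypothetical gene locations along one chromosome, in megabase pairs (Mbp).
genes = [3.1, 3.4, 3.9, 12.0, 25.2, 25.5, 40.0]
groups = friends_of_friends(genes, linking_length=0.6)

# Keep only groups with at least two members as candidate clusters.
clusters = [g for g in groups if len(g) >= 2]
print(clusters)  # [[3.1, 3.4, 3.9], [25.2, 25.5]]
```

In this made-up example the genes at 3.1 and 3.9 Mbp are farther apart than the linking length, yet they land in the same group because the gene at 3.4 Mbp is a friend of both.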
Simple in concept, but as in all science, that's just the beginning.
In the SDSS, these methods are applied to hundreds of thousands of galaxies in three dimensions, while in the genome study the methods were applied to approximately 200 genes in one dimension, as points along the chromosomes. The genome mathematics might appear less challenging, but the questions Zehavi and Frieman faced were whether the astrophysics methods would actually work and whether they could become tools in more extensive genome analyses.
The first step was to measure what is called the autocorrelation function, comparing the structure of the sample to a completely random distribution. Is there more clustering in the sample than would randomly exist? It passed the first test: a significant non-zero autocorrelation function, ruling out randomness.
"That first test is important, to see whether the distribution is random or if there really is some kind of clustering," Frieman said. "When your eye looks at a number of random points, you will always see clustering because that's what evolution led us to do. We look for patterns. I could show you a completely random distribution of points in two or three dimensions, and you would say, 'Sure, it looks like these things are clustered.' But this test established the reality of the clustering."
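One common way to carry out a test of this kind is to count pairs of points at each separation and compare the count with that of a random sample. The article does not say which estimator the collaboration used, so the sketch below assumes the simple form xi(r) = DD(r)/RR(r) - 1, where DD and RR are the fractions of data pairs and random pairs in each separation bin. The function names, the single 100 Mbp segment, the bin edges and the gene positions are all hypothetical. A value of xi near zero means the sample looks random at that scale; a value well above zero means an excess of close pairs.

```python
import random
from bisect import bisect_right
from itertools import combinations


def pair_fractions(points, bin_edges):
    """Fraction of all pairs whose separation falls in each [edge_i, edge_i+1) bin."""
    counts = [0] * (len(bin_edges) - 1)
    for a, b in combinations(points, 2):
        i = bisect_right(bin_edges, abs(a - b)) - 1
        if 0 <= i < len(counts):  # ignore pairs outside the binned range
            counts[i] += 1
    n_pairs = len(points) * (len(points) - 1) / 2
    return [c / n_pairs for c in counts]


def autocorrelation(data, length, bin_edges, n_random=1000, seed=0):
    """Estimate xi = DD/RR - 1 against uniform random points on [0, length]."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    randoms = [rng.uniform(0.0, length) for _ in range(n_random)]
    dd = pair_fractions(data, bin_edges)
    rr = pair_fractions(randoms, bin_edges)
    return [d / r - 1 if r > 0 else float("nan") for d, r in zip(dd, rr)]


# Hypothetical sample: three tight groups of five genes on a 100 Mbp segment.
genes = [c + 0.2 * k for c in (10.0, 50.0, 90.0) for k in range(5)]
xi = autocorrelation(genes, length=100.0, bin_edges=[0.0, 1.0, 5.0, 20.0])
print(xi)  # first bin strongly positive: many more close pairs than random
```

A real analysis would also need error estimates, for example from repeating the calculation on many random samples, before calling a non-zero value significant.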
Once the reality of the clustering was established, Zehavi, who is originally from Israel and studied at the Hebrew University in Jerusalem, went to work with the friends-of-friends analysis. The linking length was applied to a location until it brought in no new friends-of-friends; at that point, the most prominent groups of genes were identified as clusters.
Two developments made this astro-genome analysis possible. The first was the complete mapping of the mouse genome, an extension of the Human Genome Project. The second was the development, within the last five years, of an experimental tool called the cDNA microarray, or DNA chip.
This chip houses actual organic DNA material, similar in concept to a tissue sample being placed on a glass slide for examination under a microscope. The DNA chip is used to check for the expression of genes under specific conditions; for example, conditions of a specific disease. The chip contains short stretches of genes from a sequenced genome. The DNA and proteins are removed from the cells, and the remainder is RNA, the gene message that is the link between DNA and the protein encoded by the RNA. This RNA is reconverted to DNA with an enzyme, and the resulting material is used to probe the chip. The DNA finds its partner on the chip and certain genes are turned on, which the researcher can identify because the design of the chip is a known quantity.
When a medical researcher wanted to try something different with this tool, checked the web and phoned Fermilab, a new area of interdisciplinary science research opened up. As Frieman pointed out, the statistical methods in astrophysics are used to test models and theories; in the genome work, in a sense, the same methods are being used to explore the genome first and then to build models and theories.
"Biologists have a lot of work ahead in explaining how the genome became structured, much more than we have to do in cosmology to explain how galaxies are distributed," Frieman said. "So if anything surprised me, it's that in some sense we understand a lot more about how galaxies are formed and distributed in space than we understand about the distribution of genes in our own bodies."
But research builds knowledge, knowledge offers possibilities, and possibilities, especially in medicine, offer hope.
"I don't know the field well enough to appreciate all the things that might be done," Zehavi said. "But if we find gene clusters that have significance in this disease, I really hope this could develop into a new and unique tool to use in identifying gene clusters in other diseases. That's why I found this so rewarding. It's something that could potentially benefit humanity."
Last modified 3/21/2003