This week researchers from the Broad Institute at MIT and Harvard unveiled a public database of genomic signatures for 164 drugs and bioactive compounds in human cells. In a paper published in today’s Science, the group, led by Todd Golub and Eric Lander, dubs the data set the “connectivity map,” because the data, along with new pattern-matching software, allow researchers to make connections among genes, diseases, and drugs. Of particular interest, they used the information to show that the gene expression profile associated with an inhibitor of amyloid-β (Aβ) fibril formation was the mirror image of the expression signature of human Alzheimer disease brain. The map, which Golub and coworkers hope to expand to cover more compounds and more cell types, should facilitate the discovery and development of new drugs for many conditions.

In other big-picture studies out this week, one paper describes an interesting pyramid scheme: that is the shape of transcription factor regulatory networks as analyzed by Mark Gerstein and Haiyuan Yu of Yale. This week also saw the public release of a huge genome-wide SNP analysis of Parkinson disease from Andrew Singleton and John Hardy at the NIH, and a host of collaborators.

The ultimate goal of the Broad group is to produce a large public database of gene signatures that can be queried to find common patterns associated with diseases, drugs, and physiological processes. To start, the researchers used gene expression in cultured human cells as a model system. Since the variations in cell types and treatment options are limitless, they settled on exposing an easily grown human breast cancer cell line (MCF7) to a selection of 164 different drugs and bioactive compounds for 6 to 12 hours. Over the course of a year, first author Justin Lamb and colleagues collected 564 sets of genome-wide mRNA expression profiles.

To generate an expression signature for each profile, they ranked the 22,000 genes measured in each sample according to how much their expression varied from control levels. Induced genes got a positive sign, repressed genes a negative one. This unit-less scoring system allows comparison of data across experimental platforms. To query the database, they took expression data from published reports, generated a similar signature, and compared it to every signature in the database, ranking all the database profiles by similarity. If the induced genes in the query appear near the top of a database profile and the repressed genes near the bottom, the researchers call that positive connectivity. If, on the other hand, the induced genes in the query appear near the bottom of a database profile and the repressed genes near the top, there is an inverse relationship between the states, which they call negative connectivity. By assigning each profile a connectivity score between +1 and -1, the researchers ranked the reference sets from the most strongly correlated to the most strongly anti-correlated.
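The ranking logic described above can be sketched in a few lines of Python. This is only an illustration, not the Broad group's software: it assumes a Kolmogorov-Smirnov-style enrichment statistic (the article itself says only that profiles are ranked and scored between +1 and -1), and the function names, gene labels, and toy profiles are all hypothetical.

```python
def ks_enrichment(ranked_genes, tag_genes):
    """KS-style statistic: positive if tag genes cluster near the top of the ranking,
    negative if they cluster near the bottom. (Assumed statistic, for illustration.)"""
    n = len(ranked_genes)
    position = {gene: i + 1 for i, gene in enumerate(ranked_genes)}  # 1-based ranks
    hits = sorted(position[g] for g in tag_genes if g in position)
    t = len(hits)
    if t == 0:
        return 0.0
    a = max((j + 1) / t - v / n for j, v in enumerate(hits))
    b = max(v / n - j / t for j, v in enumerate(hits))
    return a if a > b else -b


def raw_connectivity(ranked_genes, up_genes, down_genes):
    """Combine enrichment of the query's induced and repressed genes."""
    ks_up = ks_enrichment(ranked_genes, up_genes)
    ks_down = ks_enrichment(ranked_genes, down_genes)
    if ks_up * ks_down > 0:  # both gene sets drift the same way: no clear connection
        return 0.0
    return ks_up - ks_down


def connectivity_scores(reference_profiles, up_genes, down_genes):
    """Scale raw scores so the strongest match is +1 and the strongest inverse is -1."""
    raw = {name: raw_connectivity(ranking, up_genes, down_genes)
           for name, ranking in reference_profiles.items()}
    top, bottom = max(raw.values()), min(raw.values())
    scaled = {}
    for name, s in raw.items():
        if s > 0:
            scaled[name] = s / top
        elif s < 0:
            scaled[name] = -(s / bottom)
        else:
            scaled[name] = 0.0
    return scaled


# Toy reference profiles: genes ordered from most induced to most repressed.
genes = [f"g{i}" for i in range(1, 11)]
reference_profiles = {
    "same_direction": genes,
    "opposite_direction": genes[::-1],
    "mixed": ["g1", "g9", "g2", "g10", "g3", "g4", "g5", "g6", "g7", "g8"],
}
# Hypothetical query signature: g1 and g2 induced, g9 and g10 repressed.
scores = connectivity_scores(reference_profiles, {"g1", "g2"}, {"g9", "g10"})
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, score)
```

Because the score depends only on rank positions, not on measured intensities, a query signature from one array platform can be compared with reference profiles from another.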

To illustrate how the data can be used, the researchers showed several examples of possible queries and outcomes. Gene expression data from published studies on small molecules, such as histone deacetylase (HDAC) inhibitors, connected positively with database samples treated with other HDAC inhibitors. Query profiles of estrogen-treated cells connected positively to the estrogen-treated samples in the database, and negatively to an anti-estrogenic compound. The researchers also showed that they could detect similar gene profiles for a number of the common anti-psychotic phenothiazines, despite the fact that the gene expression data were not collected in neurons. Thus, connections could be made even when the query and the database used different cell types and drug concentrations. Two papers published concurrently in Cancer Cell show in more detail the use of the database to discover the mechanisms of action of a novel compound and to identify novel actions of known drugs (Hieronymus et al., 2006; Wei et al., 2006).

To look for connections to disease states, the group queried the database with profiles derived from two independent reports of gene expression changes in Alzheimer disease brains. Although the two profiles had no genes in common, they both yielded negative connectivity with profiles of cells treated with the compound 4,5-dianilinophthalimide (DAPH). DAPH was discovered in an in vitro screen for small molecules that could inhibit formation of Aβ fibrils, and analogs have been produced as potential AD treatments (see ARF related news story and Hennessy et al., 2005). This example suggests that the method could be used to discover new drug candidates based on disease-specific perturbations in gene expression in AD and in other neurological diseases.

There are some limitations to the method. The researchers did not find connectivity in their database with some dopamine receptor antagonists, because the test cells (MCF7) lack dopamine receptors. Also, as of yet there is no good way to judge the statistical significance of the connectivity scores.

The authors liken the new method to comparing DNA sequences: eventually, they would like to describe all biological states with transcriptional signatures. Calling the database the “first installment” of a reference collection of transcriptional profiles, they propose a community effort to expand the connectivity map to cover more cell types, more compounds, and more treatment variations. The current set of data is available on the Web, where users can query the resource with their own signatures.

Transcriptional profiles are set in place by the combined action of many transcription factors, and the question of how these networks are shaped is the subject of another paper out this week, this one in PNAS Early Edition. Haiyuan Yu and Mark Gerstein from Yale University in New Haven, Connecticut, looked at the gene regulatory networks in Escherichia coli and yeast, and found that transcription factor hierarchies take on a pyramid shape. By analyzing which factors regulate, and are regulated by, other factors, they found a few big bosses at the top who get inputs from protein-protein interactions with cell signaling molecules. The top dogs then control a layer of middle managers, who regulate a larger class of low-level factors. In the scheme, the middle managers directly control more targets than the top level. And while the top factors were the most influential, they were often dispensable, whereas the lowest level tended to be essential to cell survival. The authors speculate that this is because the top-level factors modulate pathways, while those at the bottom are solely responsible for turning on specific critical genes. The hierarchy resembles efficient corporate and government structures.
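One way to picture how such layers could be read off a network is a breadth-first walk from the bottom up. The sketch below is an assumption made for illustration, not necessarily the procedure Yu and Gerstein used: factors that regulate no other factor form level 1, and every other factor sits one level above the lowest factor it controls. The function name and the toy network are hypothetical.

```python
from collections import Counter, deque


def hierarchy_levels(tf_edges):
    """Assign each transcription factor a level, counting up from the bottom.

    tf_edges: iterable of (regulator, target) pairs, both transcription factors.
    Factors that can never reach a bottom-level factor (pure cycles) get no level.
    """
    targets_of, regulators_of = {}, {}
    for regulator, target in tf_edges:
        for tf in (regulator, target):
            targets_of.setdefault(tf, set())
            regulators_of.setdefault(tf, set())
        if regulator != target:  # ignore autoregulation
            targets_of[regulator].add(target)
            regulators_of[target].add(regulator)

    # Level 1: factors that regulate no other factor.
    level = {tf: 1 for tf, targets in targets_of.items() if not targets}
    queue = deque(level)
    while queue:  # breadth-first search upward through the regulators
        tf = queue.popleft()
        for regulator in regulators_of[tf]:
            if regulator not in level:
                level[regulator] = level[tf] + 1
                queue.append(regulator)
    return level


# Toy network: one boss, two middle managers, four low-level factors.
edges = [
    ("boss", "manager1"), ("boss", "manager2"),
    ("manager1", "worker1"), ("manager1", "worker2"),
    ("manager2", "worker3"), ("manager2", "worker4"), ("manager2", "worker1"),
]
levels = hierarchy_levels(edges)
layer_sizes = Counter(levels.values())
print(dict(layer_sizes))  # number of factors at each level, bottom (1) upward
```

In this toy example the layers shrink from four factors at the bottom to one at the top, which is the pyramid shape the paper reports for real networks.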

Last but not least, in other news of the big-picture type, a multi-institute group of researchers, headed up by Andrew Singleton at the National Institute on Aging in Bethesda, Maryland, have published the first genome-wide single nucleotide polymorphism (SNP) map for Parkinson disease. With samples from 267 PD patients and 270 normal controls, the researchers genotyped more than 408,000 unique SNPs, generating 220 million genotypes. The study, which appeared online this week in the Lancet Neurology, did not replicate the SNPs implicated in a smaller study published last year (Maraganore et al., 2005), and indeed the new data identified no SNPs strongly associated with the risk of PD. However, the study did identify loci that would be candidates for further examination in larger studies with additional populations. As the authors conclude, “These data suggest that there is no common genetic variant that exerts a large genetic risk for late-onset Parkinson’s disease in white North Americans. These data are now available for future mining and augmentation to identify common genetic variability that results in minor and moderate risk for the disease.” This genotyping data represents the largest public release of high-density SNP data outside of the International HapMap Project.—Pat McCaffrey


Comments on this content

  1. This study clearly demonstrates what we have thought all along, and what various large-scale approaches have already shown: that genomic data from reference standards of known mechanism or phenotype are vital in order to fully extract value out of novel expression patterns. Two of the major obstacles in realizing this value were the resources required to generate a sizable reference dataset and the ability to adequately compare results from alternative expression platforms, cell types, models, or species. With the introduction of a simple, non-parametric test, well-conserved expression changes can be compared. With the large public expression databases, such as the NCBI Gene Expression Omnibus and others, additional mechanistic insight should be systematically extracted from these datasets. What this capability brings to the field is the opportunity to use gene expression profiles to test hypotheses, rather than using them in an exploratory mode, or in fishing expeditions, as they are cynically referred to.

    As highlighted in this paper, many gems have already been found. However, care must be taken, as many misleading connections are likely to be made, particularly as the size of the connectivity map grows. It’s not clear from the paper how many misleading or incorrect connections were identified by the approach, particularly given the lack of any probabilistic approach to evaluate potential connections. With further development of statistical methods, these errors should be reduced, but not entirely eliminated. A more rigorous approach using supervised classification models has already been developed to overcome these limitations and provide greater classification accuracy than ranking methods (Natsoulis et al., 2005). These methods should prove of greater value in dissecting diagnostic signatures of drug action, pathology, or disease states.

    As suggested by the authors, a more comprehensive database composed of more cell types should broaden the scope of the connectivity map for capturing more mechanisms that may be context-dependent. However, while cell-based models are higher throughput and more cost-effective than in vivo models, single cells won’t best represent complex pathologies or disease states that encompass the interaction between multiple cell types or organs, so we should be cautiously optimistic about the scope of what can be identified with such an in vitro connectivity map. I think the greatest value will be in understanding drug action at the molecular level—a laudable goal. However, predicting complex phenotypes, such as adverse side effects in humans, should be approached with caution and a weight-of-evidence approach.


    Natsoulis et al. Classification of a large microarray data set: algorithm comparison and analysis of drug signatures. Genome Res. 2005 May;15(5):724-36. PubMed.


News Citations

  1. Focus on Aβ Fibrils: Targeting β-sheets and Foiling Them

Paper Citations

  1. Hennessy et al. Synthesis of 4,5-dianilinophthalimide and related analogues for potential treatment of Alzheimer's disease via palladium-catalyzed amination. J Org Chem. 2005 Sep 2;70(18):7371-5. PubMed.
  2. Maraganore et al. High-resolution whole-genome association study of Parkinson disease. Am J Hum Genet. 2005 Nov;77(5):685-93. PubMed.

Primary Papers

  1. Lamb et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006 Sep 29;313(5795):1929-35. PubMed.
  2. Yu and Gerstein. Genomic analysis of the hierarchical structure of regulatory networks. Proc Natl Acad Sci U S A. 2006 Oct 3;103(40):14724-31. PubMed.
  3. Fung et al. Genome-wide genotyping in Parkinson's disease and neurologically normal controls: first stage analysis and public release of data. Lancet Neurol. 2006 Nov;5(11):911-6. PubMed.
  4. Hieronymus et al. Gene expression signature-based chemical genomic prediction identifies a novel class of HSP90 pathway modulators. Cancer Cell. 2006 Oct;10(4):321-30. PubMed.
  5. Wei et al. Gene expression-based chemical genomics identifies rapamycin as a modulator of MCL1 and glucocorticoid resistance. Cancer Cell. 2006 Oct;10(4):331-42. PubMed.