Genomics, transcriptomics, proteomics: big words describing huge, growing sets of data. The challenge, however, lies not so much in adding more and more to these databases, but in extracting from them the information that can be used to explain our complex biology and what makes it sometimes go awry.

In today’s PNAS early online edition, researchers from the labs of John Rioux and Eric Lander at the Whitehead Institute/MIT, together with colleagues from Quebec, Canada, and Odense, Denmark, have done just that. They have combined analysis of several databases to reveal a single gene responsible for a debilitating illness, Leigh syndrome French Canadian type (LSFC). In this disease, a deficiency in cytochrome c oxidase (COX) leads to neurodegeneration in the brainstem and basal ganglia, and often to acute metabolic acidosis and coma. Being the final electron carrier in the respiratory chain, COX is absolutely required to keep oxidative phosphorylation, and us, alive and kicking.

Previous work from the principal authors' labs had mapped the candidate gene for LSFC to an approximately two-million-base-pair region on the short arm of chromosome two. In the present work, the integrative approach of first author Vamsi Mootha et al. combined whole genome analysis of this region with analysis of mitochondrial protein and mRNA databases.

First, by analyzing genome databases, Mootha et al. found 15 distinct genes in the LSFC stretch of chromosome two, but none of them had any known connection to mitochondrial function. The authors then examined each of these genes using large-scale mRNA expression data sets. They found that the expression profile of one of them, LRPPRC, closely matched that of transcripts for mitochondrial proteins, suggesting it may be a likely candidate for the LSFC gene. The icing on the cake came when Mootha et al. generated a proteome database of mitochondrial peptides by mass spectrometry and cross-referenced it to the LSFC gene region. All the mitochondrial peptides that could be coded for in this region of the genome mapped to the same gene, LRPPRC.
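
The cross-referencing step can be pictured as a tally of peptide hits per annotated gene in the candidate region. The following is a minimal Python sketch, assuming the peptides have already been mapped to positions in the interval. The gene names other than LRPPRC, all coordinates, and the peptide sequences are invented for illustration and are not taken from the paper.

```python
from collections import Counter

# Hypothetical gene annotations inside a candidate interval: (name, start, end).
# Coordinates and every name except LRPPRC are made up for illustration.
GENES = [
    ("GENE_A", 1_000, 5_000),
    ("LRPPRC", 8_000, 20_000),
    ("GENE_B", 25_000, 30_000),
]

# Hypothetical peptide hits: (peptide sequence, position it maps to).
PEPTIDE_HITS = [
    ("LSEEAVR", 8_500),
    ("TVLDQQK", 12_300),
    ("AGYPQYVSEILEK", 19_900),
    ("GVDTLPK", 22_000),  # falls between genes, so it supports none of them
]


def peptides_per_gene(genes, hits):
    """Count how many peptide hits fall inside each gene's coordinates."""
    counts = Counter()
    for _peptide, pos in hits:
        for name, start, end in genes:
            if start <= pos <= end:
                counts[name] += 1
    return counts


support = peptides_per_gene(GENES, PEPTIDE_HITS)
print(support.most_common())
```

In this toy data every peptide that lands in a gene lands in the same one, which is the pattern the article describes for the real interval.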

To confirm that this gene is indeed responsible for Leigh syndrome, Mootha et al. performed sequencing analysis of patient, carrier, and control DNA samples and found two mutations in LRPPRC: a single-base-pair transition leading to an alanine-to-valine substitution in exon 9, and an eight-nucleotide deletion in exon 35 leading to truncation of the protein.

While the power of this integrative approach is obvious in this case, there may be shortcomings in applying it to Alzheimer's and other diseases (see comments below by Vamsi Mootha, Gerard Drewes, and Stephen Ginsberg).—Tom Fagan


  1. Extended Author Comment by Vamsi Mootha

    In our manuscript we have provided a framework for prioritizing candidate genes that lie within a region implicated by linkage or association. Conceptually, the strategy attempts to systematically relate disease features with gene properties, relying on the richness of large-scale experimental datasets, without the need for patient biopsy specimens. The approach is very Bayesian, in the sense that it begins with an input hypothesis about what pathways or processes are putatively involved in the clinical syndrome, and scans the available functional genomics datasets for other genes likely to be in these pathways/processes. As we illustrate in our paper, the integrative approach can really accelerate human disease gene discovery.

    In the proof of principle, we considered Leigh Syndrome French Canadian variant, an autosomal recessive disease that was mapped to a 2 Mb region, with clinical features suggestive of involvement by mitochondria. We developed a technique called "neighborhood analysis" that scans large-scale microarray datasets and ranks all genes (most of which are of previously unknown function) on their expression similarity to a target pathway. In our case, this target pathway was the list of well-characterized nuclear-encoded mitochondrial genes. We also developed a strategy to map tandem mass spectra data to genomic intervals, using proteomics data relevant to the target pathway. Again, we used data from an ongoing organelle proteomics project to score the entire disease interval for protein evidence for mitochondrial involvement. Most of these functional genomics datasets are noisy: they suffer from incomplete coverage and from low sensitivity and specificity. However, we found that when all these datasets were combined, they consistently pointed to the LRPPRC gene. Confident this was the disease gene, we systematically screened all 38 exons and discovered the underlying mutations.
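
One simple way to picture a neighborhood-style ranking is to count, for each gene, how many members of the target set sit among its most similar expression profiles. The Python sketch below is an illustration only: the similarity measure (Pearson correlation), the neighborhood size `k`, the function names, and the toy expression values are all assumptions, since the comment does not specify the metric or parameters used in the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def neighborhood_index(expr, target, k):
    """For each gene, count target-set members among its k most correlated genes."""
    scores = {}
    for gene, profile in expr.items():
        others = [(pearson(profile, p), g) for g, p in expr.items() if g != gene]
        others.sort(reverse=True)  # most similar profiles first
        scores[gene] = sum(1 for _r, g in others[:k] if g in target)
    return scores


# Toy expression matrix (gene -> values across four conditions); all invented.
EXPR = {
    "MITO_1": [1.0, 2.0, 3.0, 4.0],
    "MITO_2": [1.1, 2.1, 2.9, 4.2],
    "MITO_3": [0.9, 1.8, 3.2, 3.9],
    "CAND_X": [1.0, 2.2, 3.1, 4.1],   # tracks the target genes
    "CAND_Y": [4.0, 3.0, 2.0, 1.0],   # opposite trend
    "OTHER_1": [3.9, 3.1, 2.2, 0.8],
    "OTHER_2": [4.2, 2.9, 1.9, 1.1],
}
TARGET = {"MITO_1", "MITO_2", "MITO_3"}  # stand-in for the known pathway genes

scores = neighborhood_index(EXPR, TARGET, k=2)
ranked = sorted((g for g in EXPR if g not in TARGET), key=lambda g: -scores[g])
print(ranked[0], scores[ranked[0]])
```

Genes of unknown function that score highly, like `CAND_X` here, would be the ones worth checking against the disease interval.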

    I should emphasize that our result represents a proof of principle. We selected a disease for which there was a well-defined underlying pathway and that was mapped to a relatively small region (5.1 cM). For many diseases, there may not be a clear underlying process and the disease interval may be extremely large. Furthermore, we were fortunate that available microarray datasets provided some coverage of our disease interval, and that we had organelle proteomics data. But I strongly believe that with the growing wealth of functional genomics data, these types of approaches will become increasingly powerful.

    I could imagine creating a target gene set called "neurodegeneration-related genes" in which a team of experts carefully collate a set of characterized genes involved in neurodegeneration (e.g., genes involved in Mendelian ND syndromes, members of ROS homeostasis pathway, mitochondrial complex I, presenilins, tau proteins, protein degradation pathways, apoptosis, etc.). Then, we could apply neighborhood analysis to the entire universe of microarray data to rank all genes based on expression similarity to "neurodegeneration-related genes." Furthermore, I could imagine querying yeast protein interaction network datasets to identify poorly characterized genes whose yeast orthologues physically interact with yeast orthologues of "neurodegeneration-related genes." The results of these scans can then be intersected with genomic intervals implicated by linkage or association to spotlight top candidate genes. Genes appearing in all lists would have more "evidence" for involvement in neurodegeneration and can then be subjected to an association study or to mutation screening.
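
The intersection step described above can be sketched as a tally of how many independent scans flag each gene in a linked interval. This is a hedged Python illustration: the gene names, the three evidence lists, and the function `rank_by_evidence` are hypothetical, and real scans would produce ranked scores rather than simple membership lists.

```python
def rank_by_evidence(interval_genes, evidence_lists):
    """Count, for each gene in the interval, how many evidence lists include it."""
    tally = {
        gene: sum(gene in genes for genes in evidence_lists.values())
        for gene in interval_genes
    }
    # Most-supported genes first; ties broken alphabetically for stable output.
    return sorted(tally.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical genes inside an interval implicated by linkage or association.
INTERVAL = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}

# Hypothetical outputs of three separate scans against a target gene set.
EVIDENCE = {
    "expression_neighbors": {"GENE_B", "GENE_C", "GENE_Z"},
    "yeast_interaction_orthologs": {"GENE_B", "GENE_D"},
    "proteomics": {"GENE_B", "GENE_C"},
}

ranking = rank_by_evidence(INTERVAL, EVIDENCE)
top_candidates = [gene for gene, n in ranking if n == len(EVIDENCE)]
print(ranking)
print(top_candidates)
```

Genes outside the interval, such as `GENE_Z`, drop out regardless of their scores, which is how the linkage data narrows the functional evidence.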

    What are the current limitations for applying the approach to AD or ND? Well, the integrative approach relies fundamentally on three inputs: (1) a putative hypothesis about the underlying disease pathways/processes, (2) genomic intervals implicated by linkage or association, and (3) large-scale experimental datasets.
    AD and ND are well-studied diseases, and I feel that there is a substantial body of literature that has already identified some key genes and pathways likely to contribute. So creating a "neurodegeneration-related gene set" should be possible today.

    For the common versions of these diseases, the linkage peaks are relatively large, often containing hundreds of genes. However, future SNP or haplotype-based association studies in concert with larger sample sets will improve this resolution.
    At present, functional genomics datasets are relatively sparse. Novartis, RIKEN, and the Whitehead have made some very nice, large microarray datasets available to the public (which we used in our work), but these are primarily tissue compendia. We need higher-density RNA-, protein-, and metabolite-based "snapshots" of cells in action during development, in response to environmental stimuli, during aging, etc. The richness of these datasets defines the functional resolving power of the integrative genomics approach. It would be terrific if more pharmaceutical companies made their functional genomics datasets publicly available. The NIH has sponsored some large projects to generate large-scale datasets for specific diseases, such as cardiovascular disease and diabetes.

    It would be great to apply the approach to existing AD genomic intervals, using even the existing functional genomics datasets. We’ll know if the approach works after we’ve tried.

  2. Lander and colleagues combine data from genomics, transcriptomics, and proteomics to find the gene causing LSFC, a rare monogenic disease. In this disease, mitochondrial function is compromised. The emerging candidate gene (i) is one of about 50 genes within the affected locus, (ii) is co-expressed with known mitochondrial genes, and (iii) encodes a product that was detected by proteomic analysis of purified mitochondria. Resequencing of the gene in affected patients confirmed the presence of mutations. Together with the recent paper by Perez-Iratxeta et al., 2002, describing computational approaches to pinpointing disease genes, this one is a landmark paper, in that it provides a fast alternative to the tedious procedure of positional cloning. Hopefully, similar holistic approaches will also be helpful in selecting candidate genes in more complicated polygenic diseases like AD or Parkinson's.


    Perez-Iratxeta et al. Association of genes to genetically inherited diseases using data mining. Nat Genet. 2002 Jul;31(3):316-9. PubMed.

  3. This convincing report provides integrated DNA, RNA, and protein data that are used for disease gene discovery in a rare form of a mitochondrial disorder named Leigh syndrome, an autosomal recessive cytochrome c oxidase deficiency. Specifically, the authors combine state-of-the-art computational genetics, genomics, and proteomics-based approaches to uncover two mutations in a gene termed LRPPRC (leucine-rich pentatricopeptide repeat-containing protein) from the French-Canadian variant of Leigh syndrome (LSFC). Clinical manifestations of Leigh syndrome and LSFC include neurodegeneration, metabolic acidosis, and coma (Merante et al., 1993; Morin et al., 1993; Rahman et al., 1996). Children afflicted with Leigh syndrome have pronounced developmental delay and a mean life expectancy of 3-5 years. Leigh syndrome affects approximately 1/40,000 newborn infants in the worldwide population, whereas LSFC found in the remote Saguenay-Lac St-Jean region of Quebec occurs in approximately 1/2,200 live births. This research group previously mapped LSFC to chromosome 2p16-17 (Lee et al., 2001) via genome-wide association linkage studies, and has elegantly applied this information in concert with cDNA microarray analysis and mitochondrial proteomics to identify candidate gene(s) implicated in LSFC. Essentially, the paradigm that the group employed was to analyze and integrate three distinct data sets. The first data set contained human genome sequence information with gene annotations for the LSFC candidate region defined by microsatellite markers at 68.2 and 73.3 centimorgans on chromosome 2. Searching several databases yielded the detection of fifteen non-overlapping gene candidates; another fifteen non-overlapping predictions of genes were found using Genscan.

    The second data set comprised cDNA microarray analyses from four public mRNA expression databases, whereby the authors generated a mitochondrial neighborhood index to identify clusters of candidate mitochondrial genes, since LSFC disease genes are likely to be mitochondrial or at least related to mitochondrial function and/or regulation. The third data set consisted of tandem mass spectrometry (MS/MS) on mitochondria extracted from human HepG2 cells to create a mitochondrial database to probe for disease-related genes.

    By combining and integrating these approaches, the group consistently observed the LRPPRC gene as a potential target. Proof of concept was determined by sequencing the LRPPRC gene via PCR using subjects with LSFC and normal controls. These endeavors culminated in the detection of an A354V missense mutation in exon 9 (21/22 patients studied were homozygous for the mutation; 1 was heterozygous). The heterozygous subject was also found to have an eight nucleotide deletion in exon 35 that resulted in a premature stop codon at amino acid 1,277. Thus, the combination of these powerful informatic and genomic/proteomic techniques provided a means to identify a disease gene in a rare form of an autosomal recessive mitochondrial abnormality.

    I commend the authors for the tremendous amount of effort and resources that no doubt went into this project. It is also important to recognize that this paradigm may be a useful strategy to identify disease genes or disease-related genes in other human disorders. The global view and integrative power of these types of experiments should not be underestimated. In addition, human disease samples were not necessary until after the majority of the integrative studies were completed. Whether or not this type of approach can be applied to a complex, progressive late-onset disorder such as Alzheimer’s disease is yet to be determined. One may hypothesize that sporadic AD may prove to be extremely difficult due to its complex multigenic and/or environmental nature. However, forms of familial AD that are not related to the known mutations in APP, PS1, and PS2 may be good candidates for this type of integrative approach. In conclusion, studies such as this report by Mootha et al. exemplify new directions that are on the horizon due to technical and computational breakthroughs in genomic and proteomic applications.


    Lee et al. A genomewide linkage-disequilibrium scan localizes the Saguenay-Lac-Saint-Jean cytochrome oxidase deficiency to 2p16. Am J Hum Genet. 2001 Feb;68(2):397-409. PubMed.

    Merante et al. A biochemically distinct form of cytochrome oxidase (COX) deficiency in the Saguenay-Lac-Saint-Jean region of Quebec. Am J Hum Genet. 1993 Aug;53(2):481-7. PubMed.

    Morin et al. Clinical, metabolic, and genetic aspects of cytochrome C oxidase deficiency in Saguenay-Lac-Saint-Jean. Am J Hum Genet. 1993 Aug;53(2):488-96. PubMed.

    Rahman et al. Leigh syndrome: clinical features and biochemical and DNA abnormalities. Ann Neurol. 1996 Mar;39(3):343-51. PubMed.


Primary Papers

  1. Mootha et al. Identification of a gene causing human cytochrome c oxidase deficiency by integrative genomics. Proc Natl Acad Sci U S A. 2003 Jan 21;100(2):605-10. PubMed.