The crunching of masses of data has changed how people search the Internet, make friends, and purchase everything from soup to nuts. Now, the “big data” approach is poised to overhaul medicine, according to attendees at the Meaningful Use of Complex Medical Data, or MUCMD, conference, held 10-12 August 2012 at the Children’s Hospital Los Angeles in California. The name refers, tongue in cheek, to U.S. government financial incentives for doctors who ditch paper patient records in favor of electronic versions. Only if the Centers for Medicare and Medicaid Services (CMS) deems e-health record use “meaningful” will it reward doctors with up to $44,000 over five years.

MUCMD brought together all sorts of data aficionados, including statisticians, business people, clinicians, and engineers. While they did not specifically address any challenges in AD research or care, they discussed the potential application of medical data and the key obstacles in accomplishing their goals. Attendees would like to see medical recommendations based on hard data, rather than on what presenter Kenneth Mandl of Children’s Hospital Boston called “BOGSAT”—a Bunch Of Guys Sitting Around a Table.

The data are already out there—every doctor’s visit or hospital stay could be considered part of an experiment. What is missing, frequently, is a mechanism to access and analyze the data. And unlike controlled experiments, medical records provide a complex mélange of numerical and observational data, recorded at irregular intervals. The challenge is to sift through masses of records to find useful, actionable information. Computer scientists are still working on the best way to do that sifting. At the meeting’s hack night, programmers got a chance to test their ideas with a handful of datasets.
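One simple way to tame the irregular recording intervals mentioned above is to bin measurements onto a regular time grid before analysis. The following is a minimal sketch, not a method presented at the meeting: the function name `resample_hourly`, the hourly grid, the carry-forward rule for gaps, and the sample heart-rate readings are all assumptions made for illustration.

```python
def resample_hourly(readings, start_hour, end_hour):
    """Average readings within each hour; carry the last known value over gaps.

    `readings` is a list of (hours_since_admission, value) pairs recorded at
    irregular times. Hours before the first reading are reported as None.
    """
    by_hour = {}
    for t, value in readings:
        # Group each reading under the whole hour it falls in.
        by_hour.setdefault(int(t), []).append(value)

    series, last = [], None
    for hour in range(start_hour, end_hour):
        if hour in by_hour:
            last = sum(by_hour[hour]) / len(by_hour[hour])
        # If the hour has no reading, reuse the most recent average.
        series.append(last)
    return series


# Hypothetical heart-rate readings: two in the first hour, then a long gap.
readings = [(0.2, 98.0), (0.7, 100.0), (3.5, 110.0)]
hourly = resample_hourly(readings, 0, 5)
print(hourly)
```

Carrying the last value forward is only one possible choice; interpolation or an explicit "missing" marker may suit some analyses better, since a gap in a medical record can itself be informative.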

What might the data analysis yield? For starters, a clearer understanding of the full set of traits associated with a given disease, how much these vary, and how symptoms progress. It might help doctors discover biomarkers or risk factors for a variety of disorders, including neurodegenerative diseases. In the hospital, it could predict when a patient is about to suffer sepsis or another adverse event, allowing doctors to catch and treat the problem earlier. On a personal level, data might help individuals with diabetes understand why their blood sugar dips at certain points in the day. The possibilities are wide open—researchers are still speculating about what kinds of new correlations they might discover once they process the enormous datasets.

Hacker’s Delight
First, computer scientists need big databases. In neurodegenerative research, there are many organizations building such repositories. The Alzheimer’s Disease Neuroimaging Initiative (see ARF related series) and the Parkinson’s Progression Markers Initiative (see ARF related news story) gather brain images and biological fluids from people with or at risk for AD and PD, respectively. The National Alzheimer’s Coordinating Center has amassed longitudinal records from more than 25,000 people, and recently started assessments for frontotemporal dementia as well (see ARF related news story). Records from those who inherited an AD-linked gene are part of the Dominantly Inherited Alzheimer Network (see ARF related news story).

Other groups are collecting datasets to gain more information from completed clinical trials. The Northeast Amyotrophic Lateral Sclerosis Consortium and Prize4Life, a nonprofit fighting the disease, put together the Pooled Open Resource-Access ALS Clinical Trials database (see ARF related news story). Similarly, the Coalition Against Major Diseases developed the C-Path Online Data Repository, which is freely available and includes placebo results as well as some treatment data (see ARF related news story).

In general, however, obtaining access to medical datasets remains a serious challenge, particularly when the records were not originally collected as part of a study, but simply in the course of medical practice. MUCMD presenter Peter Szolovits of the Massachusetts Institute of Technology noted that it once took him the better part of a year to navigate the committees and institutional review boards that had to approve access. MUCMD organizers gave attendees the opportunity to wade through real-life data during the hack night. Over pizza, beer, and soda, approximately 15 participants passed around datasets on computer memory sticks (plus the associated data use agreements). They argued and scribbled their ideas on whiteboards until the lights went off at 11 p.m.

A handful of datasets were available to play with. One included records, scrubbed of anything that could identify the patients, from a diabetes prediction challenge sponsored by Practice Fusion, an electronic health record provider in San Francisco, California. The goal of the competition is to put together an algorithm to classify people as diabetic or not, “giving doctors a picture of what is the characteristic diabetic patient,” said Jake Marcus, a data scientist at Practice Fusion, in an interview with Alzforum. He hopes this kind of analysis would give doctors a fuller understanding of the traits, risks, and complications associated with diabetes.

Data-based competitions such as Practice Fusion’s are a way to “plant the seed” of ideas and new questions for working with large databases, Marcus said. Neurodegenerative disease also has a place here; Prize4Life (which funds this reporter’s position) is offering $25,000 for an algorithm that predicts the progression of amyotrophic lateral sclerosis.

Making Data Meaningful
Scientists still grapple with how best to mine these databases. Many people put air quotes around the CMS guideline “meaningful” because no one knows precisely what that means, said David Kale, a computer scientist at Children’s Hospital and one of the conference organizers. At MUCMD, “we are talking about the real ‘meaningful’ use,” Kale quipped in an interview with Alzforum, meaning not just putting data into storage but analyzing them and getting something in return.

To understand the potential of big data, think about the Framingham Heart Study, suggested Marcus. The study began with 5,209 people in the Massachusetts town and has tracked them and their descendants for six decades thus far; its dataset has made major contributions to the study of heart disease and many other conditions. “Now,” Marcus told Alzforum, “imagine following 300 million people.” For example, early signs of Alzheimer’s might emerge from appropriate data analyzed in the right way.

Kale told Alzforum that the promise of big data is getting a lot of hype, but “there is still a whole lot of work to do.” In commerce and finance, big data has already made a difference. For instance, online retailers use masses of data on what people buy to recommend books and other products a customer might want to purchase. The hope is that collecting and crunching data could be just as informative in medicine. Kale imagines doctors using large datasets to come up with hypotheses about puzzling cases. Instead of brainstorming with individual colleagues about a patient, which takes time, a physician could log on to a large database and access the histories of 100 similar cases. “You immediately get a cohort,” Kale explained. A computer program, armed with that dataset, might suggest tests to run or diagnoses to consider.
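The idea of pulling up a cohort of similar cases can be sketched as a nearest-neighbor lookup. This is an illustrative assumption, not a system Kale described: the function `find_cohort`, the Euclidean distance measure, the two pre-scaled features, and the four made-up patient records are all hypothetical.

```python
import math


def find_cohort(query, records, k=100):
    """Return the k records whose features lie closest to the query patient."""

    def distance(a, b):
        # Euclidean distance over the query's numeric features.
        return math.sqrt(sum((a[f] - b[f]) ** 2 for f in a))

    ranked = sorted(records, key=lambda r: distance(query, r["features"]))
    return ranked[:k]


# Made-up records; features are assumed to be scaled to comparable ranges.
records = [
    {"id": "p1", "features": {"age": 0.70, "glucose": 0.90}},
    {"id": "p2", "features": {"age": 0.20, "glucose": 0.30}},
    {"id": "p3", "features": {"age": 0.65, "glucose": 0.85}},
    {"id": "p4", "features": {"age": 0.90, "glucose": 0.10}},
]
query = {"age": 0.68, "glucose": 0.88}

cohort = find_cohort(query, records, k=2)
cohort_ids = [r["id"] for r in cohort]
print(cohort_ids)
```

Real health records would need far more care than this toy example suggests, including a clinically sensible definition of "similar," handling of missing values, and a faster search than sorting every record.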

MUCMD attendees are taking it on faith, somewhat, that big data will make a difference for physicians and patients as it has for banks and retailers, Kale said. Many of the hoped-for outcomes—better treatment strategies and the like—remain unproven. Before physicians and patients see the results, medical data miners need to sort out how to obtain, store, access, and analyze the highly complex data stored in health records. For more on the successes and difficulties discussed at the meeting, see Part 2.—Amber Dance.

This is Part 1 of a two-part story. See also Part 2.



News Citations

  1. As ADNI Turns Four, $64 Million Data Start Rolling In
  2. PPMI: Parkinson's Field’s Answer to ADNI
  3. DIAN: Registry for eFAD to Chart Alzheimer’s Preclinical Decade
  4. NEALS: Sharing Among ALS Researchers and Participants
  5. DC: Shared Pain Is Lessened—Open-Trial Data Gain AD Model
  6. Big Data Present Big Challenges for Researchers

External Citations

  1. Meaningful Use of Complex Medical Data
  2. Alzheimer’s Disease Neuroimaging Initiative
  3. Parkinson’s Progression Markers Initiative
  4. National Alzheimer’s Coordinating Center
  5. longitudinal records
  6. Dominantly Inherited Alzheimer Network
  7. Northeast Amyotrophic Lateral Sclerosis Consortium
  8. Prize4Life
  9. Pooled Open Resource-Access ALS Clinical Trials database
  10. Coalition Against Major Diseases
  11. C-Path Online Data Repository
  12. diabetes prediction challenge
  13. offering $25,000

Further Reading


  1. Multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data. Neuroimage. 2012 Jul 2;61(3):622-32. PubMed.
  2. Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm. Nat Biotechnol. 2011 May;29(5):411-4. PubMed.
  3. Passive case-finding for Alzheimer's disease and dementia in two U.S. communities. Alzheimers Dement. 2011 Jan;7(1):53-60. PubMed.
  4. Customization of normal data base specific for 3-tesla MRI is mandatory in VSRAD analysis. Radiol Phys Technol. 2008 Jul;1(2):196-200. PubMed.