This is Part 2 of a three-part series. See also Parts 1 and 3. Read the entire series [.pdf].
19 November 2009. The first generation of CSF biomarkers for Alzheimer disease comprises those that measure the component proteins of the disease’s signature pathologic lesions. They have been explored for almost 15 years in some 40 published studies, and the broad consensus is that they basically work. “There are more and more data showing that Aβ42 and tau are good diagnostic and predictive markers to identify AD very early,” said Kaj Blennow of Sahlgrenska University Hospital near Göteborg, Sweden. The catch is that most of these studies are done at single centers on patients from the same region, and all analyses for a given paper are typically done on one batch of assay kits. When scientists compare different centers doing things their way, they see a large variation. Even at a single site, variation occurs, such that internal controls run alongside the study samples can show a marked drift over time, Blennow said.
For a single site’s day-to-day clinical testing, the situation is quite workable, scientists interviewed for this article agreed. “If a person comes in and gets a lumbar puncture, we know nearly for certain based on this CSF measure whether this person has AD or not,” said Niek Verwey of VU University Medical Center in Amsterdam, the Netherlands. “Within our hospital the test is very good, and within some other hospitals it is also very good, but when you compare one hospital to another, it is not good anymore. And that is a serious problem.”
That sites vary in CSF measurements already came up in an early meta-analysis (Sunderland et al., 2003). Since then several groups have directly compared test performance in more depth, and their work has revealed that variation comes from many sources. For example, a three-country survey by Jens Wiltfang, then at the University of Erlangen-Nürnberg, Germany, noted that the variation within a given assay (i.e., the same assay run today and again tomorrow on the same samples) was low, as specified by the manufacturer, ranging from 3 to 7.5 percent for the Aβ42, total tau, and p-tau assays. But between labs, the variation was 29, 26, and 27 percent, respectively (Lewczuk et al., 2006). The Amsterdam group led by Rien Blankenstein ran a larger comparison of 13 centers in 2004 and 18 centers in 2008. The Dutch scientists sent the same test samples out to participating labs across Europe and the U.S., and each lab analyzed the samples with the assays it uses on site for its own clinical and research purposes. The results were concerning. As the centers gained more experience with CSF testing between 2004 and 2008, variation in tau results did improve somewhat, from 21 to 16 percent; but for Aβ42, between-center variation widened from 31 to 37 percent. That was partly because centers use different assays. When the scientists restricted their analysis to the most widely used test among the 18 centers, the Innotest ELISA from the Belgian company InnoGenetics, mean variation improved from 30 percent in 2004 to 22 percent in 2008, thanks to some standardization. However, that is still too high for large-scale multicenter and drug studies, said Verwey. In addition, this study showed that variation is about as large within a given center as among different centers (Verwey et al., 2009).
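To make the percentages above concrete: between-lab variation of this kind is commonly expressed as a coefficient of variation (CV), the standard deviation of the labs' readings on one shared sample divided by their mean. The article does not say which statistic each cited study used, so the sketch below is only an illustration under that assumption. The function name, the lab labels, and the readings are all hypothetical and are not data from any of the studies.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation: sample standard deviation as a percentage of the mean."""
    if len(values) < 2:
        raise ValueError("need at least two measurements")
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero, so the CV is undefined")
    return 100.0 * stdev(values) / m


# Hypothetical Abeta42 readings (ng/L) for ONE shared CSF sample measured at five labs.
# These numbers are invented for illustration only.
readings_by_lab = {
    "lab_a": 500.0,
    "lab_b": 400.0,
    "lab_c": 450.0,
    "lab_d": 550.0,
    "lab_e": 600.0,
}

between_lab_cv = cv_percent(list(readings_by_lab.values()))
print(f"between-lab CV: {between_lab_cv:.1f}%")
```

The same function applied to repeat runs of one sample at a single lab would give the within-lab figure, which is how the two kinds of variation can be set side by side.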
Leslie Shaw of University of Pennsylvania Medical Center in Philadelphia has led an international comparison among seven sites as part of quality control for ADNI1. This round robin found significantly lower variation of the test used in ADNI1, near the needed 10 percent range (Shaw et al., 2009; ARF related ADNI story), and this is now generally viewed to be the CSF measurement variation in ADNI1. In this round robin, assay kits were shipped to the participating labs along with CSF samples, meaning the seven labs were using not only the same test, but also kits from the same production batch, said Blennow. This captures analytical variation arising from how different labs actually perform the test, but it misses differences between one batch of a given test and the next; nor does it capture differences among the different types of assay that are routinely used in different cities. Shaw’s study, as well as a report by a German group on a European pilot study of the ADNI protocol, has created some confidence that variation in multicenter studies may be controllable if participating centers ship their study samples to a central reference lab that is highly versed in standardized procedures (Buerger et al., 2009). However, pharma researchers have cautioned that for a CSF test to gain regulatory approval, generally speaking it must perform robustly in many routine settings.
The Three Sources of Variation
“If I am an AD patient and I go to Amsterdam and have a lumbar puncture, my Aβ is 500. If instead I go to London, it is 400. It will be different again if I go to Boston and again in Chicago. Why is that?” asked Verwey.
The reasons fall into three categories: the samples, the analysis, and the assays, scientists said. The first concerns how samples are collected and handled prior to actual protein measurement. This can vary in numerous ways: from whether the sample is frozen and thawed multiple times before being analyzed or is analyzed after a single freeze-thaw cycle, to when and how it is centrifuged, how long it is stored, down to what kind of containers hold the CSF, and more fine points like this. Scientists showed that Aβ sticks to polystyrene tubes, necessitating the use of polypropylene tubes. These seemingly persnickety details can jinx the protein measurement (Schoonenboom et al., 2005), and ideally each step should be performed in exactly the same way at every participating site. Another source of variation comes from what time of day people undergo the spinal tap and whether they have eaten. For example, research has shown recently that CSF Aβ levels fluctuate over the course of the day in healthy people, though that pattern seems to wane in AD (Bateman et al., 2007); early morning spinal taps control for this source of variation.
Sample collection and handling have been an active topic of discussion for some time (e.g., see reports of Antecedent Biomarker Working Group). The issue has caused its share of hiccups, such as prompting a re-run of baseline in ADNI1 (see ARF related news story). However, by now many of the kinks appear to have been largely ironed out, and the protocol to be used for the QC initiative reflects best practices, said Maria Carrillo of the Alzheimer’s Association in Chicago. A separate kind of variation even upstream of sample handling stems from clinical differences among centers, for example, how they classify mild cognitive impairment (MCI) and what ages of patients they include (Mattsson et al., 2009; Petersen and Trojanowski, 2009).
The second category of error arises from how people actually perform the testing itself, Blennow said. There are many small ways in which one analyst’s procedure differs from another’s, though overall, sites tend to get more skilled at running these tests over time. In the October 15 issue of Clinical Chemistry, Cees Mulder, Verwey, and colleagues from the Amsterdam group reported that as they monitored their own site’s performance over the course of six years, their results became more stable in the second half as they gained more experience (Mulder et al., 2009). Like the Göteborg group, the Amsterdam group routinely provides CSF testing for external healthcare providers in the country.
To get a close-up view of this second source of differences, the Amsterdam group invited technicians (the folks who actually run the tests their lab chiefs then present at conferences) to a workshop of side-by-side, elbow-rubbing CSF analysis. On October 19 and 20, 26 analysts from 17 different European centers did exactly that at the VU’s Alzheimer’s center. U.S. labs had received invitations but were unable to send a representative, either for lack of funding or because they use a different assay from the one used at the workshop, Verwey said.
Two analysts who did not know each other paired up into 13 groups; one analyst conducted a widely used Aβ42, tau, and p-tau ELISA on three separate CSF pool samples, while the other watched, took exact minutes of each procedural step, and discussed the differences. The technicians received a protocol based on the manufacturer’s publications but otherwise followed the procedure they use at their home institution. Everyone analyzed the same samples in the same lab at the same moment and the same temperature, using the same assay batch and the same reagents. By holding all these variables constant, the workshop isolated for detection intra-assay, interpersonal differences inherent to analytical procedure. “It was fantastic fun to do this with an enthusiastic group of people who don’t get to travel to meetings very often,” Verwey said.
VU’s statisticians are still working out the results, but already during the workshop, it became clear that individual technicians do things differently in myriad little ways. Here’s a partial list: Some people use a second, transfer ELISA plate that comes with the test kit while others do not; some use all given dilution steps in the ELISA standard line, others skip a dilution they consider superfluous. “People don’t necessarily follow all the steps of the manufacturer’s instruction,” Verwey said. There’s more: People use different amounts of sulfuric acid to stop the ELISA reaction; some use reverse pipetting while others don’t; some shake the plate while others don’t. Some people leave samples on the table between steps, others put them in the fridge; some incubate the ELISA plate at 25 degrees, others at room temperature. “We spotted some 20 points of difference, and we’ll work on synchronizing these procedures,” Verwey added. The Amsterdam group is a reference site in the Alzheimer’s Association-funded QC program, and is working collaboratively with its leaders in Sweden.
The third category of measurement error appears to arise from the assays themselves, several researchers pointed out. For what is perhaps the most frequently used test, the Innotest ELISAs, several academic groups have reported that the intra-test variability at their site is low within a given production batch. But they have also noticed that the performance of a given test changes from batch to batch. ELISAs contain monoclonal capture and detection antibodies that are typically generated in hybridoma cell culture. In producing a new batch, slight differences in the medium and other culture conditions, in the concentration or purification of the antibodies, and even in the reagents and plastics the manufacturer purchases for ELISA production, could all lead to batch-to-batch variability. Such variability was reported in a longitudinal study of CSF tau measurement (Verwey et al., 2008). It introduces uncertainty: Did Joe Smith’s tau readout go up a notch because there is more tau in his CSF this year than last, or because the lab used a different assay batch? In an interview, Verwey told ARF that Blankenstein, who heads Clinical Chemistry at the VU Medical Center, purchased enough ELISA all at once for the year 2009. “The results were very stable at the hospital throughout 2009. I think it’s probably because we used a single batch number,” he said, adding that his group will continue to study this issue in the future.
Confirming this observation, Blennow noted that the external QC initiative, by demonstrating the long-term performance of a given ELISA across production lots, will encourage companies to ensure that one batch of a given test corresponds completely with the previous one. “Not only sample collection and analysis will become more standardized. In time, assay production will, too, and both site and batch differences will become smaller,” Blennow said.
Batch-to-batch changes may explain, in part, why the cutoff values that Alzheimer’s centers calculate to decide whether a person has AD or not have shifted in recent years. In Amsterdam, for example, the cutoff for Aβ42 has crept up over the past six years from 450 ng/L to 500, and is now approaching 550, said Verwey (see also Mulder et al., 2009). Creeping cutoffs could pose a problem for longitudinal studies. More broadly, the current situation, in which each center has to set its own cutoff because of center-to-center variation, is a challenge for multicenter studies as well.
The CSF assays puzzle researchers in other ways, too. For example, they don’t show dilution linearity, meaning that if a given CSF sample measures in at 500 ng/L Aβ, diluting the sample by two will not generate a reading of 250 ng/L. On the contrary, the reading goes up, and then drops with further sample dilution, Verwey said. This may reflect how finicky and dynamic a peptide Aβ42 is.
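Dilution linearity is usually checked by multiplying each diluted reading by its dilution factor and comparing the result with the undiluted ("neat") reading; a linear assay recovers close to 100 percent at every step. The sketch below illustrates that arithmetic only. The function names, the readings, and the 20 percent tolerance are assumptions made for illustration, not values or acceptance criteria reported by the scientists quoted here.

```python
def dilution_recovery(neat_reading, diluted_readings):
    """Percent recovery at each dilution step.

    diluted_readings maps a dilution factor (2 = diluted twofold) to the
    reading measured on the diluted sample. Recovery is the dilution-corrected
    reading as a percentage of the neat reading; 100 means perfectly linear.
    """
    if neat_reading <= 0:
        raise ValueError("neat reading must be positive")
    return {
        factor: 100.0 * reading * factor / neat_reading
        for factor, reading in diluted_readings.items()
    }


def is_linear(recoveries, tolerance_pct=20.0):
    """True if every recovery lies within tolerance_pct of 100 (tolerance is an assumption)."""
    return all(abs(r - 100.0) <= tolerance_pct for r in recoveries.values())


# Hypothetical readings in ng/L, invented for illustration only.
neat = 500.0
diluted = {2: 320.0, 4: 110.0}  # a linear assay would read 250 and 125

recoveries = dilution_recovery(neat, diluted)
for factor, pct in sorted(recoveries.items()):
    print(f"1:{factor} dilution -> {pct:.0f}% recovery")
print("dilution-linear:", is_linear(recoveries))
```

With readings like these, the dilution-corrected value overshoots at the first step and undershoots at the next, so the sample fails the check. That is the general kind of behavior a non-linear assay produces, though the actual pattern for Aβ42 would have to come from measured data.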
In the past three years, a growing number of laboratories have switched to using a newer, multiplex test that captures readings for Aβ42, tau, and p-tau simultaneously in one run. Called INNO-BIA AlzBio3, this test generates different absolute values on a given protein from the corresponding Innotest. For example, the same sample that generates a 470 reading in the Innotest may generate a 160 reading in the AlzBio3. That in itself is not unusual or troublesome. However, besides large differences in absolute CSF levels between these two methods, the scientists reported a lack of linearity between the assays (Verwey et al., 2009). This indicated to them that the differences cannot be attributed solely to standardization, and that comparison of these two methods is not useful. “The tests do not correspond tightly. There is no parallelism,” Verwey said.
This can be a problem for multicenter studies if different participating sites use different tests and all results are to be analyzed in one large database. Such studies would be well advised to choose a test in the beginning and stick to it, Blennow said. In addition, study sponsors could consider running a Swedish QC sample alongside whichever company kit they use.
When used clinically to distinguish AD from control, the AlzBio3 works well, scientists interviewed for this article agreed. In other areas of clinical laboratory practice, for example, troponin T measurement following a heart attack, both high-sensitivity and lower-sensitivity tests are helpful so long as each is used with its own reference range, Zetterberg added.—Gabrielle Strobel.