As more drugs enter the trials pipeline in ever-earlier-stage patients, the question of how to measure success is coming to the fore. The conventional model of measuring efficacy with dual cognitive and clinical scales such as ADAS-cog and ADCS-ADL in mild to moderate Alzheimer’s is widely considered too insensitive for the prodromal, and particularly the preclinical, phases of the Alzheimer’s disease continuum. The field is in the middle of a transition, where a handful of trial sponsors have already used the revised diagnostic criteria and tried different outcome measures, and others are watching them closely to see what works and what doesn’t. Meanwhile, the doors of research have swung wide open for biostatisticians and neuropsychologists to come up with altogether new tools. Weighing the strengths and weaknesses of outcome markers creates an alphabet soup of a conversation. CDR-sb, MMSE, NTB, FCSRT, RBANS—confused yet? For context, and to capture some of the buzz at CTAD, read our Q&A with Steven Ferris, who directs the Clinical Trials Program and co-directs the Alzheimer’s Disease Center at New York University. Ferris has been involved in clinical trials for all approved and many investigational Alzheimer’s medications, as well as development of tools to track cognitive function. Questions by Gabrielle Strobel.
Q: There was a lot of discussion at CTAD about using the CDR-sb as an outcome measure. How did aducanumab fare with it in the PRIME trial?
A: To state this fairly, you have to give context. Two clinical outcomes were reported for this trial. It’s not common for companies to collect clinical outcome information, or report it, in a Phase 1b trial. Remember, Phase 1b is for safety and for learning how high you can go with dose and still have it be safe. The other aim, typically, is to confirm target engagement, so you do either CSF or amyloid PET to look for changes presumed to represent a direct biological effect of Aβ immunotherapy. It was a little unusual to measure clinical outcomes, and with the CDR-sb, MMSE, FCSRT, and an NTB, they measured quite a few.
They did see a clinical signal with nominally significant p values, even though the study was not powered to detect an effect on these measures. A group of 30 people per dose is fairly large for Phase 1b but fairly small for assessing clinical outcomes; researchers usually want at least 100 people per group when finding the best dose for clinical outcomes. Lo and behold, they saw something. The effects were surprisingly big for a field used to seeing a couple of points of difference on the ADAS-cog. The CDR-sb is not a sensitive measure in small trials; it is usually used in larger, longer trials of 18 months or more. Seeing a clinical effect at all is what gave these results so much buzz.
Q: Then is there a problem?
A: The problem is that with small group sizes, you can get an apparent effect that is statistically significant but due to chance. That is why there is usually not much emphasis on clinical outcomes in such a small study. That said, this was a perfectly appropriate study design. Nobody did anything wrong; it’s just that the design is limited in the extent to which it meets the criteria for a robust double-blinded trial.
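The point about chance findings in small groups with multiple endpoints can be made concrete with a small simulation. This is an illustrative sketch only, not an analysis of the actual trial: it assumes the 30-versus-10 group sizes mentioned in this interview, four clinical endpoints, no true drug effect, and pure standard-normal noise on each endpoint.

```python
import math
import random

random.seed(0)

def phi(x):
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def simulate_null_trial(n_drug=30, n_placebo=10, n_endpoints=4):
    """One hypothetical Phase 1b-sized comparison under the null hypothesis:
    the drug does nothing, so every endpoint is pure noise."""
    min_p = 1.0
    for _ in range(n_endpoints):
        drug = [random.gauss(0, 1) for _ in range(n_drug)]
        placebo = [random.gauss(0, 1) for _ in range(n_placebo)]
        # Two-sample z-test; the true standard deviation (1) is known here.
        z = (sum(drug) / n_drug - sum(placebo) / n_placebo) / \
            math.sqrt(1 / n_drug + 1 / n_placebo)
        min_p = min(min_p, 2 * (1 - phi(abs(z))))
    return min_p

# Fraction of null trials in which at least one endpoint reaches p < 0.05.
n_sims = 2000
false_hits = sum(simulate_null_trial() < 0.05 for _ in range(n_sims))
print(f"Trials with a 'significant' endpoint by chance alone: {false_hits / n_sims:.2f}")
```

With four independent endpoints, roughly one in five such trials will show a nominally significant result somewhere even when the drug does nothing, which is why clinical signals from small dose cohorts are treated cautiously.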
Q: How about the dose response?
A: That made it more believable, certainly with the dose-dependent amyloid effect, and also the dose-dependent clinical effect as originally reported.
Q: Why do you think it was not robustly double-blinded?
A: Because you escalate the dose as you go along. You recruit an initial 40 people for the first dose. You randomize to active or placebo. The groups are not evenly distributed between treatment and placebo because the trial’s primary goal is safety, so you want the majority of people to be exposed to the drug. They had 30 on drug and 10 on placebo. If you see no big safety problems, you enroll another 40 and increase the dose, again at 3:1. You do this sequentially. In a parallel three- or four-arm trial, as in Phase 2b and 3, you randomize everyone at the same time to the planned distribution. That gives more robust blinding and minimizes possible imbalances or other confounds. Here it was a Phase 1b dose-titration safety study; that is why they did it this way. One implication of this design is that it reduces the certainty that what you see clinically is real.
For example, when you have a 3:1 ratio between active drug and placebo, it is not a totally blinded study in the sense that the patients and their caregivers know the design, and the people evaluating them know the design, and their expectation is that more people are on drug. So when you do a CDR rating, which is subjective, there could be some bias that happens when you are not completely blinded.
Q: What happens as the trial goes on?
A: A while later, sites do the next dose and the next dose. They know that the more recently recruited patients are likely to be on a higher dose. We saw at AAIC that the dose-response conclusion got a bit weakened when they added a fourth dose later. It was an intermediate dose and did not follow the dose-response pattern. So there could have been something about the serial nature during the first three doses that made some people think they saw benefits. I emphasize again that I am not saying anybody was doing anything wrong, just that the nature of this design is a little bit prone to bias in the CDR rating. There could be a subliminal expectation that the person in front of you is in a treatment arm.
Q: How about the amyloid PET?
A: That is not subject to that sort of bias.
Q: Are dose-titration designs the only concern with using CDR-sb as outcome?
A: No, my concern about the CDR-sb as an outcome measure is broader. If I asked 20 of my colleagues if they would pick the CDR-sb as the single primary outcome for a big pivotal trial in AD—particularly a trial with a mix of MCI stage and mild AD stage—most would say no. It strikes me that these next trials have been set up with some risk due to a less than optimal primary outcome.
The CDR-sb is not optimally sensitive to a treatment effect in this particular cohort that Biogen, and also Roche, have selected for their aducanumab and gantenerumab trials, respectively. The reason is that in this very mild group, there is not much scoring on the CDR-sb. It has six boxes that are rated, and a total score range of zero to 18. The lowest point is zero. People with a CDR 0.5—the population these trials are recruiting—usually have a total sum of boxes score below four at baseline. So they simply have not much room to get better. You want a scaling where the typical individual in your trial is in the middle of the range, so they have lots of room to change if they get worse or better. Also, it’s a rather coarse scale and will not pick up small changes. The CDR is a great dementia staging tool, but the CDR-sb is a relatively poor outcome measure for prodromal or MCI-stage therapeutic trials.
Q: What led to its use, then?
A: Historically, the regulatory authorities have valued performance-based cognitive outcomes such as the ADAS-cog. The ADAS-cog led to registration for drugs in mild to moderate AD but is relatively insensitive at earlier stages. Moreover, with their guidelines from the early 1990s, the FDA started requiring a dual outcome. In her talk at CTAD, Suzanne Hendrix explained to us that we pay a statistical price with a dual outcome. You have to win on both, and having to reach a required p level of 0.05 on one and also on the other raises the bar. Since in the past decade no drugs succeeded, the FDA recently put out a signal in their guidance that at the early AD stage, they might consider a single primary outcome instead (Kozauer and Katz, 2013; Mar 2013 news). In the guidance, they almost off-handedly said, “... such as possibly the CDR-sb.” The CDR-sb had performed well in ADNI, tracking decline across the very early disease continuum. Still, I do not think the regulators adequately thought through the use of CDR-sb as a single trial outcome. Not only does it lack sufficient dynamic range at the early AD stage, it also negates the well-established FDA standard requiring inclusion of a performance-based cognitive measure. The CDR-sb is a rating scale, where the rater interviews an informant and the patient. It does not meet the old requirement that you show a benefit on an objective psychometric test.
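The statistical price of a dual primary outcome can be sketched with back-of-envelope arithmetic. The 80 percent power per endpoint is a hypothetical figure, and independence between the two endpoints is assumed purely for illustration; correlated endpoints would fare somewhat better.

```python
# Hypothetical power of each endpoint on its own (not from the interview).
power_cognitive = 0.80   # chance of winning on the cognitive endpoint
power_clinical = 0.80    # chance of winning on the clinical endpoint

# With a dual primary outcome, the trial must win on both endpoints,
# so (assuming independence) the powers multiply.
dual_power = power_cognitive * power_clinical

print(f"Power with a single primary outcome: {power_cognitive:.0%}")
print(f"Power when you must win on both:     {dual_power:.0%}")
```

Two endpoints that each succeed 80 percent of the time yield only about 64 percent joint power, which is why a single primary outcome is so attractive to sponsors.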
Q: So the companies went with that?
A: They are understandably jumping at the opportunity of having a single rather than a dual primary outcome, which is certainly advantageous statistically. The problem is they put all their eggs in a less-than-optimal outcome basket.
Q: What would make a better outcome?
A: At CTAD, Suzanne Hendrix made a very strong case for the advantages and improved sensitivity of a composite that can include both clinical and cognitive performance components. That is what Biogen would be well-advised to do, in my opinion. They could include the CDR-sb, and also some cognitive tests, maybe some ADLs that change early, and build a composite from that.
Q: These trials have started. Can they still change course?
A: They cannot add assessments once they are enrolling. Whatever is in the protocol as primary, secondary, and tertiary outcomes is what they have. The total array of assessments they collect is very good. However, they can still change their minds about what they are using as the primary outcome, provided they only include in their composite data that is already being collected. They have to justify it to the FDA and register their revised analysis plan before they lock the data at the end of the trial. Ideally, given what we are learning now, they could set some statisticians to work and figure out the best composite using the measures the trials are collecting. They probably have at least a year to get that done. I hope they will. We all want new treatments to succeed.
Q: Has it been done before?
A: Yes. I believe Lilly changed their primary outcomes with solanezumab, for example.
Q: How are other drugs performing on the CDR-sb?
A: That is only beginning to come out. The gantenerumab ScarletRoAD trial fared poorly with it. Some of their subgroup analyses reported at CTAD showed differences on other outcomes but not for CDR-sb. The Merck verubecestat trials have not reported any clinical results as yet.
Q: What do you suggest instead?
A: It’s worth considering that several other trials in early AD have opted for a composite. Eisai is using the ADCOMS composite they developed for both their antibody and BACE inhibitor (see BAN2401, E2609). Some preclinical AD trials are also using composites. The A4 trial, the planned API ApoE4/4 trial, and the DIAN trial each have developed a composite that they think is sensitive in their respective populations.
Q: Other outcomes are being used, as well. In the PRIME study, there was no change on a modified NTB and the FCSRT. Apparently there were floor effects, indicating these measures are unsuitable for this population.
A: I am not sure why there would have been floor effects. I have long pointed out that there is not a single NTB. It started with bapineuzumab, where John Harrison made a battery and called it NTB (Harrison et al., 2007). Now many different variations are called NTB; it has almost become a generic name for batteries that can be quite different, so you have to look at what tests are included. Still, you usually should not have floor effects on these batteries at the MCI stage and in mild AD. Floor means everybody is hovering around zero scores and can barely do the test. An NTB was used in the bapineuzumab trial in mild to moderate AD. People with MCI generally show significant impairment on an NTB, so I would not expect ceiling effects either.
Q: What about the FCSRT? It has been used in some prodromal/MCI-stage trials.
A: The FCSRT is a newer, fairly complicated verbal list learning task (Grober et al., 2010). It taxes the hippocampal-limbic network, which is impaired early on. Here, too, there should not be floor effects in an MCI population. The FCSRT has 16 words and usually three learning trials, so the maximum recall score is 48. MCI participants tend to recall at least a few words on each trial, for a score of 10 to 20. It surprises me that those two measures would show no treatment effect while the MMSE does. The FCSRT is a single test, the NTB is a cluster of four or five tests, and both should be more sensitive than the MMSE to gradations of impairment in MCI.
Q: At CTAD, I heard some agreement when Paul Maruff of Cogstate reported that the MMSE adds mostly noise to composites, at least when tracking progression in the longitudinal AIBL data. Others said only the orientation items from the MMSE are useful, and have pulled those, for example into the Banner/API composite. What do you think about including the MMSE in composites?
A: I think it’s a bad idea except for orientation, because that taps a different domain from everything else. Other components of the MMSE are close to ceiling in this very mild population, so you just get noise. The likely reason people with preclinical or prodromal AD fail MMSE questions is they are not paying attention to their environment.
Q: But it’s used everywhere.
A: The MMSE has been widely used for decades simply as a screening tool for dementia. We use it at baseline and refer to a score range to broadly characterize severity. It’s almost like a language: a “minimental 24” is a person who is doing pretty well with mild impairment; a “10” is a profoundly impaired individual. But it is not a good outcome measure in a trial. That is why back in the 1980s everyone thought we needed something better and the ADAS-cog was created.
But now, a lot of people further dislike using the MMSE because it was commercialized long after the fact. It was created by the academic investigators Susan and Marshal Folstein, who published it 40 years ago (Folstein et al., 1975). It became the main first-pass instrument a neurologist or psychiatrist would use in an office visit to evaluate a possible AD patient. Then in 2001 the copyright was passed on to a test company, which now demands payment every time you administer it. That is a major annoyance in the academic clinical community, and indeed its use by the 30 ADRCs has been phased out, mainly because NIH will not expend limited resources on the licensing fee.
Q: I have been hearing a lot about RBANS at CTAD. What’s going on?
A: The RBANS is a psychometric battery developed by the neuropsychologist Chris Randolph (Randolph et al., 1998). It has been available for a while, but has been rediscovered because people want an alternative to the ADAS-cog for trials in milder populations. Some international trials are planning to use the RBANS, for example the European EPAD initiative.
Q: Another buzzword at CTAD was MedAvante?
A: Full disclosure: I am a paid member of their AD advisory board. They started by providing data quality services for trials in psychiatric disorders and then branched out into Alzheimer’s. They offer services related to the RBANS among other cognitive batteries and trial measures, including the ADAS-cog and CDR-sb. But rather than selling use of tests, their main business is contracting with pharma to improve data quality and minimize data errors in clinical trials. They offer rater training, and review and scrutiny of data either in real time or shortly after collection. They have an electronic system for querying possible data problems and ensuring corrections. If a rater is not properly administering a scale, the system picks that up and tries to fix it during the period the data is being collected.
It is all about minimizing noise in a trial. In the past, CROs have done this and, frankly, in many cases they have done it badly. MedAvante specializes in this and has sophisticated systems to optimize the accuracy of outcome measures. For example, with their centralized rater system for psychiatric trials, a small group of highly trained, specialized raters rate everyone in a multicenter trial by video conferencing. A problem with raters is that every site uses different ones, often with different training and uneven quality. One rater may score low, another high, or a rater scores lower one day and higher the next. All this increases “noise.” With a whopping drug effect, this does not matter, but if you have a modest drug effect, the large standard deviation can overwhelm your group difference, and the only way to achieve a required p value is to minimize error.
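The cost of rater noise described above can be illustrated with the standard two-sample sample-size approximation. The one-point effect and the standard-deviation values below are hypothetical, chosen only to show the arithmetic: because required sample size scales with the square of the standard deviation, modest extra noise is expensive.

```python
# Standard normal quantiles for a conventional trial design.
Z_ALPHA = 1.96   # two-sided alpha = 0.05
Z_POWER = 0.84   # power = 0.80

def n_per_arm(effect, sd):
    # Common approximation for a two-sample comparison of means:
    #   n = 2 * (z_alpha + z_power)^2 * (sd / effect)^2
    return 2 * ((Z_ALPHA + Z_POWER) * sd / effect) ** 2

# Hypothetical modest drug effect of one point on some scale; noisier
# ratings inflate the standard deviation and hence the required n.
for sd in (2.0, 3.0):
    print(f"sd = {sd}: about {n_per_arm(1.0, sd):.0f} participants per arm")
```

Raising the standard deviation from 2 to 3 points multiplies the required sample size by (3/2)² = 2.25, which is the quantitative sense in which minimizing rater error can substitute for enrolling many more patients.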
Q: What’s the difference between MedAvante and a CRO?
A: The CROs will continue to run the nuts and bolts of trials. But optimizing data quality is increasingly being transferred to companies like MedAvante from the old system of relying on CRO monitors, who used to visit sites and look at paper data records and have sites make corrections manually, after the fact. Now that most trials have adopted electronic data capture, companies like MedAvante play an increasingly important role.
Q: In your view, what are the main challenges in AD psychometrics these days?
A: We need more sensitive but robust clinical outcome measures for early stage trials. We need cognitive tasks that specifically tap early impairment in those brain areas and networks that have pathology early on, and that are validated against relevant AD biomarkers in the same types of subjects. Such tests could tell which individuals among cognitively normal people within a certain age range are most likely to have brain amyloid or tau pathology. The preclinical trials are here, and they are the new frontier. We need more sensitive tests that enable screening of large numbers of people cheaply, so that only people who score poorly get a PET scan or CSF assays to confirm likely “preclinical AD.” That would bring down the current 70-80 percent screen-failure rate of the A4 trial, screen-failure meaning a person is ineligible for the trial because a $4,000 PET scan indicates they are amyloid negative. And importantly, these same cognitive tests would probably also be optimally sensitive measures for tracking treatment response in a preclinical AD trial.
Q: Thank you for this conversation.
- Kozauer N, Katz R. Regulatory innovation and drug development for early-stage Alzheimer's disease. N Engl J Med. 2013 Mar 28;368(13):1169-71. Epub 2013 Mar 13. PubMed.
- Harrison J, Minassian SL, Jenkins L, Black RS, Koller M, Grundman M. A neuropsychological test battery for use in Alzheimer disease clinical trials. Arch Neurol. 2007 Sep;64(9):1323-9. PubMed.
- Grober E, Sanders AE, Hall C, Lipton RB. Free and cued selective reminding identifies very mild dementia in primary care. Alzheimer Dis Assoc Disord. 2010 Jul-Sep;24(3):284-90. PubMed.
- Folstein MF, Folstein SE, McHugh PR. "Mini-mental state". A practical method for grading the cognitive state of patients for the clinician. J Psychiatr Res. 1975 Nov;12(3):189-98. PubMed.
- Randolph C, Tierney MC, Mohr E, Chase TN. The Repeatable Battery for the Assessment of Neuropsychological Status (RBANS): preliminary clinical validity. J Clin Exp Neuropsychol. 1998 Jun;20(3):310-9. PubMed.