2009  IDSA  Lyme  Disease  Review  Panel  Hearing   ALISON DELONG

Dr. Baker: So, I would like Allison Delong, please, to come to the podium for her presentation.

Allison Delong: I also have copies of my presentation if you don’t have it.

Dr. Baker: Yes, we have copies of all the presenters’ presentations.

Allison Delong: I didn’t know if the screens were small and I have tables. Let’s see first of all, I would like to thank you very much for inviting me to come here today. My name is Allison Delong and I would like to first also acknowledge my co-authors, Dr. Tao Liu and Barbara Blossom. Tao and I --

Dr. Baker: Excuse me. Can you bring the microphone down a little? Thank you.

Allison Delong: [Laughs] I’d like to acknowledge my co-authors, Dr. Tao Liu and Barbara Blossom. Tao and I are biostatisticians at the Center for Statistical Sciences in the Department of Public Health at Brown University. Most of my work and much of Tao’s has concentrated on studying HIV/AIDS and other infectious diseases. Please refer to our written submission in the ILADS binder, but also don’t hesitate to contact Tao or me if you have any remaining questions about this presentation or, in fact, on the statistical analysis of any other study you review for the Guidelines.
For this hearing, we present the results of a statistical review of the NIH-funded randomized controlled trials that examined re-treatment of Lyme disease following recommended therapy. Our findings are inconsistent with the description of these trials in the 2006 Guidelines. Specifically, the objective of this presentation is to challenge the following two recommendations in the Guidelines. The first states, quote, "Adult patients with late neurologic disease affecting the central or peripheral nervous systems should be treated with ceftriaxone ... Re-treatment is not recommended unless relapse is shown by reliable objective measures." The second recommendation is in the section on post-Lyme disease syndrome and states, quote, "Antibiotic therapy has not proven to be useful and is not recommended for patients with chronic subjective symptoms after administration of recommended treatment regimens for Lyme disease." This recommendation was given the highest level of evidence (E-1), and these two guidelines are inter-related because patients diagnosed with late neurological Lyme disease who have continued symptoms would be classified as having post-Lyme disease syndrome and re-treatment would not be recommended.
Here is a copy of your IDSA’s grading system for ranking recommendations. As this table shows, the E-1 grading level implies that there have been one or more randomized controlled trials that proved or provided extremely strong evidence that antibiotic treatment is ineffective. In addition though, for E-1 grading, it is essential that no trial has shown antibiotic treatment to be effective.
To date there have been four NIH-funded randomized controlled trials to examine re-treatment of patients with a history of Lyme disease and persistent symptoms. I will refer to them as the Fallon, Krupp and Klempner studies. Of note, the Fallon study was published in full after the 2006 Guidelines were written. In these studies, all patients had a prior diagnosis of Lyme disease and had been treated, but they still had symptoms 4-6 months following treatment.
The Fallon study examined re-treatment of such patients. In addition, however, these patients were required to have measured memory impairment at baseline. Other than noting the small sample size, only 37 participants, Tao and I found no problems with their statistics or interpretation of their findings. Fallon found significant improvement in overall cognition at 12 weeks, which was not sustained to 24 weeks. Regarding their secondary outcomes though, they found that among those whose baseline scores were worse, there was improvement compared to placebo in measures of overall physical health and pain out to 24 weeks, and improvement in fatigue to 12 weeks. In addition, using the categorization defined by Krupp, which I’ll talk about next, they found sustained improvement in fatigue to 24 weeks.
The Krupp study was similar to the Fallon study, except they included severe fatigue as an entry criterion instead of memory impairment. Again, the study was small with only 55 participants. The three primary outcomes were fatigue, mental speed, and clearance of the OspA antigen. Clinical improvement in fatigue was defined to be a 0.7 decrease on the fatigue severity scale. The researchers found a significant effect on fatigue. The top line of this table holds the results of an intention-to-treat analysis: 64% of the participants in the antibiotic arm, compared to only 18.5% in the placebo arm, had clinical improvement in fatigue at 6 months follow-up. Other fatigue measures were also significant.
The Guidelines criticize the findings based upon possible problems with patient selection, loss to follow-up and potential unmasking of the study medication. Regarding patient selection, the Guidelines say that it is possible that three patients in the placebo arm may not have met entry criteria for the study. Here is a quote from a popular text on clinical trials. An intention-to-treat analysis is, quote, "the analysis that includes all randomized patients in the groups to which they were randomly assigned, regardless of adherence with the entry criteria, regardless of the treatment they actually received, and regardless of subsequent withdrawal from treatment or deviation from the study protocol." The reasoning behind this quote is that the goal of an intention-to-treat analysis is to obtain an estimate of the effect of treating the next patient needing care, and such inconsistencies are expected to occur in a clinic.
Loss to follow-up is the second concern raised by the Guidelines. If we look at Krupp’s results on fatigue, this is what we know with certainty. The rightmost column holds the seven participants who were lost to follow-up, two in the ceftriaxone arm and five in the placebo arm. If the two people in the ceftriaxone arm had not dropped out, one of the three rows in this table on the right would have been observed, the complete cases. Similarly, if the five had not dropped out of the placebo arm, one of these six rows would have been observed. Had no patients dropped out, one of the 3 x 6, or 18, datasets would have been observed. Krupp tested all 18 datasets and all p-values were less than 0.05. Therefore, loss to follow-up did not affect the study findings on fatigue.
Krupp and the Guidelines mention there may have been problems with masking. As Krupp states, quote, "patients in the ceftriaxone group may have improved in fatigue because they were more likely to believe they were on active therapy." Using their published numbers, we found the percentage of participants thinking they were taking active therapy did not actually differ by arm. There is no quantitative evidence of problems with masking.
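The loss-to-follow-up check described above (3 x 6 = 18 possible completed datasets) can be sketched in Python as follows. This is only an illustration under stated assumptions: the arm sizes (28 and 27) and the counts of improved completers (18 and 5) are back-calculated from the quoted 64% and 18.5% and are not given in the testimony, and the test Krupp actually used is not named here, so an uncorrected Pearson chi-square test is assumed.

```python
from math import erfc, sqrt


def chi_square_p(a, b, c, d):
    """Two-sided p-value of the uncorrected Pearson chi-square test
    for the 2x2 table [[a, b], [c, d]] (1 degree of freedom)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 df.
    return erfc(sqrt(stat / 2))


# ASSUMED counts (not stated in the testimony), chosen to be consistent with
# the quoted intention-to-treat results of 64% and 18.5% improved.
CEF_N, CEF_IMPROVED, CEF_LOST = 28, 18, 2   # ceftriaxone arm
PLA_N, PLA_IMPROVED, PLA_LOST = 27, 5, 5    # placebo arm


def sensitivity_p_values():
    """Enumerate every way the lost participants could have turned out.

    x = lost ceftriaxone participants who would have improved (0..2),
    y = lost placebo participants who would have improved (0..5).
    """
    results = {}
    for x in range(CEF_LOST + 1):        # 3 possibilities
        for y in range(PLA_LOST + 1):    # 6 possibilities
            cef_imp = CEF_IMPROVED + x
            pla_imp = PLA_IMPROVED + y
            results[(x, y)] = chi_square_p(
                cef_imp, CEF_N - cef_imp, pla_imp, PLA_N - pla_imp
            )
    return results


p_values = sensitivity_p_values()
print(len(p_values), "datasets; largest p-value:", max(p_values.values()))
```

If every one of the enumerated p-values stays below 0.05, the conclusion does not depend on what happened to the participants who dropped out, which is the argument made in the testimony.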
Krupp made a mistake and instead compared the percentage correctly guessing treatment assignment. Changes in the mental speed outcome were not different between arms, either. Why? First, as Krupp states, the patients only had mild deficits. Second, as designed, this outcome only had 74% power. Finally though, and most importantly, prior work by Krupp and colleagues showed that a 25% improvement, which is the value they defined to be clinically meaningful in this trial, forces Lyme patients to perform better than healthy controls. Taken together, lack of significance on this outcome does not prove the treatment was ineffective. The points I just discussed, together with this quote copied here, should clarify any questions you may have about the validity of Krupp’s findings on fatigue. Krupp states, quote, "The improvement in fatigue could be considered an encouraging finding in that future studies exploring other less expensive and noninvasive methods for treating severe fatigue might be effective."
Unlike the Fallon and Krupp studies, the final study we reviewed has substantial statistical problems that prevent its use in formulating treatment guidelines. Klempner ran two trials concurrently, one with IgG seropositive and one with seronegative participants. The outcomes were measured at baseline and days 30, 90 and 180. The trial was stopped early, and only a fraction of the planned participants were enrolled in this study. The outcomes were the two summary scores from the SF-36, the physical component summary score, or PCS, and the mental component score, or MCS. The scores are numeric and the mean for the US population is 50. Lower scores imply worse health. The actual numeric values of the PCS and MCS scores have been linked to disease severity and quality of life. I’ll give examples of this shortly and Dr. Cameron gave examples of this too. Also, changes in the PCS and MCS scores have been linked to changes in quality of life. For instance, improvement of 5 points on the PCS translates into a 20% decrease in the percent of patients unable to work.
From a statistical perspective, the appropriate statistical analysis of the treatment effect using these scores is a regression analysis for continuous, longitudinal data that adjusts for baseline, uses data from all follow-up time points, and considers any issues with non-random patient dropout. This is the most efficient analysis, meaning the one that is most powerful, or most likely to obtain a significant result. The resultant treatment effects should have reduced bias and, as the example above shows, clinical meaning.
For the sample size calculations and data analysis, Klempner categorized the participants as improved, the same, or worsened based upon their observed SF-36 change from baseline to 180 days. The cut points for the categories were 6.5 on the PCS and 7.9 on the MCS.
The treatment effect they present is the difference between the arms in the percent improved or worsened, tested with a chi-square test. Compared to the statistical analysis I presented on the previous slide, the Klempner analysis was inefficient, probably biased, and had an unsatisfying measure of the treatment effect. Inefficiency is due to omitting the 30 and 90 day follow-up measures in the analysis and categorizing the continuous outcome. The results are probably biased because baseline scores differed by arm and because there was an inadequate assessment of the impact of the patients who were lost to follow-up. The paper does not provide the number of participants who did not complete the 180-day SF-36 questionnaire. These patients were placed in the worsened group. The reader wonders how much of the worsened group was those lost to follow-up. Finally, reports have described patients with Lyme disease to have symptoms that wane and wax through time. The effect of this is that participants could be in a relatively good or bad period during follow-up compared to baseline and thus categorized as improved or worsened, but their overall health condition may not have changed. This emphasizes the need for the statistical analysis I presented comparing mean SF-36 changes by arm, and it underscores the difficulty in interpreting Klempner’s categories clinically.
After reading the Klempner study, we wondered how changes in the SF-36 summary scores related to clinically meaningful changes for the patient, and this concern led to a literature search. We found no studies on Lyme, but this table lists five studies specifically designed to estimate clinically meaningful mean changes in SF-36 scores for patients with other severe chronic illnesses. In each study, measured improvements in the SF-36 were verified by significant improvement in validated clinical measures for the disease studied. These studies corroborate that changes of 2 to 5 points, those are the values in the table, are clinically meaningful. Two to five are much smaller than Klempner’s chosen cut-points of 6.5 and 7.9.
So what happens when expected treatment effects are unrealistically large? It means that the trial was designed with a sample size that was too small. The small sample size resulted in inadequate power, or ability, to detect smaller and, as we showed, relevant treatment effects. As a result, such effects would not be found to be statistically significant. An interim analysis should, and did, result in the termination of the trial because of the high likelihood that the statistical analysis would not reach statistical significance. Terminating a trial for futility does not prove the treatment was shown to be ineffective.
Turning now to the results of the study, we see this table contains the average baseline scores on the PCS and MCS by trial. First, I’d like to point out that the baseline mental score in the seronegative trial was not the same in the arms.
This could result in biased estimates using Klempner’s statistical analysis. Physical scores of 34 to 36 indicate a very unhealthy group of people. These scores are similar to those observed in patients with congestive heart failure or osteoarthritis and are worse than those observed in type 2 diabetes or recent heart attack. The mental scores were lower than the US population average of 50. Looking at these equations here though, we see that a mental score improvement of 7.9 points forces the average Lyme disease patient to perform better than the general US population. Expecting a mean change of 7.9 points on the MCS is unrealistic. It is expected that no significant treatment effect could be found.
After a clinical trial is complete, you get an estimate of the treatment effect, a confidence interval and a p-value, and we can draw one of three conclusions. The third one, in red font, is relevant here. If the p-value is greater than 0.05, the result is insignificant and the confidence interval includes 0. However, if the confidence interval also includes a clinically important difference, it is wrong to say the treatment is ineffective. In this case, case 3, stating that the treatment was ineffective is very likely an error, whose rate is unknown and cannot be measured. This is a type 2 error.
With this in mind, we’ll turn to Klempner’s treatment effects. This table contains the treatment effects and confidence intervals copied from Klempner’s paper. We see, for instance, that the point estimate for the treatment effect on the PCS in the seropositive trial is a 3% difference in the percent improved, but the confidence interval runs from -19% to 24%. Any value in this interval is consistent with the data. The remaining 3 confidence intervals are also very wide and contain 0. Using only the assumption that the differences in SF-36 scores from baseline to 180 days are normally distributed, we mapped between treatment effects on Klempner’s scale and treatment effects on the scale of the SF-36. This is what the bottom table contains. For example, we see a difference in the SF-36 of 5 points corresponds to an expected 18% difference in the percent improved on the physical component and 13% on the mental component. We see that differences of 2 to 5 points correspond to expected differences in the percent improved of 7 to 18% on the PCS. In the top table, first column, we see that 7 to 18% are solidly within the 95% confidence intervals for both trials, and the point estimate of the seronegative trial, at 19%, is consistent with a difference of 5. A similar comparison can be made for the mental score. This shows that Klempner’s results contain clinically meaningful effect sizes. Since clinically meaningful differences are not ruled out, one cannot conclude that the treatment was ineffective.
In their sample size calculation, Klempner expected a 25% (seropositive study) or 35% (seronegative study) difference in the percent improved.
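The mapping between a mean SF-36 difference and a difference in the percent improved, described above, can be sketched in Python under the same normality assumption. The cut points 6.5 (PCS) and 7.9 (MCS) come from the testimony; the standard deviation of the change scores and the mean change in the placebo arm are not stated there, so the values below (`ASSUMED_SD`, `placebo_mean`) are hypothetical placeholders, and the numbers this sketch prints should not be read as the authors' table.

```python
from statistics import NormalDist

PCS_CUT = 6.5      # Klempner's cut point for "improved" on the PCS
MCS_CUT = 7.9      # Klempner's cut point for "improved" on the MCS
ASSUMED_SD = 10.0  # HYPOTHETICAL standard deviation of 180-day change scores


def pct_improved(mean_change, sd, cut):
    """Expected fraction of an arm whose change exceeds the cut point,
    assuming change scores are normally distributed."""
    return 1.0 - NormalDist(mean_change, sd).cdf(cut)


def diff_pct_improved(mean_diff, sd, cut, placebo_mean=0.0):
    """Expected difference in the fraction improved between arms when the
    treated arm's mean change is `mean_diff` points above placebo."""
    treated = pct_improved(placebo_mean + mean_diff, sd, cut)
    control = pct_improved(placebo_mean, sd, cut)
    return treated - control


def mean_diff_for(target, sd, cut, placebo_mean=0.0, upper=100.0):
    """Inverse mapping by bisection: the mean SF-36 difference needed to
    expect a `target` difference in the fraction improved."""
    lo, hi = 0.0, upper
    for _ in range(200):
        mid = (lo + hi) / 2
        if diff_pct_improved(mid, sd, cut, placebo_mean) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for points in (2, 5):
    print(points, "points on the PCS ->",
          round(100 * diff_pct_improved(points, ASSUMED_SD, PCS_CUT), 1),
          "% difference in percent improved (under the assumed SD)")
```

The inverse function `mean_diff_for` corresponds to the reverse calculation: given a planned 25% or 35% difference in the percent improved, it returns the mean SF-36 difference that would be needed under the assumed standard deviation.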
It is unclear why they expected different treatment effects in the two trials. Using the same method as on the previous slide, we calculated the mean SF-36 score changes necessary to expect a 25 or 35% difference in the percent improved. This table shows, for instance, that the physical scores would need to differ by 9.3 points by arm to expect a 35% difference in the percent improved. Differences in the mental score needed to be 9.1 to 12.8 to expect treatment effects of 25% to 35%, while 2 to 5 are clinically meaningful. The trial was designed to be able to detect only these large differences in SF-36 scores. The trial, as designed, was unable to detect clinically meaningful treatment effects. I bring this up to emphasize statistical reasons for stopping the trial. Designing a trial with expected treatment effects of 6.7 to 12.8 points on the SF-36, which are the values required to expect statistical significance, is not reasonable.
A statistical review of this trial found many shortcomings. The trial had an insignificant result which did not rule out clinically meaningful treatment effects. It was, in fact, underpowered to detect clinically meaningful treatment effects. Therefore, one cannot claim the treatment was shown to be ineffective. We believe the findings have little value in formulating treatment guidelines. The guidelines need to be changed.
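The "three conclusions" reasoning from the presentation can be written as a small decision rule. This is a hedged sketch: only the third case is spelled out in the testimony, so the wording of the first two cases is an assumption, and the function name `interpret` is hypothetical. The example numbers (a confidence interval of -19% to 24% and a smallest clinically meaningful difference of 7%) are the seropositive PCS values quoted above.

```python
def interpret(ci_low, ci_high, meaningful):
    """Classify a trial result from its 95% confidence interval.

    ci_low, ci_high: bounds of the interval for the treatment effect.
    meaningful: smallest clinically important benefit, in the same units
                (a positive number).
    """
    if ci_low > 0 or ci_high < 0:
        # Case 1 (assumed wording): the interval excludes 0.
        return "significant"
    if ci_high < meaningful:
        # Case 2 (assumed wording): 0 is included, but every clinically
        # important benefit is ruled out.
        return "no clinically important benefit"
    # Case 3: 0 is included AND a clinically important benefit is not
    # ruled out, so "ineffective" is not a supported conclusion.
    return "inconclusive"


# Seropositive trial, PCS: interval from -19% to 24%; a 2-point SF-36
# difference was mapped to about a 7% difference in the percent improved.
print(interpret(-19, 24, 7))
```

For the quoted interval the rule returns the third case, matching the argument that clinically meaningful differences were not ruled out.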

Dr. Baker: I’m sorry I'm going to have to cut you off, you’re over your time, but I’ll help you by asking you how would you suggest, given your data and your assessment here, that the guidelines be changed?

Allison Delong: Well, first of all, I’d like to say that E-1 evidence against re-treatment does not exist. So, for instance, two well-designed and executed trials, Krupp and Fallon, have examined the effect of re-treatment among patients with persistent symptoms of Lyme disease following the IDSA’s current recommended therapy. Neither trial has proven re-treatment to be ineffective. In fact, both trials have findings that indicate benefits of re-treatment in certain subpopulations, among individuals with worse fatigue, worse physical functioning, and more severe pain. Both studies though, I’d like to say, were small, very small and more research is needed. That’s my response to your question.

Dr. Baker: Thank you and I would like to entertain questions for your presentation now, Dr. Delong, from the panel members and I’ll look this way because I’m prone to look to the right. Dr. Lantos?

Dr. Lantos: Thank you for the very thorough analysis of these papers. Most of your critique is directed at the treatment of raw data after it had already been acquired rather than the design and the execution of the study itself. Do you think further treatment of these studies and further editions of clinical practice guidelines would benefit from an independent review of the raw data?

Allison Delong: Yes. I mean both of these studies were small and they used a specific subpopulation of Lyme patients. I think Krupp and Fallon did a good job of analyzing their data. I can advance to something that I just find so, you know what, oh, intriguing. And this is one of the figures from Klempner’s study. And this is the seronegative. This is a group that really isn’t highly represented in the treatment trials. This is the seronegative patients and you can see at day 180 -- we have the top bars are improved, the middle is the same and the bottom is worse -- at 180 days -- so the bottom you can see on the right is the placebo and on the left is the antibiotic -- right there. We really see that there seems to be perhaps something going on here. And we don’t just see it at day 180. We see it at day 90 and at day 30. So an appropriate statistical analysis may actually be able to detect a significant treatment effect. I was able, I had available to me the numbers in each of these categories. So I could do something like an ordinal regression; it’s a row mean score test on these 180-day outcomes, and my p-value was 0.16. If we did a one-sided test, treatment improves, it’s 0.08, it’s not 0.05. The chi-square test they used had a p-value of 0.34. So just a simple improvement on their method more than halves the p-value toward a significant effect of treatment in this population. It would be a beautiful thing if these data could become available. I’d like to know who dropped out, why they dropped out, meaning who didn’t have the 180-day questionnaire, how much of this worsened group was comprised of those that dropped out, and how that biases the results.
I’d also like to say that in the placebo group, we see some people that improved. That means their score changed by 6.5 points. That’s just an artifact of the instrument. They did feel better at follow-up, but that doesn’t mean their physical health improved. If their symptoms are waning and waxing, they could be anywhere along that trajectory. It doesn’t mean their mean scores changed. An analysis of the means by group really would be a beautiful thing.
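The comparison between a general chi-square test and a row mean score test on ordered categories (improved, same, worse) can be sketched in Python as below. The counts are hypothetical and are not the Klempner data, which are not given in this transcript; the scores (1, 0, -1) are an assumed coding. The mean score statistic used is the standard Cochran-Mantel-Haenszel form for a single table, which has 1 degree of freedom when there are two arms.

```python
from math import erfc, exp, sqrt

SCORES = (1, 0, -1)       # ASSUMED scores for improved, same, worse
antibiotic = (10, 8, 4)   # HYPOTHETICAL counts: improved, same, worse
placebo = (5, 9, 9)       # HYPOTHETICAL counts: improved, same, worse


def pearson_chi_square_p(row1, row2):
    """General-association Pearson chi-square test for a 2x3 table (2 df).
    Ignores the ordering of the categories."""
    n1, n2 = sum(row1), sum(row2)
    n = n1 + n2
    stat = 0.0
    for o1, o2 in zip(row1, row2):
        col = o1 + o2
        e1, e2 = col * n1 / n, col * n2 / n   # expected counts
        stat += (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2
    return exp(-stat / 2)   # chi-square survival function for exactly 2 df


def row_mean_score_p(row1, row2, scores=SCORES):
    """Row mean score test for two arms (1 df): compares the arms' mean
    category scores, so it uses the ordering of the categories."""
    n1, n2 = sum(row1), sum(row2)
    n = n1 + n2
    mean1 = sum(s * o for s, o in zip(scores, row1)) / n1
    mean2 = sum(s * o for s, o in zip(scores, row2)) / n2
    cols = [o1 + o2 for o1, o2 in zip(row1, row2)]
    mu = sum(s * c for s, c in zip(scores, cols)) / n            # overall mean
    v = sum((s - mu) ** 2 * c for s, c in zip(scores, cols)) / n  # score variance
    stat = (n - 1) * (n1 * (mean1 - mu) ** 2 + n2 * (mean2 - mu) ** 2) / (n * v)
    return erfc(sqrt(stat / 2))   # chi-square survival function for 1 df


two_sided = row_mean_score_p(antibiotic, placebo)
# One-sided p-value, valid when the observed effect is in the hypothesized
# direction (here: the antibiotic arm has the higher mean score).
one_sided = two_sided / 2
print("chi-square p:", pearson_chi_square_p(antibiotic, placebo))
print("mean score p:", two_sided, "one-sided:", one_sided)
```

When the arms differ by a consistent shift across the ordered categories, the 1-df mean score test concentrates its power on that shift, which is why it can give a smaller p-value than the 2-df chi-square test on the same table. Halving the two-sided p-value is also how a two-sided 0.16 becomes a one-sided 0.08.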

Dr. Lantos: Just to follow up my question briefly, I think what I'm really getting at is should we trust the raw data generated by these two trials, because it sounds like your main critique is directed against the post-collection treatment of the data and not the acquisition of the data itself.

Allison Delong: I know people have talked about the population studied and it was a diverse group. Someone said, some of the people came in with low scores and they had - some people had low scores, some people had high scores. We see in both the Fallon and Krupp studies that there was an interaction between baseline score and treatment effect. It’s probably happening here too. So those people that came in with worse baseline scores would be more likely to have a treatment effect. I’m not sure if that answered your question. So, I think, at a minimum, these data could be used as kind of like a data mining tool to come up with potential hypotheses to test. It is the biggest trial. It has a lot of patients and it has patients that are not represented in the other studies. I think whoever analyzes the data needs to sit down with the researchers and have some questions asked about exactly how the trial was run, any issues that came up during the trial, drop out, that type of thing.

Dr. Baker: Dr. Medoff?

Dr. Medoff: Just out of curiosity. I'm curious as to why you’re doing these studies and do you intend to publish your analysis?

Allison Delong: We do -- a publication is in the works. We think that the Klempner study is quoted quite a bit and I think it would offer clinicians and whatnot some insight to - especially the Klempner portion of this. And, yes, it’s in draft form and perhaps CID would be a really nice place to publish this. So we do, we do, have every intention of continuing along these lines.

Dr. Baker: Are there other questions from the panel? Dr. Sanders?

Dr. Sanders: Your analysis was to look specifically at the positive impact or the potential positive impact of the antibiotics. You didn’t look at the adverse events. You didn’t reanalyze the data on the adverse events from these trials, right?

Allison Delong: As a statistician, I looked at the numbers. But, I will say, clinical trials are rarely of the sample size needed to really quantify adverse events. So, I mean, usually, when you have a new drug, you do post-marketing surveillance. I'm certain there must be data out there on adverse events of IV treatments for other things, for Lyme, whatnot, that could be combined and looked at. I was thinking -- this is totally me being a statistician -- but looking at some sort of survival analysis, where you see if the hazard or risk of developing an adverse event changes through time, you know, the longer you’re taking IV or whatnot. But so many treatments have not been studied. All of these were IV therapy and that’s all you can say. They looked at, you know, two populations in this whole group. The people got Lyme, they were treated and they still have symptoms. We know a little bit about this group. We know a little bit about this group. We don’t even know if they return to their pre-morbid health. But this other group, we don’t know if treatment is ineffective or effective. We need to know why it’s effective in some groups. We need to know a lot, I think.

Dr. Baker: Thank you very much for your presentation. I'm going to take the chair’s prerogative; we will have a 5-minute break. It will be only 5 minutes, so Dr. Johnson, please be ready for your presentation.