The Wald method has been the subject of extensive criticism by statisticians for exaggerating results

Paul Nee sends in this amusing item:

MELA Sciences claimed success in a clinical trial of its experimental skin cancer detection device only by altering the statistical method used to analyze the data in violation of an agreement with U.S. regulators, charges an independent healthcare analyst in a report issued last week. . . The BER report, however, relies on its own analysis to suggest that MELA struck out with FDA because the agency’s medical device reviewers discovered the MELAFind pivotal study failed to reach statistical significance despite the company’s claims to the contrary.

And now here’s where it gets interesting:

MELA claims that a phase III study of MELAFind met its primary endpoint by detecting accurately 112 of 114 eligible melanomas for a “sensitivity” rate of 98%. The lower confidence bound of the sensitivity analysis was 95.1%, which met the FDA’s standard for statistical significance in the study spelled out in a binding agreement with MELA, the company says.

The binding agreement between MELA and FDA covering the conduct of the MELAFind study required the company to analyze the data using a statistical method known as the “exact mid-P” test.

Clearing the 95% hurdle for the lower confidence bound using the mid-P test is important because that means if the phase III study were repeated, there would be a 5% chance or less that MELAFind’s sensitivity to detect melanoma would be below 95%. . . .

In its report this week, the research shop BER says it was unable to reproduce MELA’s positive results when it ran the MELAFind data independently using off-the-shelf statistics software.

“For the protocol-specified outcome group, the LCB [lower confidence bound] using the exact mid-P method is 92.64% — a far cry from the reported 95.1% lower confidence bound,” the BER report states.

BER then ran the MELAFind data analysis again using a bunch of other statistics methods. What BER discovered was that MELAFind’s publicly stated statistical results matched only when the “Wald normal approximation” method was used.

Hmmm… “Exact mid-P” doesn’t sound so great to me either. Perhaps this could all be put into some decision-theoretic framework?
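
For anyone who wants to poke at the arithmetic behind these dueling numbers, here is a quick sketch in Python of the two calculations at issue: the Wald (normal-approximation) lower confidence bound and the one-sided exact mid-P lower bound, for x detected melanomas out of n. Whether either reproduces the reported 95.1% or 92.64% depends on which analysis population and which one- versus two-sided convention was specified, and the press account doesn't pin that down.

```python
# Sketch: Wald vs. exact mid-P lower confidence bounds for a binomial
# sensitivity estimate (x detections out of n melanomas). Illustrative only.
import numpy as np
from scipy import optimize, stats

def wald_lcb(x, n, alpha=0.05, two_sided=False):
    """Normal-approximation (Wald) lower confidence bound."""
    phat = x / n
    z = stats.norm.ppf(1 - alpha / 2 if two_sided else 1 - alpha)
    return phat - z * np.sqrt(phat * (1 - phat) / n)

def midp_lcb(x, n, alpha=0.05):
    """One-sided exact mid-P lower bound: the p solving
    P(X > x | p) + 0.5 * P(X = x | p) = alpha, with X ~ Binomial(n, p)."""
    def excess(p):
        return stats.binom.sf(x, n, p) + 0.5 * stats.binom.pmf(x, n, p) - alpha
    return optimize.brentq(excess, 1e-9, 1 - 1e-9)

x, n = 112, 114  # the counts MELA reported for eligible melanomas
print(f"Wald one-sided 95% LCB:  {wald_lcb(x, n):.4f}")
print(f"Wald two-sided 95% LCB:  {wald_lcb(x, n, two_sided=True):.4f}")
print(f"mid-P one-sided 95% LCB: {midp_lcb(x, n):.4f}")
```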

21 thoughts on “The Wald method has been the subject of extensive criticism by statisticians for exaggerating results”

  1. There's more than just a bad quadratic approximation going on here – and this has to be the simplest problem in statistics: _only_ a single unknown parameter and a random sample!

    Real nice paper and R code from Michael Fay on this simplest problem in statistics at http://journal.r-project.org/archive/2010-1/RJour

    As for mid-p type adjustments, I spent a lot of time discussing this with clinical researchers in the 1990s. Because mid-p breaks up rejection regions where the likelihood is constant, it is equivalent to a post-randomized test. That is: based on the data, now flip a coin – if tails, less than .95; if heads, greater than .95.

    When they grasped this, they decided it was fine. You need _rules_, and although this randomization is arbitrary, it's beyond the control of the investigator. (In the particular technique we were looking at, the randomization was based on early versus late trends in the data, and assuming there was no trend, that made it random.)
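
    If it helps to see the equivalence concretely, here is a little sketch (my own toy illustration, not from Fay's paper): the mid p-value is just the expected value of the classical randomized p-value P(X > x) + U*P(X = x) with U uniform on (0, 1), so using it amounts to averaging over the coin flip rather than actually flipping the coin.

    ```python
    # Sketch: the mid p-value equals the average of the randomized p-value.
    # Toy numbers only (x detections out of n, null sensitivity p0).
    import numpy as np
    from scipy import stats

    x, n, p0 = 112, 114, 0.95
    rng = np.random.default_rng(1)

    exact_p = stats.binom.sf(x - 1, n, p0)                    # P(X >= x | p0)
    mid_p = stats.binom.sf(x, n, p0) + 0.5 * stats.binom.pmf(x, n, p0)

    # Randomized p-value: P(X > x) + U * P(X = x), with U ~ Uniform(0, 1)
    u = rng.uniform(size=1_000_000)
    randomized = stats.binom.sf(x, n, p0) + u * stats.binom.pmf(x, n, p0)

    print(f"standard exact p-value : {exact_p:.4f}")
    print(f"mid p-value            : {mid_p:.4f}")
    print(f"mean randomized p-value: {randomized.mean():.4f}  # matches the mid p-value")
    ```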

    K?

  2. I'm always curious about how much referees check the numbers in published journal articles. It'd seem to be impossible to check all the tests, especially if there is not a protocol to supply all the underlying data. And if the referees had to check everything, it'd be like re-doing the entire analyses. But if that means we give up on checking, then it would be extremely easy for anyone to fake results. Perhaps someone should do a random sample of studies to figure out how prevalent false reporting of p-values is in the literature.

  3. Kaiser: I fear it's quite high. There was a study of a convenience sample of science journals (reference not at hand) that calculated the percentage of articles with numerical inconsistencies – numbers in the paper that did not add up – and even that was alarmingly high, I think 10 or 15%.

    I once tried to get funds to study this and got rather nasty reviews, the most telling perhaps being: "I used to have reproducibility problems in my group – but I have fixed it."

    Now with regulatory agencies and especially the FDA – all the data has to be supplied and all or most of the analyses are re-done.

    K?

  4. I'd be for some form of random audits and reviews of academic research, which review lab performance from data collection to reporting.

  5. Yay mid p-value. Tests on discrete distributions cannot maintain a type I error exactly equal to their level. The standard exact test takes the position that the error should be less than the level. The mid p-value chooses the rejection region so that it is approximately equal to its level. Thus, though the mid p-value may have error greater than alpha, its error will always be closer to alpha than that of the standard exact test.

    The mid p-value is a very principled method to use, especially when the goal is to create confidence intervals.

    An article I wrote about the mid p-value under the 'estimated truth' decision theoretic framework:
    http://www.informaworld.com/smpp/content~content=

    Also, for those interested, check out:

    A. Agresti and B. A. Coull. Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician, 52(2):119–126, 1998.

    It gives good justification not to use the Wald interval or the standard exact interval.
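
    To make the attained-level point concrete, here is a rough sketch (toy settings of my own choosing, not the trial's design) that computes the exact size of the one-sided test of H0: p <= 0.95 at nominal alpha = 0.05, under the standard exact rule and under the mid-P rule:

    ```python
    # Sketch: attained type I error of the one-sided binomial test of
    # H0: p <= p0, rejecting when the p-value is <= alpha. Toy settings only.
    from scipy import stats

    def attained_size(n, p0=0.95, alpha=0.05, mid=False):
        """Exact P(reject | p = p0) for the standard exact or mid-P rule."""
        size = 0.0
        for x in range(n + 1):
            if mid:
                pval = stats.binom.sf(x, n, p0) + 0.5 * stats.binom.pmf(x, n, p0)
            else:
                pval = stats.binom.sf(x - 1, n, p0)   # P(X >= x | p0)
            if pval <= alpha:
                size += stats.binom.pmf(x, n, p0)
        return size

    for n in (50, 114, 200):
        print(f"n = {n:3d}   exact size: {attained_size(n):.4f}"
              f"   mid-P size: {attained_size(n, mid=True):.4f}")
    ```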

  6. I have no idea what tests should be used. But I disagree that "92.64% is a far cry from 95%." I actually think these are quite close! I realize you have to draw the line somewhere, and if the line is at 95% then I guess these people didn't make it. But it doesn't seem like the worst thing in the world if this one slips through to the second round.

    Also, it seems a bit silly to focus on a derived statistic like this when the raw numbers are so easy to understand. They detected 112 out of 114 melanomas. Apparently that's not good enough. Why not just say "the test must detect at least 113 out of 114 melanomas in order to progress to the second round"? I guess I'm saying this: if I understand the problem correctly, then for any number of melanomas there is a number that must be detected in order for the test to progress to the next stage. In this kind of situation, rather than specifying a particular statistical test, the FDA should provide a table: you can detect 90/90, 91/91, …, 111/112, 112/113, 113/114, … and so on. Seems a lot simpler this way.

    Overall I am disappointed that so many people focus on "statistical significance" and I agree with Andrew that rather than being based on lines in the sand, decisions should take more information into account. If a test or a drug is cheap, painless, and has few side effects, it should be approved on less evidence of benefit than one that is expensive, painful, and loaded with negative effects.
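
    (Coming back to the table idea: here is a sketch of how such a table could be generated, taking a one-sided mid-P 95% lower bound as the underlying rule. That is my reading of the criterion, so treat the cutoffs as illustrative rather than as the FDA's numbers.)

    ```python
    # Sketch of the table: for each number of melanomas N, the smallest number
    # that must be detected so the one-sided mid-P 95% lower confidence bound
    # on sensitivity clears 0.95. Illustrative reading of the rule only.
    from scipy import stats

    def midp_pval(x, n, p0):
        """One-sided mid p-value for H0: sensitivity <= p0."""
        return stats.binom.sf(x, n, p0) + 0.5 * stats.binom.pmf(x, n, p0)

    def min_detections(n, p0=0.95, alpha=0.05):
        """Smallest x with mid-p <= alpha (the mid p-value decreases in x)."""
        for x in range(n + 1):
            if midp_pval(x, n, p0) <= alpha:
                return x
        return None  # no passing count: the sample is too small

    for n in (90, 100, 110, 114, 120, 150):
        k = min_detections(n)
        print(f"N = {n:3d}: must detect at least {k} ({k}/{n})")
    ```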

  7. Hi Phil, I completely agree that, in general, too much is made of statistical significance. However, regardless of this, the article possibly raises other relevant issues regarding this diagnostic test, questioning its usefulness. I'll quote them here:

    "MELAFind's 98% accuracy in detecting melanoma "skin cancers (the device's sensitivity) was enhanced by the fact that all the lesions entered into the study were initially flagged as being possibly cancerous by dermatologists. This stacked the deck in MELAFind's favor, making it easier for the device to accurately diagnose melanoma."

    "MELA argues that MELAFind is still more accurate than the trained, experienced eyes and diagnostic ability of dermatologists. Yet the company has no data to prove that MELAFind's sensitivity is greater than that of dermatologists who participated in the study."

    "Despite claims to the contrary, MELAFind does not significantly reduce the number of biopsies performed by dermatologists."

  8. Phil – for most, the line in the sand is just the starting point rather than the end of it. It sets the tone and breadth of the additional information that should be taken into account.

    K?

  9. "Clearing the 95% hurdle for the lower confidence bound using the mid-P test is important because that means if the phase III study were repeated, there would be a 5% chance or less that MELAFind's sensitivity to detect melanoma would be below 95%."

    1. Ugh.

    2. They're not actually testing at alpha = 0.05, are they? I got a higher normal lcb (0.962 one-tailed, 0.958 two-tailed) when I tried to reproduce this, but I didn't get much sleep last night.

    3. I have the feeling that an exact test is the right thing to do for a one-tailed test (given the weird set-up of the problem) but would have to think about how to justify this.

  10. K? and Phil: I side with K? on this one. While it's an easy target to say 95% or any other number is arbitrary, it's better than having no standard, which would make it open season for anyone to claim whatever they want. And yes it's a starting point, not an endpoint.

    If a binary decision has to be made, regardless of how you come up with the decision, there will be an effective cutoff threshold for the p-value so I don't believe there is really any method that is immune from the "arbitrary cutoff" critique.

  11. @jimmy, I guess I went too far when I said it doesn't seem all that bad if this sneaks through to the second round even though it failed by a whisker: the full article makes it sound like the trial is stacked in the first place. But that's a different issue from what test to apply and so on. I was just commenting on the material in this blog post, which is about determining the "sensitivity" of the test. The company claims that 112/114 meets the criteria threshold, whereas BER claims that it missed by a mile because they "only" got a 92.5% lower confidence bound rather than 95%. If this is correct, then the only way they could have gotten the requisite confidence bound would have been to detect 113/114 melanomas.

    It's OK with me if the FDA decides that the test has to go 113/114 in order to gain approval. But I think that if they're going to do that, they may as well say it that way: "If there are N melanomas in the sample, the test must detect at least k of them, where k can be determined from N by looking at the following table." It seems a bit silly — not necessarily wrong, but needlessly complicated and obscure — to instead say that they have to achieve a certain lower confidence bound by applying the exact mid-P method. The latter and the former contain the same information; it's just that one is simple and easy to understand, and the other isn't.

    @K? and @Kaiser, you say that for most people there isn't a line in the sand, the calculation is just the starting point, not the endpoint. You may be right, but in THIS case there is a line in the sand. It is indeed "an endpoint", if I understand correctly: since they didn't achieve the required number, they're done.

    You are right, though, that whatever quantitative method you use, you have an "arbitrary" cutoff somewhere. (In fact, Andrew and I wrote a decision analysis paper that I still think is quite good, where we analyzed a decision in which a whole bunch of continuous distributions and uncertainties come into what ultimately has to be a yes/no decision.) But I don't think that, in most real-world cases, it makes sense to draw that line based on only a single parameter. In this case, the lower confidence bound seems to be the only thing that matters in deciding whether to approve more tests! Shouldn't the cost of the test matter too? And the inconvenience? And the speed? And the side effects? I think they should.

  12. @Phil,

    The FDA is saying that you must demonstrate that the device has better than 95% sensitivity using a principled statistical test at alpha=.05. The company is the one that chose the mid p-value as their metric. In my view this is a good choice. The Wald test is bad generally, and especially bad near 0 and 1. A score test would be better than the Wald, but some form of mid-p is optimal (in my mind). A standard exact test would have also been more than acceptable to the reviewers (I imagine).

    The bar of proof set by the FDA seems pretty simple and concise to me. Certainly it is simpler than printing tables of cut-off points for each possible study sample size.

    Beyond the choice of the type of test, the hinkiness comes in with the post-hoc picking of whichever test minimizes the p-value. This type of decision rule can lead to wildly inflated type I error even when the individual tests all maintain their level.

    In the case of diagnostics, 92.64% can be a far cry from 95%. At 92.64% sensitivity you miss 7.36% of melanomas instead of 5%, nearly 1.5 times as many missed diagnoses as at 95%.

    I definitely agree with you that the sensitivity should in no way be the only, or even the primary metric in the decision to continue study. I suspect that it wasn't the only thing the panel considered when making their judgment.
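
    To put a rough number on the test-shopping point, here is a small simulation sketch (arbitrary example settings, not the actual trial design) comparing each one-sided test's rejection rate at the null boundary with the rate you get by always reporting whichever of the three gives the smallest p-value:

    ```python
    # Sketch: post-hoc test shopping. Simulate at the null boundary p0 = 0.95
    # and compare each test's rejection rate with the rate obtained by always
    # picking whichever test happens to give the smallest p-value.
    import numpy as np
    from scipy import stats

    def pvals(x, n, p0):
        """One-sided p-values for H0: p <= p0: standard exact, mid-P, Wald."""
        exact = stats.binom.sf(x - 1, n, p0)
        mid = stats.binom.sf(x, n, p0) + 0.5 * stats.binom.pmf(x, n, p0)
        phat = x / n
        se = np.sqrt(max(phat * (1 - phat), 1e-12) / n)   # guard against phat = 1
        wald = stats.norm.sf((phat - p0) / se)
        return np.array([exact, mid, wald])

    rng = np.random.default_rng(2)
    n, p0, alpha, reps = 114, 0.95, 0.05, 20_000
    rejections = np.zeros(4)   # exact, mid-P, Wald, "best of the three"
    for _ in range(reps):
        x = rng.binomial(n, p0)
        p = pvals(x, n, p0)
        rejections[:3] += p <= alpha
        rejections[3] += p.min() <= alpha

    for name, rate in zip(["exact", "mid-P", "Wald", "min of three"], rejections / reps):
        print(f"{name:13s} rejection rate at the null: {rate:.3f}")
    ```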

  13. Ian: Right – arbitrary but not post-hoc is optimal – and given you are going to break the likelihood principle anyway, full speed ahead and damn the torpedoes!

    Phil: While still remaining somewhat anonymous – I learned early in my clinical research career that when the senior people wade in, the statistical significance thingy is just a starting point that sets the tone for the consideration of all else. Without becoming distracted by this particular case, my guess is it would not be that much of a burden for the sponsor to do another study – that's what study analyses or meta-analyses are really about: do we know enough to make a decision now, or should other studies be done, and if so, how should they be done? For treatments for weaponized pathogens, animal studies are more than adequate!

    K?

  14. The gold standard for diagnosing malignant melanoma is the biopsy followed up by a pathology report. The typical dermatologist tends to have a high false positive rate: they will biopsy anything that has the remotest chance of being positive. On the other hand, if I'm remembering correctly, most GPs have a high false negative rate. The idea of the MELA machine is to give the patient a diagnosis in advance of the pathology report. The inventor of the MELA machine (who I know personally) once asked me what I thought it was worth to have an early diagnosis. I said "no more than fifty dollars." Another and perhaps more important use of the machine is for mass screening, so the false negative rate is very important. Phil is wrong in thinking there's not much difference between 95% and 92.64% in this context. The former means 500 cases go undetected per 10,000 melanomas screened, while the latter means about 740. An extra 240 or so missed detections is a lot. Those people will go away thinking they're OK when they are not. At this point, the FDA should license the machine only for advance diagnosis; then the false negatives will get picked up by the biopsy. For mass screening, I should think the machine should detect a melanoma with better than a .99 probability, with an associated false positive rate of no more than 66%.

  15. @Ian, you say applying the exact mid-P method "is simpler than printing tables of cut-off points for each possible study sample size." I don't get it. According to my understanding, you would just need a single table, and indeed it could be very short since you really just need to list the maximum number of failures:
    Up to N = 88,

  16. Phil,

    No, my claim is that the statement:

    'You must demonstrate that the device has better than 95% sensitivity using a principled statistical test at alpha=.05.'

    is simpler than printing out a bunch of tables.

  17. @Ian, I still don't see why there would be "a bunch of tables." Why isn't it just one table?

    I think it's just one table. I think it is easier to look up a single number in a table than it is to perform any kind of statistical test based on the number. Perhaps we simply disagree on that. In this case, though, note that MELA and BER disagree on the outcome of the test; surely (?) they wouldn't disagree on what number is written next to an entry in a table! So even if you won't agree with "simpler", perhaps you will grant me "less fallible"?

    This is, however, not my main point. My main point is that this seems to me to be a poor rule for making a decision.

  18. CH – happy to see the plots on page 21

    Any comments might be rash without carefully reading the other 50-something pages.

    But those plots remind me that we usually lie to people about there being a p-value – it's really a p-value function!

    Here it's a function of the unknown parameter and, more generally, also a function of unknown nuisance parameters. Part of the problem might be that there may be some open questions about how to define or calculate these functions – i.e., not fully worked out yet.

    Someone could work out such a plot for Phil's suggested method – would be nice to see it.
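
    To be concrete about what I mean by a p-value function, here is a minimal sketch that traces the one-sided mid p-value across a grid of hypothesized sensitivities for the 112/114 count; the 95% lower confidence bound is just where that curve crosses 0.05.

    ```python
    # Sketch: the p-value as a *function* of the hypothesized sensitivity p0,
    # using the one-sided mid p-value for x = 112 detections out of n = 114.
    import numpy as np
    from scipy import stats

    x, n = 112, 114
    grid = np.linspace(0.90, 0.999, 500)
    midp = stats.binom.sf(x, n, grid) + 0.5 * stats.binom.pmf(x, n, grid)

    # The 95% lower confidence bound is where the (increasing) curve crosses 0.05.
    lcb = grid[np.searchsorted(midp, 0.05)]
    print(f"approximate one-sided mid-P 95% LCB: {lcb:.3f}")

    # To see the whole function rather than a single number:
    # import matplotlib.pyplot as plt
    # plt.plot(grid, midp); plt.axhline(0.05, linestyle="--"); plt.show()
    ```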

    K?

  19. The old (Peircian) 1-2-3

    Hypothesis: “The BER report, however, relies on its own analysis to suggest that MELA struck out with FDA because the agency’s medical device reviewers discovered the MELAFind pivotal study failed to reach statistical significance despite the company’s claims to the contrary.”

    Brute Force Reality: “Rather than focus on the binary decision of whether primary aim A1 was met or not, all of the analysis methods can be said to show borderline significant results for sensitivity being greater than 95% using a one sided test.” (from CH’s FDA report link)

    Lesson of 1&2: All hypotheses/models are false and maybe in this case a tendency to blame failure on technical issues not widely understood? (But then this seemed a productive mistake for this blog or at least for me)

    K?
