Forum: "Systematic pattern of excess success"
8.5 years ago

Hi all!

Just came across this paper (Francis et al.), one from the new "too good to be true" trend, and now I'm totally confused.

The paper provides a statistical estimate of excess success in a set of studies published in Science, using the reported effect and sample sizes as well as the null hypothesis acceptance/rejection ratio. It's quite intuitive that when authors support their finding with 20 t-tests with extremely low effect sizes and P-values very close to 0.05, the finding seems very questionable.

To investigate this problem, the authors have chosen the P-TES (Test for Excess Significance) metric, calculated as the product of the probabilities of each statistical test's success given its effect size, e.g.

"The estimated probability that five experiments like these would all produce successful outcomes is the product of the five joint probabilities, P-TES = 0.018."

As each probability of success is <= 1, given a paper with a long list of experiments, it is highly likely that we end up with P-TES < 0.05. In other words, the P-TES score is heavily dependent on the complexity of the phenomenon under study.
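To see the arithmetic concretely, here is a toy sketch (with made-up per-experiment success probabilities, not values from the paper) of how the product shrinks as experiments accumulate:

```python
from math import prod

# Hypothetical per-experiment success probabilities (powers);
# none of these numbers come from the Francis et al. paper.
powers = [0.8, 0.75, 0.9, 0.7, 0.85]

print(round(prod(powers), 3))      # 0.321 for five experiments
print(round(prod(powers * 3), 3))  # 0.033 for fifteen: below .05
```

Even when every individual experiment is reasonably well powered, a long enough list of all-successful experiments drives the product below .05, which is exactly the concern about complex phenomena above.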

The authors suggest extending their methodology to check papers in the field of biology. As for bioinformatics, we usually provide lots of complementary analyses for a phenomenon under study, e.g. performing RNA-Seq, Methyl-Seq and ChIP-Seq under multiple conditions for a given transcription factor, checking for its motif over-representation, etc. Would this automatically render a thorough bioinformatics analysis as having an "excess probability of success"?

Am I missing something critical here?

paper statistics
8.5 years ago

I don't get this article, or these types of articles in general ("Why most research findings are false", etc.).

The original authors of the paper that started this trend (Ioannidis et al.) found a paradoxical niche where they make use of p-values and statistics to prove that other p-values and statistics must be wrong - a claim revisited by the paper above. It sort of goes with the groupthink and envy associated with high-profile papers: "other people's Nature papers suck, mine would never".

I think p-values have many problems - especially in the life sciences (and social sciences), where the dimensionality of the problem and that of unknown factors is substantially larger. But the solution is not another p-value-based test.


I think that Francis et al. make some good points; he also published another paper: Francis, G. "Too much success for recent groundbreaking epigenetic experiments." Genetics 198.2 (2014): 449-451. Any scientific domain that relies heavily on statistical analyses is open to such criticism...


I followed Dias and Ressler's (2014) treatment of the experiments as statistically independent, so the probability of a set of 10 behavioral experiments like these all succeeding is the product of the probabilities: 0.023.

The assumption that experiments are statistically independent is a rather strong one. I don't think a finding with P = 0.04 has that much impact, yet considering a scientific paper as a set of unconnected experiments seems rather strange. I believe a more thorough approach should be used instead, e.g. considering a tree-like structure of decisions to perform a certain experiment and using a Bayesian framework to calculate the joint probability of success.


Experiments do not have to be unconnected to be statistically independent. Statistical independence is what makes multiplying the probabilities appropriate. The experiments themselves are connected by the authors' proposed relation to their theory.


I agree that statistical independence is quite a different notion from the logical connectivity of a paper. Still, the following is very common in research:

One performs a pilot experiment and observes a high value of variable X in the group of interest. Then, given that it is known from the literature that X and another variable Y have a high positive correlation, one would also be interested in checking what happens with Y, given that it characterizes an important factor. Would it be right to just multiply the probabilities of success, P(X > x0) < 0.05 and P(Y > y0) < 0.05, in this case?


If you are measuring X and Y in a common sample, then you have to take the correlation between them into account, which will always give you a smaller value than just multiplying the individual probabilities of X and Y. You can see examples of this in the PLOS One paper, where sometimes the original paper reported multiple measures from a single sample and reported the correlation between them (or the sample correlation can be computed from other statistics). We then used Monte Carlo simulations to estimate the probability of success for both measures.


Indeed, in this case P(X, Y) != P(X) * P(Y) for observables X and Y, but you are multiplying probabilities of more complex events, i.e. P(p-value < 0.05).

And given that the power reflects the probability of rejecting the null hypothesis when it is false, the probability in question also depends on a binary random variable: the state of the null hypothesis in a given experiment. So to multiply those probabilities, you must ensure that the null hypotheses are also independent, while they could actually be dependent for a linked set of experiments. Please correct me if I'm wrong.


I'm not sure I follow your comment, but I will try to address it. When a successful outcome is to reject the null, we estimate the probability of success by taking the observed effect size and using it to estimate experimental power. It's not a binary random variable, because the effect size estimates the magnitude of the effect. Thus, p = 0.03 would give an estimated power of 0.58, while p = 0.01 would give an estimated power of 0.73. (The full calculation involves computing an effect size, which requires knowing the sample sizes, and then estimating power from the effect size and sample sizes. To a first approximation, you can go straight from the p-value to power.)
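The first-approximation route from p-value to power can be sketched as follows (a z-test approximation; my reconstruction, not the paper's actual code):

```python
from statistics import NormalDist

def estimated_power(p_value, alpha=0.05):
    """Approximate post-hoc power from a two-sided p-value,
    treating the observed z-score as the true effect size."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = nd.inv_cdf(1 - alpha / 2)    # 1.96 for alpha = .05
    # Probability that a replication's z exceeds the criterion in
    # either direction, assuming z ~ Normal(z_obs, 1)
    return (1 - nd.cdf(z_crit - z_obs)) + nd.cdf(-z_crit - z_obs)

print(round(estimated_power(0.03), 2))  # 0.58
print(round(estimated_power(0.01), 2))  # 0.73
```

This reproduces the 0.58 and 0.73 figures quoted above for p = .03 and p = .01.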

You are correct about the dependence of the hypotheses. For example, in psychology it is common to look for a significant interaction and then look at contrasts to help understand the interaction. Often a successful experiment requires a significant interaction and a particular pattern of significant and non-significant outcomes for the contrasts. We took all that into account with our Monte Carlo simulations.
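As an illustration of the Monte Carlo idea (with assumed effect sizes and correlation, not numbers from the paper), one can estimate the probability of a required pattern of outcomes, e.g. X significant while Y is not, for two correlated measures:

```python
import math
import random
from statistics import NormalDist

# Hypothetical sketch: probability that a study reproduces a required
# *pattern* of outcomes for two correlated measures. All numbers below
# are illustrative assumptions.
random.seed(3)
nd = NormalDist()
z_crit = nd.inv_cdf(0.975)            # two-sided .05 criterion
mu_x, mu_y, rho = 2.5, 0.5, 0.4       # assumed true effects and correlation

hits = 0
trials = 200_000
for _ in range(trials):
    e1 = random.gauss(0, 1)
    e2 = rho * e1 + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
    x_sig = abs(mu_x + e1) > z_crit
    y_sig = abs(mu_y + e2) > z_crit
    if x_sig and not y_sig:           # the "successful" pattern
        hits += 1

print(hits / trials)                  # estimated probability of the pattern
```

The same loop handles any combination of significant and non-significant requirements; the correlation structure is what makes the naive product of marginal probabilities inappropriate.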

There were only a few cases in the PLOS One paper where tests were performed between experiments. Sometimes that prevented us from analyzing a paper (because we could not estimate success for four or more experiments).

In short, the TES is kind of like a model checking procedure. We suppose that the theory is correct and that the effects are as identified by the reported experiments. With that as a starting point, we estimate the probability of the reported degree of success, as defined by the hypothesis tests, using the same analysis as was used by the original authors.


I agree that it appears like a vicious circle of p-values. Anyway, what seems strange to me is that even if the same result were reproduced with P = 0.03 by three independent groups, it would become even more suspicious according to the proposed framework. As for me, just showing a boxplot with outliers and reporting the effect size is far more informative for distinguishing important findings from p-hacked ones.


Even for the same effect and the same sample sizes, the p value should vary (a lot) from study to study, just due to random sampling. If four out of four experiments produce a p-value less than the .05 criterion, then that suggests that the experimental design (taking into account the sample sizes and the effect size) should usually produce a p-value much smaller than .05. If experiments often produce p-values around .03, then they should sometimes produce values larger than .05. The absence of the non-significant findings suggests that something is wrong in reporting, sampling, analyzing, or theorizing.
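A quick simulation (a hedged sketch using a simple z-test model, not the paper's code) makes the point: if the true effect is of a size that typically yields p ≈ .03, a large fraction of replications miss the .05 criterion:

```python
import random
from statistics import NormalDist

random.seed(1)
nd = NormalDist()
true_z = nd.inv_cdf(1 - 0.03 / 2)   # effect size that typically gives p ≈ .03

trials = 100_000
nonsig = 0
for _ in range(trials):
    z = random.gauss(true_z, 1.0)   # a replication's observed z-score
    p = 2 * (1 - nd.cdf(abs(z)))    # two-sided p-value
    if p >= 0.05:
        nonsig += 1

# Close to the theoretical ~0.417: over 40% of replications
# of such an experiment should come out non-significant.
print(nonsig / trials)
```

So a run of exclusively significant results at p ≈ .03 is itself statistically surprising, which is what the TES formalizes.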


I was simply saying that if one takes all papers on a given phenomenon, say 50 papers, each containing 10 experiments with a probability of success of 0.99, one automatically gets P-TES < 0.01. Is there any way to correct for this? On the other hand, if one measures enrichment for 10,000 ChIP-seq peaks, one could get P = 10^-20 simply due to some bias, which won't be reproduced at all in subsequent studies with careful controls.


There is nothing to correct. For 500 experiments that each have a success probability (power) of 0.99, the expected number of successful (significant) outcomes is 500 * 0.99 = 495. Now, in this case we can see that we are only overly successful by 5 experiments, but in any real world situation we do not know that the true power is 0.99. So, when we see 500 successful experiments with an estimated power of 0.99 all we know is that something is odd. We do not know how odd things are, so an experiment set like that should be carefully scrutinized for sources of bias. Of course, the whole analysis is based on the assumption that the 500 studies are related to a single theory. If they are just 50 papers studying different things, then we need not be concerned. That is, we would be introducing the bias ourselves by grouping these studies together and leaving out other (maybe non-significant) studies.
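The arithmetic behind both points is easy to check (a sketch; 0.99 is the assumed per-experiment power from the comment above):

```python
# Probability that all 500 experiments succeed if each has power 0.99
p_all = 0.99 ** 500
print(round(p_all, 4))    # 0.0066: "excess success" despite very high power

# Expected number of significant outcomes out of 500
print(round(500 * 0.99))  # 495, i.e. about 5 non-significant results expected
```

Seeing 500 out of 500 when ~495 are expected is the "something is odd" signal, even though the shortfall is only five experiments.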

8.5 years ago

Excess success depends on the details of the findings and on how they are interpreted. If your bioinformatics investigation requires many statistical outcomes to show success, then each outcome needs to have high signal relative to the noise (otherwise, there is a high probability of at least one of those outcomes not working).
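To put a number on that (a back-of-the-envelope sketch assuming independent outcomes): for all k required outcomes to succeed with, say, 80% overall probability, each individual outcome needs power 0.8 ** (1/k):

```python
# Per-outcome power needed so that k independent outcomes all succeed
# with overall probability 0.8
for k in (1, 5, 10, 20):
    print(k, round(0.8 ** (1 / k), 3))
```

So twenty jointly required outcomes each need roughly 99% power, which is the "high signal relative to the noise" requirement above.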

If your investigation can allow for some discrepancies (and you report them as discrepancies), then excess success most likely will not be indicated. None of the Science papers had any outcomes reported as discrepancies.


My explanation for what is observed is that one can't actually publish a paper in Science unless all the experiments support it - regardless of what the statistics say. Having even one experiment contradict it would raise way too many issues with reviewers, who typically are not all that well trained in the art of statistical inference.

Hence what the TES measure quantifies is the problem of the selection process and not that of the quality of the research.


But the selection process and the quality of the research are related. If one null result invalidates the conclusions, then it might make sense for Science to not publish it (whether one null result really does invalidate the conclusions is a separate issue). Nevertheless, findings do not support their conclusions when it is apparent that there should be unsuccessful findings, but none are shown. That just indicates that something is wrong in the reporting. We cannot know what went wrong so, without further information, we have to be skeptical about the findings and the theoretical claims.


From your responses I assume that you are the first author of the paper in question. First and foremost, we'd like to thank you for participating and discussing your paper.

What I am suggesting is that the reviewing methods of high-profile journals may be such that they are more likely to filter out works that are valid but contain one or more pieces of contradictory evidence. We all know that reviewing can be very subjective. Hence the sample may already be biased. Whether or not that is true is debatable, and perhaps is a study on its own.


You may be right about the journal filtering process leading to a high rate of excess success articles. What I mostly care about is the relation between the data and the theory. The reason those articles are biased and the reason they are published in Science is of less interest (to me).