NGS forensics: how to know if data is fabricated
12 days ago
noodle ▴ 580

Hi all,

I tried to reproduce the results of a published paper and was unable to. I have now dug deeper into their NGS data (Illumina HiSeq, hosted on SRA) and see that the data was clearly 'cleaned'**. I am also now thinking the data might have been outright fabricated.

I'm wondering if others have encountered this and 1) how to verify that the data is fake (from a technical, forensic standpoint), and 2) if the data is fake, how to handle the situation.

**I believe the data was cleaned because 100% of reads pass cutadapt, even though 70% of reads contain adapters and get trimmed. I find this situation to be impossible (but please correct me if I'm wrong!).

TIA!

fastq STAR NGS Illumina • 2.0k views

While I think this is an interesting case, I've found cleaned 'raw' data on SRA before. It happens: bioinformaticians receive the raw data, run a standard cleaning pipeline and pass it on to the lab people; 6 months pass, it's time to upload to SRA, the raw data is long gone, and only the first-pass cleaned data has survived on someone's external hard drive, so that is what goes to SRA. I've also received raw reads from sequencing providers with basic trimming applied, i.e., no adapters etc.


Yeah, some of the Illumina machines pre-trim adapters (not sure if this is still the standard). I've also definitely seen quality-trimmed data on GEO. But I'm not sure 'cleaned' is equivalent to fabricated.


My first question would be how strong your background in such analysis is. A claim of fabrication is very serious, so be 100% sure you can back it up and make sure it's not due to a potentially flawed analysis or interpretation. With the given information it is hard to comment in more detail.

Data being cleaned could simply mean that the raw fastq files were accidentally lost and only the trimmed ones were retained and uploaded. Not good obviously, but not fabrication.


My first question would be how strong your background in such analysis is.

Very strong. PhD+several years working in the field.

With the given information it is hard to comment in more detail.

To expand further, the data is RNA-seq. Some samples are antibody enriched and some are 'control' ribodepleted RNA. Based on picard 'CollectRnaSeqMetrics', the data looks like someone did shotgun sequencing of a genome and then spiked in exactly the data they wanted to get the peaks of the antibody enrichment. The supposedly directional reads have 40% 'wrong orientation', and intergenic reads are #1, followed by intronic at #2 and finally transcripts. rRNA content is unbelievably low (<0.01%). These alone would be hallmarks of DNA contamination, but because the RNA reads are exactly what the paper would want to see, I find it very hard to believe the data isn't faked.
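For anyone wanting to tabulate the same numbers across samples, here is a minimal sketch that pulls the category fractions out of a CollectRnaSeqMetrics output file and ranks them. It assumes the usual Picard metrics layout (a `## METRICS CLASS` line followed by a tab-separated header row and one value row) and the `PCT_*` column names as I remember them; check both against your own files. The function names are made up for illustration.

```python
def parse_picard_metrics(text):
    """Return the first metrics row of a Picard metrics file as a dict of strings.

    Assumes the layout: '## METRICS CLASS ...' line, then a tab-separated
    header line, then one tab-separated value line.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            return dict(zip(header, values))
    raise ValueError("no METRICS CLASS section found")


def rank_base_categories(metrics):
    """Sort the base-category fractions from largest to smallest.

    Column names are assumed; empty or missing columns are skipped.
    """
    columns = [
        "PCT_CODING_BASES",
        "PCT_UTR_BASES",
        "PCT_INTRONIC_BASES",
        "PCT_INTERGENIC_BASES",
        "PCT_RIBOSOMAL_BASES",
    ]
    present = {c: float(metrics[c]) for c in columns if metrics.get(c)}
    return sorted(present, key=present.get, reverse=True)
```

A typical poly-A or ribodepleted library would be expected to rank coding/UTR bases first; intergenic first with a low `PCT_CORRECT_STRAND_READS` is the pattern described above, which on its own is also what genomic DNA contamination looks like.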

I wrote here to ask the community whether an algorithm has been developed to check the reads for a pattern, or anything else that is non-random, that could be an absolute smoking gun for the data being fabricated.


I'd say post on pubpeer -- it's the best forum for this sort of discussion.

As for what additional analysis I recommend: I'd say look at splice junctions. All RNA-seq data should have a good amount of spliced transcripts, even in nucleus extracts. I'd say also look at gene body coverage. Usually, you'll see either something that's uniform/central, 3' biased, or 5' biased. Also, plot counts vs. transcript length.
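A rough way to get at the splice-junction point without extra tools: count how many mapped reads have an `N` operation in their CIGAR string, which is how spliced aligners such as STAR record a skipped intron. This is only a sketch; it assumes SAM text (e.g. piped from `samtools view`), and the function name is hypothetical.

```python
def spliced_fraction(sam_lines):
    """Fraction of mapped primary alignments whose CIGAR contains an N
    (skipped region, i.e. a read spanning a splice junction)."""
    mapped = 0
    spliced = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, cigar = int(fields[1]), fields[5]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x904 or cigar == "*":
            continue
        mapped += 1
        if "N" in cigar:
            spliced += 1
    return spliced / mapped if mapped else 0.0
```

The absolute value depends on read length and annotation, so it is most useful as a comparison against other RNA-seq and WGS datasets processed the same way.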

You can compare these things to dozens of other RNA-seq datasets from different assay types, different tissue/cell types, different experimental conditions, different library strategies, etc. If you look at 100 datasets (and your sampling is good), and that one dataset looks strikingly different than the remaining 99 datasets, that means there's an anomaly. You can then also try this for a few WGS datasets and see if that dataset looks similar to those.
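One simple way to put a number on "strikingly different" is a robust z-score (median and MAD) of a single QC metric across all the datasets. The sketch below is an illustration, not an established forensic test; the 3.5 cutoff is a common rule of thumb and the function names are made up.

```python
from statistics import median


def robust_z_scores(values):
    """Median/MAD-based z-scores; 0.6745 rescales MAD to be comparable
    to a standard deviation under normality."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Indices of datasets whose metric lies far from the bulk."""
    return [i for i, z in enumerate(robust_z_scores(values)) if abs(z) > threshold]
```

For example, feeding in the spliced-read fraction or the intergenic fraction from each of the 100 datasets would show whether the suspect one falls outside the range spanned by the others. An outlier says the library is unusual, not why.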

The "smoking gun" would be to have other labs repeat the same experiment and realize they can't reproduce the study's findings.


I'm a bit against pubpeer - have you ever posted there?

My gripe with them is that the posts are heavily moderated and EDITED to the point that the original post may lose its intent. Moderation is fine, but the editing part is wrong. IMO the right approach is to bounce a post back and forth between the moderators and the author until something is agreed on. But what happens is that a post is edited and posted without the original author's approval.


I don't follow your argument for distinguishing between genomic contamination and fraud. A data set being bad in terms of genomic contamination does not mean it's bad in other ways.

The other possibility is that the RNAseq library is very high quality, but there was a mix up with indexes at the sequencing facility.

In terms of a smoking gun, I think it would be almost impossible to distinguish between a well-done fraud and a messed-up experiment. You could look at the fragment size distribution; that is something someone might forget to vary if they were faking data, but a good fraud could easily fake that if they remembered.

If it were standard RNAseq, you could check the dispersion in read counts, but 1) this would, again, only be a sign of sloppy faking (a good faker could easily fix that), and 2) who knows what mixing pull-down RNAseq with genomic contamination would do to the count distribution.
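For reference, the dispersion check can be sketched with a method-of-moments estimate per gene: under a negative-binomial model, variance = mean + alpha * mean^2, so alpha = (variance - mean) / mean^2. Real biological replicates usually give alpha clearly above zero; values at or below zero across many genes would mean the replicates are no noisier than Poisson sampling. This is a simplification that assumes counts are already comparable across replicates (no library-size normalization), and the function name is hypothetical.

```python
from statistics import mean, variance


def dispersion_estimates(counts_by_gene):
    """Method-of-moments overdispersion per gene.

    counts_by_gene maps a gene name to its counts across replicates
    (at least two replicates). Genes with zero mean are skipped.
    """
    estimates = {}
    for gene, counts in counts_by_gene.items():
        m = mean(counts)
        if m == 0:
            continue
        v = variance(counts)  # sample variance
        estimates[gene] = (v - m) / m ** 2
    return estimates
```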


Can you post an IGV screenshot of what this looks like?


Working on it - it's actually a bit difficult. Low coverage of ribosomal proteins. Low coverage of housekeeping genes... I'm trying to find an area that visually depicts what I'm describing, but I'm having trouble because of the issues mentioned above - it's like someone combined 0.001x genome coverage with an RNA spike.


Can you clarify what you mean by "100% of reads pass cutadapt, even though 70% of reads contain adapters and get trimmed"? Did you set a minimum post-trimming size threshold? What was your cutadapt command line?


Can you clarify what you mean by "100% of reads pass cutadapt, even though 70% of reads contain adapters and get trimmed"?

The header of one output (granted, only 57.6% here), despite passing -m 20:

=== Summary ===

Total reads processed:              84,151,302
Reads with adapters:                48,541,812 (57.6%)
Reads written (passing filters):    84,151,302 (100.0%)

Relevant parts of the cutadapt command:

cutadapt -a AGATCGGAAGAGCACACGTCTGAAC -A AGATCGGAAGAGCGTCGTGTAGGGA -m 20 --nextseq-trim=11

In which case, I'd definitely look at the distribution of read lengths, post trimming, and see if there is a discontinuity in the distribution.
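A minimal sketch of that check, assuming an uncompressed four-line-per-record FASTQ stream: tabulate read lengths and look for the length, below the full read length, where the count jumps most sharply relative to the next-shorter length. A hard floor at or above the `-m` value would be consistent with the reads having already been length-filtered before upload, which would also explain 100% passing. The function names are made up for illustration.

```python
from collections import Counter


def read_length_histogram(fastq_lines):
    """Count read lengths from an iterable of FASTQ lines (4 lines per record)."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence line of each record
            counts[len(line.strip())] += 1
    return counts


def sharpest_step(hist):
    """Length (excluding the longest) with the largest count relative to the
    next-shorter length; returns None if all reads have the same length."""
    longest = max(hist)
    candidates = range(min(hist), longest)
    if not candidates:
        return None
    # +1 in the denominator avoids division by zero at an empty neighbour.
    return max(candidates, key=lambda n: hist.get(n, 0) / (hist.get(n - 1, 0) + 1))
```

For a gzipped file, passing `gzip.open(path, "rt")` as `fastq_lines` should work, since the function only needs an iterable of text lines.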

11 days ago
Mensur Dlakic ★ 27k

I share the opinion of Philipp Bayer. I have seen enough questionable things in life and in science to keep me permanently jaded, but there are non-fraudulent explanations in this case.

Once we asked a lab to provide a yeast strain from a study that published results slightly contradicting our previous analyses. Then we asked again. Never heard back from them. When the reviewers in subsequent papers asked us to reconcile our data with theirs, our response was that they never shared their strains. The reviewers didn't press us further. They understood that because this particular result was never confirmed by any other lab, the burden of proof must be with the original authors even though they managed to publish their work. Still, because of cultural differences and some other things I can't explain in detail, we thought there was a good chance that their results were produced with best intentions. We never felt comfortable going after them. Questioning someone's reputation in today's world often leads to bias against the accused even before the case is properly adjudicated. I would never do that to someone before first contacting them directly, nor would I want anyone to do something like that to me without a chance to explain if there was any dispute.

My suggestion is to contact the authors directly and ask for raw data. How they respond will provide some information about the next steps.


Thanks Mensur! I'd add that in 9 out of 10 cases, you won't receive a reply to a request for raw data. That's just how scientists are; it doesn't mean the data was faked. You can escalate things to the journal, that's up to you. Looking at Elisabeth Bik's experience with faked images in published papers, the journal will likely drag its feet for a long time.

12 days ago

This is a super-interesting question from an algorithmic standpoint (devising a model that can distinguish real from synthetic reads) but I think the way to proceed is to:

  • contact the authors directly
  • if they do not respond, collect some stats from MultiQC (for both the fastq and bam files, whatever is available) and post your concerns on pubpeer.


This is a super-interesting question from an algorithmic standpoint

Ya, I was hoping to find some algorithm that would compare say a 'reference fasta + expression profile' to the fastq files to check if there is anything suspicious from the forensics standpoint - like how the data follows a distribution, etc.


I don't think people have undertaken the effort to create an anomaly detector for RNAseq -- people's efforts are dedicated towards developing algorithms for "non-anomalous" data ;)


IMO (and unfortunately) there needs to be an effort to develop these algorithms.


i have the name ready: outliar


i've been in our org trying to push such a detector, our use-case is much more limited: I want to make simulated eDNA libraries as real as possible. We have a Naive Bayes implementation that is (unfortunately!) very good at distinguishing simulated from real eDNA libraries, F1 > 0.95. We are working on tricking our own classifier, I guess our use-case is the opposite and far more harmless :)


That's very interesting! What are the features in your classifier?


just k-mer counts (HashingVectorizer + Naive Bayes Classifier). I stayed away from looking at quality scores so far, simulators are usually awful at those and you can spot them yourself :)
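To make the idea concrete for readers, here is a dependency-free sketch of hashed k-mer counts fed into a multinomial Naive Bayes classifier. It is a stand-in for the scikit-learn HashingVectorizer + Naive Bayes combination mentioned above, not the actual implementation; the class name, bucket count, k, and smoothing value are all assumptions.

```python
import math
import zlib
from collections import Counter


def hashed_kmer_counts(seq, k=4, n_buckets=4096):
    """Count k-mers of seq, hashed into a fixed number of buckets."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        counts[zlib.crc32(seq[i:i + k].encode()) % n_buckets] += 1
    return counts


class HashedKmerNB:
    """Multinomial Naive Bayes over hashed k-mer counts (illustrative only)."""

    def __init__(self, k=4, n_buckets=4096, alpha=1.0):
        self.k, self.n_buckets, self.alpha = k, n_buckets, alpha

    def fit(self, seqs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = {c: Counter() for c in self.class_counts}
        for seq, label in zip(seqs, labels):
            self.feature_counts[label].update(
                hashed_kmer_counts(seq, self.k, self.n_buckets))
        self.totals = {c: sum(fc.values()) for c, fc in self.feature_counts.items()}
        return self

    def predict(self, seq):
        feats = hashed_kmer_counts(seq, self.k, self.n_buckets)
        n_train = sum(self.class_counts.values())

        def log_score(c):
            # Log prior plus Laplace-smoothed log likelihood of each bucket.
            denom = self.totals[c] + self.alpha * self.n_buckets
            score = math.log(self.class_counts[c] / n_train)
            for bucket, count in feats.items():
                score += count * math.log(
                    (self.feature_counts[c][bucket] + self.alpha) / denom)
            return score

        return max(self.class_counts, key=log_score)
```

In practice one would train on whole libraries or large read samples labelled "real" and "simulated" and evaluate on held-out libraries; the toy scale here only shows the mechanics.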

10 days ago
Prash ▴ 280

Mensur, if I were you, I'd probably contact the CA directly and check with him whether they have reproduced the works recently. There could be "mistakes" that they could accept and return the manuscript as a corrigendum or a letter to the editor.

As dsull and others suggested, Pubpeer is another option where the authors could respond.


Appreciate the suggestion, but that ship has sailed and reached the other shore. This happened 10+ years ago. To the best of my knowledge neither the original authors nor any other lab in the world ever published a follow-up study that reached even remotely similar conclusions. I think there is little doubt that the original conclusion was flawed, so it becomes a matter of intent. I think there was no ill intent, but rather a poorly executed experiment where the results were not properly scrutinized. We are not always perfect as authors, and neither are the reviewers. As to amending/retracting the paper, the authors know best whether that is appropriate.
