Question

Invariant base quality calls

1

7.2 years ago

rbpdee ▴ 50

Has anyone ever seen a FastQC quality plot like this one? -- https://figshare.com/s/ce7c531bcace09a6096a

This plot was generated from an SRA sequence file submitted with a published study. The SRA file was downloaded from the NCBI-SRA database and converted to a FASTQ file with the fastq-dump utility, without any further sequence processing. Is it possible that the authors submitted processed data to NCBI?

RNA-Seq • 1.7k views

ADD COMMENT • link updated 7.2 years ago by Petr Ponomarenko ★ 2.8k • written 7.2 years ago by rbpdee ▴ 50

0

It looks incredibly likely that they preprocessed the data before upload.

ADD REPLY • link 7.2 years ago by Devon Ryan 104k

0

Thanks for your reply! As far as I know, one should submit raw sequence data (coming directly from the sequencer), not processed data, to NCBI. Am I right? Do you have any information on this? Do I have to inform NCBI?

ADD REPLY • link 7.2 years ago by rbpdee ▴ 50

0

Yes, one should provide raw data to SRA, but this is far from the first case where someone didn't do that. I would suggest contacting whoever is listed as the submitter for the dataset first. Hopefully they still have the raw data...

ADD REPLY • link 7.2 years ago by Devon Ryan 104k

0

Thanks! Please see my reply to Petr. Could this outcome be due to a difference in the sequencing platform? The GEO and SRA accession numbers are also given in my response.

ADD REPLY • link 7.2 years ago by rbpdee ▴ 50

score 0 · Answer 1 · 2017-02-16

0

7.2 years ago

Petr Ponomarenko ★ 2.8k

I have never seen such a thing: a Phred score of 40 for all calls in all reads (this implies 1 error per 10,000 calls, or 99.99% correct calls). This is way too good! As far as I remember, Illumina promised that 75% of calls would be above Phred 30 across their platforms. The best raw data I have ever seen had just over 90% (maybe 91 or 92%) of calls above Phred 30.
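For anyone who wants to check the arithmetic, here is a minimal Python sketch of the standard Phred relation, P(error) = 10^(-Q/10). The function names are made up for illustration, and the quality string is assumed to use Phred+33 encoding (Sanger / Illumina 1.8+), which may not hold for older data.

```python
def phred_to_error_probability(q: float) -> float:
    """Return the probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def fraction_at_or_above(qualities: str, threshold: int = 30, offset: int = 33) -> float:
    """Fraction of calls in one FASTQ quality string with Phred >= threshold.

    Assumes Phred+33 encoding unless a different offset is passed in.
    """
    scores = [ord(ch) - offset for ch in qualities]
    return sum(s >= threshold for s in scores) / len(scores)


if __name__ == "__main__":
    for q in (20, 30, 40):
        p = phred_to_error_probability(q)
        print(f"Q{q}: error probability {p:.4%}, accuracy {1 - p:.4%}")
    # Under Phred+33, 'I' encodes Q40 and '#' encodes Q2.
    print(fraction_at_or_above("IIII#"))
```

A quality line made up entirely of `I` characters would therefore correspond to Q40 at every position, which is the pattern in the plot.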

If it was preprocessed or simulated, could you please tell us the purpose of the experiment and the study it was used for? Maybe it makes sense to analyze these reads that way.

ADD COMMENT • link 7.2 years ago by Petr Ponomarenko ★ 2.8k

0

My first guess would be that the data was generated/processed as FASTA, then given artificial quality scores. But my first guess is often just oversimplifying stuff.
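A quick way to test this idea without FastQC is to count the distinct quality characters in the file. This is only a sketch: the function names are hypothetical, and it assumes a plain four-line-per-record FASTQ (no wrapped sequence lines). The records in the demo are made up.

```python
import io
from collections import Counter


def quality_char_counts(fastq_handle, max_reads=10000):
    """Count quality characters over the first max_reads records.

    Assumes exactly four lines per record.
    """
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i // 4 >= max_reads:
            break
        if i % 4 == 3:  # fourth line of each record holds the qualities
            counts.update(line.rstrip("\r\n"))
    return counts


def has_invariant_quality(fastq_handle, max_reads=10000):
    """True if every sampled base call carries the same quality character."""
    return len(quality_char_counts(fastq_handle, max_reads)) == 1


if __name__ == "__main__":
    # Made-up records: constant qualities in the first, varying in the second.
    flat = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n"
    varied = "@r1\nACGT\n+\nIIF#\n@r2\nGGCA\n+\nII<I\n"
    print(has_invariant_quality(io.StringIO(flat)))
    print(has_invariant_quality(io.StringIO(varied)))
```

With a real file you would pass an open file handle instead of the `io.StringIO` objects. A single distinct character across thousands of reads would be consistent with artificially assigned scores, though it does not by itself say how they got there.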

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Hmmm... that seems possible!

ADD REPLY • link 7.2 years ago by rbpdee ▴ 50

0

Nope! This data was neither preprocessed nor simulated. I downloaded it from the NCBI-GEO/SRA database. The published paper reads "Data accession: all the raw and processed data can be accessed under GSE86214 (https://www.ncbi.nlm.nih.gov/geo/).", so the data should be raw and should not resemble simulated data.

Here are a few SRA accession numbers that yield such plots: SRR5099289 - RNA Immunoprecipitation followed by RNAseq (sequenced on HiSeq 4000), SRR5099278 - regular RNAseq (sequenced on HiSeq 4000), and SRR5099284 - regular RNAseq (sequenced on HiSeq 4000).

SRR5099272 (sequenced on HiSeq 2000) belongs to the same Bioproject, which the authors submitted, and it does not produce such a plot.

I am not sure if this has to do with the Illumina sequencing platform.

ADD REPLY • link 7.2 years ago by rbpdee ▴ 50

1

Look at the read names; at least SRR5099289 was preprocessed.
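One way to do that is to print the first few header lines and see whether they still carry instrument-style names. The sketch below is only illustrative: the thread does not say what in the names gave it away, so the pattern (at least five colon-separated fields, as in instrument:run:flowcell:lane:tile:x:y) is an assumption, the function names are hypothetical, and the example headers use a placeholder accession rather than real data.

```python
import io
import re

# Assumed heuristic: an Illumina-style name has at least five
# colon-separated fields. Not a rule stated anywhere in this thread.
ILLUMINA_LIKE = re.compile(r"[^:\s]+(?::[^:\s]+){4,}")


def first_read_names(fastq_handle, limit=5):
    """Return the header lines (without '@') of the first `limit` records."""
    names = []
    for i, line in enumerate(fastq_handle):
        if len(names) >= limit:
            break
        if i % 4 == 0:  # first line of each four-line record is the header
            names.append(line.rstrip("\r\n").lstrip("@"))
    return names


def looks_like_illumina_name(header):
    """True if any whitespace-separated token matches the assumed pattern."""
    return any(ILLUMINA_LIKE.fullmatch(tok) for tok in header.split())


if __name__ == "__main__":
    # Made-up headers with a placeholder accession, for illustration only.
    fastq = (
        "@SRR0000000.1 HWI-ST1234:100:C0ABCACXX:1:1101:1234:5678 length=4\n"
        "ACGT\n+\nIIII\n"
        "@SRR0000000.2 2 length=4\n"
        "GGCA\n+\nIIII\n"
    )
    for name in first_read_names(io.StringIO(fastq)):
        print(looks_like_illumina_name(name), name)
```

Missing or renumbered names are only a hint, since submitters and converters can rename reads for other reasons, so it is worth looking at the headers by eye as well.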

ADD REPLY • link 7.2 years ago by Devon Ryan 104k

0

Yes, you are right! I guess no other proof is needed. I will ask the sequence contributor for raw data.

ADD REPLY • link 7.2 years ago by rbpdee ▴ 50