Invariant base quality calls
7.2 years ago
rbpdee ▴ 50

Has anyone ever seen a FastQC quality plot like this one? https://figshare.com/s/ce7c531bcace09a6096a

This plot was generated from an SRA sequence file submitted with a published study. I downloaded the SRA file from the NCBI SRA database and converted it to a FASTQ file with the fastq-dump utility, without any further sequence processing. Is it possible that the authors submitted processed data to NCBI?

RNA-Seq • 1.7k views

It looks incredibly likely that they preprocessed the data before upload.

Thanks for your reply! As far as I know, one should submit raw sequence data (coming directly from the sequencer) to NCBI, not processed data. Am I right? Do you have any information on this? Do I have to inform NCBI?

Yes, one should provide raw data to SRA, but this is far from the first case where someone didn't do that. I would suggest contacting whoever is listed as the submitter for the dataset first. Hopefully they still have the raw data...

Thanks! Please see my reply to Petr. Can this outcome be due to a difference in the sequencing platform? The GEO and SRA accession numbers are also given in my response.

7.2 years ago

I have never seen such a thing: a Phred score of 40 for all calls in all reads (this implies 1 error per 10,000 calls, or 99.99% correct calls). This is way too good! As far as I remember, Illumina promised 75% of calls above Phred 30 across their platforms. The best raw data I have ever seen had just above 90% (maybe 91 or 92%) of the calls above Phred 30.
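
For reference, the arithmetic behind those numbers can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and the second helper assumes Phred+33 quality encoding, which is the usual case for current Illumina data but not guaranteed for every file.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def fraction_at_or_above(quality_line: str, threshold: int = 30, offset: int = 33) -> float:
    """Fraction of calls in one FASTQ quality line with Phred >= threshold.

    Assumption: Phred+33 encoding (ASCII code minus 33 gives the score).
    """
    if not quality_line:
        return 0.0
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(s >= threshold for s in scores) / len(scores)
```

With this, Phred 40 corresponds to an error probability of 1 in 10,000 and Phred 30 to 1 in 1,000; a quality line made up entirely of the character `I` decodes to Phred 40 at every position under the Phred+33 assumption.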

If it was preprocessed or simulated, could you please tell us the purpose of the experiment and the study it was used for? Maybe it makes sense to analyze these reads that way.

My first guess would be that the data was generated/processed as FASTA, then given artificial quality scores. But my first guess is often an oversimplification.
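
One rough way to check this guess without a plot is to tally the quality values directly from the FASTQ. The sketch below assumes a plain four-line-per-record FASTQ and Phred+33 encoding; both function names are hypothetical, and a single distinct score only suggests that the qualities are artificial, it does not prove it.

```python
from collections import Counter
from typing import Iterable


def quality_score_counts(fastq_lines: Iterable[str], offset: int = 33) -> Counter:
    """Count Phred scores over all reads of a four-line-per-record FASTQ."""
    counts: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:  # every fourth line holds the quality string
            counts.update(ord(ch) - offset for ch in line.rstrip("\r\n"))
    return counts


def has_invariant_quality(fastq_lines: Iterable[str], offset: int = 33) -> bool:
    """True if every base call in the input carries the same Phred score."""
    return len(quality_score_counts(fastq_lines, offset)) == 1
```

Any iterable of lines works as input, so an open file handle for the fastq-dump output can be passed in directly. Real sequencer output normally shows a spread of scores rather than a single value.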

Hmmm... that seems possible!

Nope! This data was neither preprocessed nor simulated. I downloaded it from the NCBI GEO/SRA database. The published paper states: "Data accession: all the raw and processed data can be accessed under GSE86214 (https://www.ncbi.nlm.nih.gov/geo/).", so the data should be raw and should not resemble simulated data.

Here are a few SRA accession numbers that yield such plots: SRR5099289 - RNA immunoprecipitation followed by RNA-seq (sequenced on HiSeq 4000), SRR5099278 - regular RNA-seq (sequenced on HiSeq 4000), and SRR5099284 - regular RNA-seq (sequenced on HiSeq 4000).

SRR5099272 (sequenced on HiSeq 2000) belongs to the same BioProject, submitted by the same authors, and it does not produce such a plot.

I am not sure if this has to do with the Illumina sequencing platform.

Look at the read names; at least SRR5099289 was preprocessed.
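
If you want to scan read names in bulk, something like the following could help. It is only a heuristic sketch: it assumes that untouched Illumina read names contain colon-separated fields ending in four numeric ones (lane, tile, x, y), the function name is made up, and a name that fails the check is a hint rather than proof, since original names can also be dropped at submission time.

```python
import re

# Assumption: Illumina-style names end in four numeric colon-separated
# fields (lane:tile:x:y), optionally followed by a "#index/read" suffix.
_ILLUMINA_LIKE = re.compile(r"^[^:\s]+(?::[^:\s]+)*:\d+:\d+:\d+:\d+(?:[#/]\S*)?$")


def looks_like_illumina_name(header: str) -> bool:
    """True if any whitespace-separated token of a FASTQ header line
    resembles an Illumina instrument read name."""
    tokens = header.lstrip("@").split()
    return any(_ILLUMINA_LIKE.match(tok) for tok in tokens)
```

For example, a header whose only name is the SRA-assigned spot ID, such as `@SRR5099289.1 1 length=50` (an illustrative header, not copied from the actual run), would return `False`, while a header that still carries an instrument-style name would return `True`.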

Yes, you are right! I guess no other proof is needed. I will ask the sequence contributor for raw data.
