Question: Invariant base quality calls
rbpdee (United States) wrote, 2.2 years ago:

Has anyone ever seen a FastQC base quality plot like this one? https://figshare.com/s/ce7c531bcace09a6096a

This plot was generated from an SRA sequence file submitted with a published study. I downloaded the SRA file from the NCBI SRA database and converted it to a FASTQ file with the fastq-dump utility, without any further sequence processing. Is it possible that the authors submitted processed data to NCBI?

Tags: rna-seq
Written 2.2 years ago by rbpdee; modified 2.2 years ago by Petr Ponomarenko

It looks incredibly likely that they preprocessed the data before upload.

Reply written 2.2 years ago by Devon Ryan

Thanks for your reply! As far as I know, one should submit raw sequence data (coming directly from the sequencer) to NCBI, not processed data. Am I right? Do you have any information on this? Do I have to inform NCBI?

Reply written 2.2 years ago by rbpdee

Yes, one should provide raw data to SRA, but this is far from the first case where someone didn't do that. I would suggest contacting whoever is listed as the submitter for the dataset first. Hopefully they still have the raw data...

Reply written 2.2 years ago by Devon Ryan

Thanks! Please see my reply to Petr. Could this outcome be due to a difference in the sequencing platform? The GEO and SRA accession numbers are also given in that reply.

Reply written 2.2 years ago by rbpdee (modified 2.2 years ago)

Petr Ponomarenko (United States / Los Angeles / ALAPY.com) wrote, 2.2 years ago:

I have never seen such a thing: a Phred score of 40 for every call in every read, which implies 1 error per 10,000 calls, or 99.99% accuracy per call. This is way too good! As far as I remember, Illumina promised that 75% of calls would be above Phred 30 across their platforms. The best raw data I have ever seen had just over 90% (maybe 91% or 92%) of calls above Phred 30.
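For reference, the arithmetic behind those numbers is the standard Phred definition, Q = -10 * log10(P), where P is the probability that a base call is wrong. The snippet below is only a small sketch of that conversion; the function names are made up for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


# Compare the thresholds mentioned in this thread.
for q in (20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: about 1 error in {round(1 / p)} calls, "
          f"accuracy {100 * (1 - p):.2f}%")
```

So Q30 corresponds to roughly 1 error in 1,000 calls and Q40 to 1 in 10,000, which is why a flat Q40 across every position of every read looks suspicious for raw data.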

If it was preprocessed or simulated, could you please tell us the purpose of the experiment and the study it was used for? Maybe it makes sense to analyze these reads that way.

Written 2.2 years ago by Petr Ponomarenko (modified 2.2 years ago)

My first guess would be that the data was generated or processed as FASTA and then given artificial quality scores. But my first guess is often an oversimplification.
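One quick way to test that guess is to count the distinct quality characters in the FASTQ file. What follows is a rough sketch, not a validated tool: it assumes plain 4-line FASTQ records and Phred+33 encoding, and the function names and the two demo reads are invented for illustration.

```python
import io
from collections import Counter
from itertools import islice


def quality_char_counts(handle, max_reads=100_000):
    """Count quality characters in the first max_reads records.

    Assumes strict 4-line FASTQ records (no wrapped sequence lines).
    """
    counts = Counter()
    # Lines 4, 8, 12, ... of the stream are the quality strings.
    quality_lines = islice(handle, 3, None, 4)
    for qual in islice(quality_lines, max_reads):
        counts.update(qual.rstrip("\n"))
    return counts


def to_phred_scores(counts, offset=33):
    """Map quality characters to Phred scores (Phred+33 is an assumption)."""
    return {ord(ch) - offset: n for ch, n in sorted(counts.items())}


# Made-up demo records; 'I' encodes Q40 under Phred+33.
demo = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
)
scores = to_phred_scores(quality_char_counts(demo))
print(scores)
```

For a real file, pass an open file handle instead of the `StringIO` object (for gzipped files, `gzip.open(path, "rt")`). If the resulting dictionary has a single key, every base in the sampled reads carries the same score, which would be consistent with quality values that were assigned rather than measured.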

Reply written 2.2 years ago by WouterDeCoster

Hmmm... that seems possible!

Reply written 2.2 years ago by rbpdee

Nope! This data was neither preprocessed nor simulated. I downloaded this data from NCBI-GEO/SRA database. From the published paper, which reads "Data accession: all the raw and processed data can be accessed under GSE86214 (https://www.ncbi.nlm.nih.gov/geo/).", the data should be raw, and it should not resemble some simulated data.

Here are a few SRA accession numbers that yield such plots: SRR5099289 - RNA immunoprecipitation followed by RNA-seq (sequenced on HiSeq 4000), SRR5099278 - regular RNA-seq (sequenced on HiSeq 4000), and SRR5099284 - regular RNA-seq (sequenced on HiSeq 4000).

SRR5099272 (sequenced on HiSeq 2000) belongs to the same Bioproject, which the authors submitted, and it does not produce such a plot.

I am not sure if this has to do with the Illumina sequencing platform.

Reply written 2.2 years ago by rbpdee (modified 2.2 years ago)

Look at the read names: at least SRR5099289 was preprocessed.
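A sketch of how one might inspect read names programmatically is below. It checks whether any token in a FASTQ header has the seven colon-separated fields of the newer Illumina naming scheme (instrument:run:flowcell:lane:tile:x:y). The regular expression, the function names, and the demo records (including the `EXAMPLE` accession) are illustrative assumptions, not the actual names found in SRR5099289.

```python
import io
import re
from itertools import islice

# Seven colon-separated fields: instrument:run:flowcell:lane:tile:x:y.
# Older Illumina naming schemes use a different layout and will not match.
ILLUMINA_NAME = re.compile(r"^[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+$")


def read_names(handle, max_reads=5):
    """Return the header lines (without '@') of the first max_reads records."""
    headers = islice(handle, 0, None, 4)  # lines 1, 5, 9, ...
    return [h.rstrip("\n").lstrip("@") for h in islice(headers, max_reads)]


def looks_like_illumina(name):
    """True if any whitespace-separated token matches the assumed pattern."""
    return any(ILLUMINA_NAME.match(tok) for tok in name.split())


# Made-up demo records, not real data from this study.
demo = io.StringIO(
    "@EXAMPLE.1 M01234:12:000000000-ABCDE:1:1101:15589:1332 length=4\n"
    "ACGT\n+\nIIII\n"
    "@EXAMPLE.2 2 length=4\nTTGA\n+\nIIII\n"
)
names = read_names(demo)
flags = [looks_like_illumina(n) for n in names]
for flag, name in zip(flags, names):
    print(flag, name)
```

A name that fails this check is not proof of preprocessing on its own, since original names can be lost at submission for other reasons, but unusual or tool-specific read names are a useful hint alongside the flat quality scores.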

Reply written 2.2 years ago by Devon Ryan (modified 2.2 years ago)

Yes, you are right! I guess no other proof is needed. I will ask the sequence contributor for raw data.

Reply written 2.2 years ago by rbpdee