News: NCBI wants to remove quality scores from SRA
zhousun21 wrote (6 weeks ago):

NCBI has recently put out an RFI to get public feedback on removing quality scores from its SRA database. They propose this as a cost-saving measure due to the large amount of data that must be stored.

This will render SRA data useless for genome assembly, because low-quality regions can no longer be trimmed. If this will impact your research, please comment here: https://datascience.nih.gov/sra-rfi-submission

The more comments they get, the less likely they are to do this, so please pass this information on to your colleagues.

modified 6 weeks ago by genomax (87k) • written 6 weeks ago by zhousun21 (20)

There is a good post about quality scores as they apply to variant calling: https://lh3.github.io/2020/05/27/base-quality-scores-are-essential-to-short-read-variant-calling

written 6 weeks ago by igor (11k)

From blog link:

Using 2 quality bins (i.e. good/bad) gives a dramatic improvement over no-quality, though the result is not as good as 8-binning.

Two quality bins is the closest case to NCBI's proposal, which would keep only one bin.

modified 6 weeks ago • written 6 weeks ago by genomax (87k)
genomax (United States) wrote (6 weeks ago):

I am going to take the liberty of posting a section from the NCBI proposal below that describes the actual process of quality conversion, so people can read it right here.

While this may appear to be a drastic change, Illumina has been doing something similar for a while by producing binned Q-scores for larger datasets. In a way, NCBI is going to do part of the work for you: instead of having to check each base, you now use the reads that pass the filter as defined below.

While cloud storage may be cheaper it is still not free. Making users pay for the downloads while keeping original Q scores intact would lock out a large population of researchers across the world who simply won't be able to pay. So a solution that can still work reasonably well for NCBI is needed.

EBI/ENA and DDBJ may choose to go a different route and keep the original data available. Users can just go there in that case, like they can do now to get fastq files directly.


The BQS removal process removes quality scores from an SRA file. The process assesses overall read quality and sets a per-read quality flag. In the resulting files, all reads have a Read_Filter flag with value reject or pass.

Illumina fastq and Sam/Bam specifications support a quality bit that is set by the sequencing instrument. SRA format stores this as a pass/reject Read_Filter value. If this bit is set in the submitted fastq or bam file, the value will be retained. If it is not set, SRA will set a pass/fail value based on the quality score distribution.

Reads that have more than half of quality score values <20 will be flagged reject. Reads that begin or end with a run of more than 10 quality scores <20 are also flagged reject. When accessing or dumping data from SRA format using fastq-dump or fasterq-dump utilities in the SRA Toolkit, rejected reads are not used by default. There are options for including them:

fasterq-dump --read-filter <[pass|reject]>

It is still possible to produce FASTQ from ETL-BQS files using the SRA Toolkit. In this case, the FASTQ will have a constant quality score set to 30 for reads with Read_Filter value pass and 3 for reject reads.
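
The pass/reject rule quoted above can be sketched in a few lines of Python. This is only one reading of the quoted text, not NCBI's actual implementation: the function names `read_filter` and `constant_quality_string` are made up for illustration, and Phred+33 encoding is assumed for the FASTQ output.

```python
def read_filter(quals, threshold=20, max_run=10):
    """Return 'pass' or 'reject' for a list of Phred scores.

    Hypothetical helper that follows the rule as quoted from the proposal.
    """
    low = [q < threshold for q in quals]

    # Rule 1: more than half of the quality scores are below the threshold.
    if sum(low) > len(quals) / 2:
        return "reject"

    def leading_run(flags):
        # Count consecutive low-quality flags from the start of the sequence.
        n = 0
        for flag in flags:
            if not flag:
                break
            n += 1
        return n

    # Rule 2: the read begins or ends with a run of more than max_run low scores.
    if leading_run(low) > max_run or leading_run(reversed(low)) > max_run:
        return "reject"
    return "pass"


def constant_quality_string(length, flag):
    """Build the constant quality line: Q30 for pass, Q3 for reject.

    Assumes Phred+33 encoding (an assumption, not stated in the quoted text).
    """
    q = 30 if flag == "pass" else 3
    return chr(q + 33) * length
```

Under this reading, a read whose scores are all around Q12 fails the first rule outright, while a read with a run of exactly 10 low scores at one end still passes.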

modified 6 weeks ago • written 6 weeks ago by genomax (87k)

Illumina has been doing something similar for a while by producing binned Q-scores for larger datasets

Why not use binned values then? It seems like a decent compromise. 4 or 8 bins is a big improvement over none, but still compresses well.
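
As a rough illustration of what binning means here, a sketch in Python. The bin edges and representative values below are invented for the example; they are not Illumina's scheme or the one used in the blog post, and `bin_qualities` is a hypothetical helper.

```python
from bisect import bisect_right


def bin_qualities(quals, edges, representatives):
    """Map each Phred score to the representative value of its bin.

    edges: sorted lower bounds of every bin except the first.
    representatives: one value per bin, so len(edges) + 1 entries.
    """
    if len(representatives) != len(edges) + 1:
        raise ValueError("need exactly one representative per bin")
    return [representatives[bisect_right(edges, q)] for q in quals]


# Hypothetical schemes, chosen only to show the idea.
TWO_BINS = ([20], [10, 30])                   # bad / good
FOUR_BINS = ([10, 20, 30], [5, 15, 25, 35])   # four coarse levels

print(bin_qualities([2, 19, 20, 37], *TWO_BINS))
print(bin_qualities([2, 10, 29, 37], *FOUR_BINS))
```

Fewer distinct values means longer runs of identical characters in the quality line, which is why binned scores compress better than the originals.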

written 6 weeks ago by igor (11k)

And that is what may happen. I am sure NCBI does not want to spend umpteen compute cycles reprocessing data.

modified 6 weeks ago • written 6 weeks ago by genomax (87k)

Wow, the FASTQ from this, with everything set to 30 or 3, sounds ridiculously bad. It would also mark ALL long-read data, from both Nanopore and PacBio, as "reject" and give FASTQ output with quality values of 3. Why? Because long-read data is generally quality 8-15 at present. Great.

I didn't know NCBI was so biased towards Illumina.

Yet even with Illumina data, this step essentially makes all quality trimmers obsolete. Long live the EBI!

written 5 weeks ago by colindaven (2.3k)

We don't know what EBI is planning to do. We are hoping that they will not follow suit. This is currently only a proposal, so they are looking for comments. Be sure to add yours at NCBI's site.

Alternatively, we may need to go to a pay-for-use model (charging submitters or downloaders). That would be a worse choice for much of the world, compared to this proposal.

modified 5 weeks ago • written 5 weeks ago by genomax (87k)