Question

News:NCBI wants to remove quality scores from SRA

2

4.1 years ago

zhousun21 ▴ 40

NCBI has recently put out an RFI to get public feedback on removing quality scores from its SRA database. They propose this as a cost-saving measure due to the large amount of data that must be stored.

This would make SRA data far less useful for genome assembly, because low-quality regions could no longer be trimmed. If this will impact your research, please comment here: https://datascience.nih.gov/sra-rfi-submission

The more comments they get, the less likely they are to do this, so please pass this information on to colleagues.

quality-scores trimming Assembly genome SRA • 2.4k views

ADD COMMENT • link updated 13 months ago by Ram 44k • written 4.1 years ago by zhousun21 ▴ 40

0

There is a good post about quality scores as they apply to variant calling: https://lh3.github.io/2020/05/27/base-quality-scores-are-essential-to-short-read-variant-calling

ADD REPLY • link 4.1 years ago by igor 13k

1

From blog link:

Using 2 quality bins (i.e. good/bad) gives a dramatic improvement over no-quality, though the result is not as good as 8-binning.

Of the schemes tested, 2 quality bins is the closest to NCBI's proposal, which keeps only 1 bin.
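To make "2 quality bins" concrete, here is a minimal sketch of collapsing per-base Phred scores into good/bad. The threshold of 20 and the representative values 10 and 30 are illustrative assumptions, not the values used in the blog post, and `bin_two_levels` is a hypothetical helper name.

```python
def bin_two_levels(quals, threshold=20, low=10, high=30):
    """Collapse per-base Phred scores into two bins.

    threshold, low and high are assumed values chosen for illustration only.
    """
    return [high if q >= threshold else low for q in quals]


if __name__ == "__main__":
    # Scores below the threshold map to `low`, the rest to `high`.
    print(bin_two_levels([2, 19, 20, 37]))
```

Note that this still keeps one value per base, whereas the NCBI proposal keeps a single flag per read.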

ADD REPLY • link 4.1 years ago by GenoMax 144k

score 3 · Answer 1 · 2020-07-03

I am going to take the liberty of posting a section from the NCBI proposal below that describes the actual process of quality conversion, so people can read it right here.

While this may appear to be a drastic change, Illumina has been doing something similar for a while by producing binned Q-scores for larger datasets. In a way, NCBI is going to do part of the work for you: instead of having to check each base, you use the reads that pass the filter as defined below.

While cloud storage may be cheaper, it is still not free. Making users pay for downloads while keeping the original Q scores intact would lock out a large population of researchers across the world who simply won't be able to pay. So a solution that can still work reasonably well for NCBI is needed.

EBI/ENA and DDBJ may choose to go a different route and keep the original data available. Users can just go there in that case, like they can do now to get fastq files directly.

The BQS removal process removes quality scores from an SRA file. The process assesses overall read quality and sets a per-read quality flag. In the resulting files, all reads have a Read_Filter flag with value reject or pass.

Illumina fastq and Sam/Bam specifications support a quality bit that is set by the sequencing instrument. SRA format stores this as a pass/reject Read_Filter value. If this bit is set in the submitted fastq or bam file, the value will be retained. If it is not set, SRA will set a pass/fail value based on the quality score distribution.

Reads that have more than half of quality score values <20 will be flagged reject. Reads that begin or end with a run of more than 10 quality scores <20 are also flagged reject. When accessing or dumping data from SRA format using fastq-dump or fasterq-dump utilities in the SRA Toolkit, rejected reads are not used by default. There are options for including them:

fasterq-dump --read-filter <[pass|reject]>
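The pass/reject rule quoted above can be sketched in a few lines of Python. This is only my reading of the proposal text, not NCBI's actual implementation. The function names are hypothetical, and the input is assumed to be a list of already-decoded integer Phred scores for one read.

```python
def leading_low_run(quals, threshold):
    """Count consecutive scores below threshold at the start of quals."""
    n = 0
    for q in quals:
        if q >= threshold:
            break
        n += 1
    return n


def read_filter(quals, threshold=20, max_run=10):
    """Return 'pass' or 'reject' for one read, following the quoted rule.

    Assumption: quals is a list of integer Phred scores.
    """
    # Rule 1: more than half of the scores are below the threshold.
    low = sum(1 for q in quals if q < threshold)
    if low > len(quals) / 2:
        return "reject"
    # Rule 2: the read begins or ends with a run of more than
    # max_run scores below the threshold.
    if leading_low_run(quals, threshold) > max_run:
        return "reject"
    if leading_low_run(list(reversed(quals)), threshold) > max_run:
        return "reject"
    return "pass"


if __name__ == "__main__":
    print(read_filter([30] * 100))              # uniformly high quality
    print(read_filter([10] * 11 + [30] * 89))   # 11 low scores at the start
```

Note that a run of exactly 10 low scores at either end still passes, since the text says "more than 10".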

It is still possible to produce FASTQ from ETL-BQS files using the SRA Toolkit. In this case, the FASTQ will have a constant quality score set to 30 for reads with Read_Filter value pass and 3 for reject reads.
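As a rough illustration of what such a record would look like, here is a small sketch that builds one from a sequence and its filter value. It assumes the usual Phred+33 encoding, so Q30 is written as `?` and Q3 as `$`. The function name is hypothetical and this is not the SRA Toolkit's own code.

```python
def constant_quality_fastq(name, seq, read_filter_value):
    """Build one FASTQ record with a constant quality string.

    Assumptions: Phred+33 encoding; Q30 for 'pass', Q3 for anything else.
    """
    q = 30 if read_filter_value == "pass" else 3
    qual = chr(q + 33) * len(seq)
    return f"@{name}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    print(constant_quality_fastq("read1", "ACGT", "pass"), end="")
    print(constant_quality_fastq("read2", "ACGT", "reject"), end="")
```

Any trimmer that works on per-base quality will see these reads as all good or all bad, so trimming low-quality ends is no longer possible with such files.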