Explanation of NCBI's spot descriptor
7.9 years ago
lstbl ▴ 40

I'm very confused by NCBI's spot descriptor language, and I'm hoping someone on here has some insight.

For example, take this entry: http://www.ncbi.nlm.nih.gov/sra/SRX849019. Its spot descriptor says forward 1 reverse 152. Does this mean that when you dump the file (using fastq-dump without --split-reads) you get a FASTQ file of sequences that are 300 bp long, in which the forward read starts at position 1 and the reverse read starts at position 152? Do these sequences include adapters? (In most of my experience with the SRA database, the adapters have been pre-trimmed and the FASTQ files are analysis ready.) According to this post, adapters and similar elements should be indicated in the spot descriptor, but is that always the case? If not, what is your preferred method of figuring out whether the reads are untrimmed?
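To make the interpretation I am asking about concrete, here is a minimal Python sketch. It assumes the descriptor numbers are 1-based start coordinates within the concatenated spot, which is exactly the point I would like confirmed. The function name `split_spot` and the toy sequence are hypothetical, not anything from the SRA toolkit.

```python
def split_spot(spot, read_starts):
    """Split a concatenated spot into reads using 1-based start coordinates.

    Assumption: each number in the spot descriptor is the 1-based position
    at which that read begins inside the spot.
    """
    # Convert 1-based starts to 0-based offsets; the spot length closes the last read.
    offsets = [start - 1 for start in read_starts] + [len(spot)]
    return [spot[offsets[i]:offsets[i + 1]] for i in range(len(read_starts))]


# Toy 300 bp spot: the forward read is all "A", the reverse read is all "C".
spot = "A" * 151 + "C" * 149
forward, reverse = split_spot(spot, [1, 152])
print(len(forward), len(reverse))  # 151 149
```

Under this reading, "forward 1 reverse 152" would give a 151 bp forward read and, for a 300 bp spot, a 149 bp reverse read.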

However, I have noticed that some of these files contain reads that are not paired (see ERR753090 for many examples: the file contains reads of both 150 and 75 bp, indicating that some reverse reads were not sequenced). In this situation, the --split-reads option dumps files of unequal length, which are incompatible with bwa mem, since that program assumes that the i-th read in file1 is paired with the i-th read in file2. The --split-3 option seems to eliminate this problem by dumping the unpaired reads into a separate file, but it seems like NCBI is phasing this out (which seems like a terrible idea to me).
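As a rough illustration of the separation I have in mind, the sketch below partitions spots from an unsplit dump into mate pairs and orphans. It assumes that a spot shorter than the reverse start position has no reverse read, and that the reverse read starts at position 76 for 150 bp spots. Both are my assumptions for illustration, and `partition_spots` is a hypothetical helper, not part of fastq-dump.

```python
def partition_spots(spots, reverse_start):
    """Separate spots with both mates from spots with only a forward read.

    spots: iterable of (name, sequence) tuples from an unsplit dump.
    reverse_start: assumed 1-based position where the reverse read begins.
    """
    forward, reverse, orphans = [], [], []
    for name, seq in spots:
        if len(seq) >= reverse_start:
            # Both mates present: cut the spot at the reverse start.
            forward.append((name, seq[:reverse_start - 1]))
            reverse.append((name, seq[reverse_start - 1:]))
        else:
            # Too short to contain a reverse read: keep it out of the paired lists.
            orphans.append((name, seq))
    return forward, reverse, orphans


# Toy data: two 150 bp spots and one 75 bp spot with no reverse read.
spots = [
    ("s1", "A" * 75 + "C" * 75),
    ("s2", "G" * 75),
    ("s3", "T" * 75 + "A" * 75),
]
fwd, rev, orphans = partition_spots(spots, reverse_start=76)
print(len(fwd), len(rev), len(orphans))  # 2 2 1
```

The forward and reverse lists stay the same length and in the same order, which is the property bwa mem needs, while the orphans end up in their own list.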

sra ncbi spot descriptor • 3.0k views