Question

Explanation of NCBI's spot descriptor

9.1 years ago

lstbl ▴ 40

I'm very confused by NCBI's spot descriptor language, and I'm hoping someone here has some insight.

For example, if we look at this entry: http://www.ncbi.nlm.nih.gov/sra/SRX849019, the spot descriptor says forward 1 reverse 152. Does this mean that when you dump the file (using fastq-dump without any of the --split-* options), you get a FASTQ file in which each sequence is 300 bp long, with the forward read starting at position 1 and the reverse read starting at position 152? Does this sequence include adapters? (In most of my experience with the SRA database, the adapters have been pre-trimmed and the FASTQ files are analysis-ready.) According to this post, adapters, etc. should be indicated in the spot descriptor, but is that always the case? If not, what is your preferred method of figuring out whether the reads are untrimmed?
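A minimal sketch of the interpretation being asked about, assuming the descriptor's numbers are 1-based start positions within the concatenated spot and that each read runs up to the next read's start. `split_spot` is a hypothetical helper for illustration, not part of the SRA Toolkit, and the toy spot below is made up rather than taken from SRX849019.

```python
def split_spot(spot, starts):
    """Split a concatenated spot into reads using 1-based start positions.

    Assumption (not confirmed by NCBI documentation here): each read runs
    from its start position up to the next read's start, and the last read
    runs to the end of the spot.
    """
    starts = list(starts)
    if not starts or starts[0] < 1 or starts != sorted(starts):
        raise ValueError("starts must be ascending 1-based positions")
    # Convert to 0-based offsets and add the end of the spot as the final bound.
    bounds = [s - 1 for s in starts] + [len(spot)]
    return [spot[a:b] for a, b in zip(bounds, bounds[1:])]


# Toy spot: 151 bases of "A" followed by 151 bases of "C".
spot = "A" * 151 + "C" * 151
forward, reverse = split_spot(spot, [1, 152])
print(len(forward), len(reverse))
```

Under this reading, "forward 1 reverse 152" implies a 151-base forward read, so a full spot would be slightly longer than 300 bases.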

However, I have noticed that some of these files contain reads that are not paired (see ERR753090 for many examples: the file contains reads of both 150 and 75 bp, indicating that some reverse reads were not sequenced). In this situation, the --split-files option dumps files with unequal numbers of reads, which are incompatible with bwa mem, since that program assumes that the i-th read in file 1 is paired with the i-th read in file 2. The --split-3 option seems to eliminate this problem by dumping the unpaired reads into a separate file, but it seems like NCBI is phasing this out (which seems like a terrible idea to me).
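The three-way split described above can be sketched as follows. This is only an illustration of the behaviour as described in this post, not the toolkit's actual implementation; `split_three` and the in-memory spot tuples are hypothetical, and the sequences are toy data.

```python
def split_three(spots):
    """Separate spots into two mate lists plus a list of unpaired reads.

    `spots` is an iterable of (name, reads) tuples, where `reads` is a list
    of sequences and an empty string stands for a read that was not sequenced.
    """
    mates1, mates2, unpaired = [], [], []
    for name, reads in spots:
        present = [r for r in reads if r]  # ignore missing reads
        if len(present) == 2:
            # Both mates exist: keep them at the same index in both lists.
            mates1.append((name, present[0]))
            mates2.append((name, present[1]))
        elif len(present) == 1:
            # Only one mate exists: route it to the separate unpaired list.
            unpaired.append((name, present[0]))
    return mates1, mates2, unpaired


spots = [
    ("spot1", ["A" * 75, "C" * 75]),
    ("spot2", ["G" * 75, ""]),  # reverse read missing
    ("spot3", ["T" * 75, "A" * 75]),
]
mates1, mates2, unpaired = split_three(spots)
print(len(mates1), len(mates2), len(unpaired))
```

Because the two mate lists always grow together, the i-th entry of one is the mate of the i-th entry of the other, which is the property bwa mem relies on.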

sra ncbi spot descriptor • 3.3k views
