prefetch and fastq-dump problems?
14 months ago
debitboro ▴ 220

Dear all,

I want to download some SRR files and then convert them to FASTQ files. For that, I used the following SRA Toolkit commands:

prefetch SRR3159525
fastq-dump SRR3159525.sra


The download completed successfully, and the size of the resulting FASTQ file seems correct (~8 GB).

But when I checked the contents of the FASTQ file, I found it was strangely formatted, as follows:

@SRR3159522.1 2_33_78 length=50
T..................................................G
+SRR3159522.1 2_33_78 length=50
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@SRR3159522.2 2_36_51 length=50
T..................................................G
+SRR3159522.2 2_36_51 length=50
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@SRR3159522.3 2_39_77 length=50
T30.0..2.0.....2.2..2.0..0......0....1...220.2.3322G
+SRR3159522.3 2_39_77 length=50
!(*!%!!(!%!!!!!%!%!!%!&!!%!!!!!!*!!!!&!!!%%*!%!&%'%!
@SRR3159522.4 2_39_134 length=50
T01.0..0.1.....2.0..2.2..2......1....1...231.0.3312G
+SRR3159522.4 2_39_134 length=50
!1&!(!!&!.!!!!!%!(!!%!%!!)!!!!!!)!!!!%!!!%%(!%!/)%%!
...
...


As you can see, the read sequences contain integers delimited by a leading T and a trailing G. Is this expected?

0
Entering edit mode

There is something weird with this submission. The initial reads look odd, as you posted, while some of the later ones look like this:

>gnl|SRA|SRR3159525.999995.1 60_1827_793 F3 (Biological)
ACGCATGCCTGCTGTAGTCAATTAAGTACACAAACTGACATCCANNNNNN
>gnl|SRA|SRR3159525.999995.2 60_1827_793 (Biological)


Looks like Read 1 (50 bp) is somewhat OK, while Read 2 (35 bp) is empty :-(

Contact SRA support to see if they have anything to say.

14 months ago
debitboro ▴ 220

After some googling, I found Biostars posts such as "Transforming And Manipulating Color Space Reads" that discuss the color-space representation of reads produced by some sequencing instruments, such as ABI SOLiD. On such a system, the reads consist of integers representing colors, where each color encodes the transition between two adjacent bases. An encoding table can then be used to convert the integers to DNA bases, starting from the known leading base.

Please refer to this excellent post, which explains the system in more detail: Transforming And Manipulating Color Space Reads
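
To make the encoding-table idea concrete, here is a minimal Python sketch of how a color-space string could be decoded into bases. It assumes the standard SOLiD two-base encoding (with A, C, G, T numbered 0 to 3, the color is the XOR of two adjacent bases) and a read that consists of one leading primer base followed by color calls. The function name `decode_colorspace` and the example strings are made up for illustration and are not taken from the SRA record above.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; color = XOR of two adjacent base codes


def decode_colorspace(read: str) -> str:
    """Decode a color-space read (primer base + color calls) into bases.

    The primer base itself is not part of the decoded sequence.
    A missing color call ('.') makes every following base unknown ('N'),
    because each base depends on the one decoded before it.
    """
    prev = BASES.index(read[0].upper())  # code of the known primer base
    decoded = []
    for color in read[1:]:
        if prev is None or color not in "0123":
            prev = None          # chain is broken from here on
            decoded.append("N")
            continue
        prev ^= int(color)       # next base = previous base XOR color
        decoded.append(BASES[prev])
    return "".join(decoded)


print(decode_colorspace("T0123"))  # TGAT
print(decode_colorspace("T30.0"))  # AANN
```

Note that this kind of naive decoding propagates a single wrong or missing color call to the rest of the read, which is why color-space data is usually aligned in color space by a tool that supports it rather than converted up front.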

0
Entering edit mode

Indeed. SOLiD datasets are so rare that this was easy to miss. SOLiD data is likely not worth the hassle, since probably only one or two aligners (older versions) support it.