Question

SRA file looks weird after conversion to fastq

9.2 years ago

Saad Khan ▴ 440

Hi, I am trying to use small RNA data from CD34 bone marrow cells to compare with a private dataset that I have.

I just downloaded it from SRA (SRA ID: SRR772115) and converted it using fastq-dump, but the result doesn't look like a typical FASTQ file: each read in a FASTQ file should be represented by 4 lines, which doesn't seem to be the case here. Here is what the converted FASTQ file looks like:

@SRR772115.1 FCC0B8BACXX:8:1101:1416:2034 length=49
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGAGTGGATCTCGTATG
+SRR772115.1 FCC0B8BACXX:8:1101:1416:2034 length=49
bbbeeeeeggggghhiiiiihiiiiiiiiggfhicfghihhiiihhhii
@SRR772115.2 FCC0B8BACXX:8:1101:1317:2047 length=49
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGAGTGGATCTCGTATG
+SRR772115.2 FCC0B8BACXX:8:1101:1317:2047 length=49
bbbeeeeegggggghiiiiiiiiiiiiiiihiiicgghhghhiiiihih
@SRR772115.3 FCC0B8BACXX:8:1101:1437:2047 length=49
TATGGTCGCAAGGCTGAAACTTAAAGAAATTGATGGAATTCTCGGGTGC

Can anybody tell me if I am missing something here, and how to get the FASTQ in the proper format?

regards

fastq SRA fastq-dump • 2.2k views

updated 2.0 years ago by Ram 43k • written 9.2 years ago by Saad Khan ▴ 440

score 0 · Answer 1 · 2015-02-04

It does look like well-formatted FASTQ, albeit with a somewhat unusual quality encoding. You can use this script to find the encoding: https://github.com/brentp/bio-playground/blob/master/reads-utils/guess-encoding.py

@SRR772115.1 FCC0B8BACXX:8:1101:1416:2034 length=49 #ID
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGAGTGGATCTCGTATG #Seq
+SRR772115.1 FCC0B8BACXX:8:1101:1416:2034 length=49 #ID
bbbeeeeeggggghhiiiiihiiiiiiiiggfhicfghihhiiihhhii #Qual

@SRR772115.2 FCC0B8BACXX:8:1101:1317:2047 length=49
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGAGTGGATCTCGTATG
+SRR772115.2 FCC0B8BACXX:8:1101:1317:2047 length=49
bbbeeeeegggggghiiiiiiiiiiiiiiihiiicgghhghhiiiihih

@SRR772115.3 FCC0B8BACXX:8:1101:1437:2047 length=49
TATGGTCGCAAGGCTGAAACTTAAAGAAATTGATGGAATTCTCGGGTGC
...
...
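
To illustrate the idea behind an encoding check, here is a minimal Python sketch. It is not the linked script, and the function name and thresholds are my own assumptions: it only looks at the lowest and highest quality characters and guesses between a Phred offset of 33 and 64.

```python
def guess_phred_offset(quality_strings):
    """Guess the Phred offset (33 or 64) from the range of quality characters.

    Heuristic (an assumption, not a formal rule):
      - any character below ';' (ASCII 59) can only occur with offset 33;
      - any character above 'J' (ASCII 74) suggests offset 64;
      - otherwise the sample is too small or too narrow to decide.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        return 33
    if highest > 74:
        return 64
    return None  # ambiguous


# Quality lines of the first two reads shown above
quals = [
    "bbbeeeeeggggghhiiiiihiiiiiiiiggfhicfghihhiiihhhii",
    "bbbeeeeegggggghiiiiiiiiiiiiiiihiiicgghhghhiiiihih",
]
print(guess_phred_offset(quals))  # prints 64
```

An offset of 64 would mean that 'b' corresponds to a quality of 34 and 'i' to 41. On only a handful of reads this is just a guess, so run it on a larger sample before relying on it.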

Ram · Answer 2 · 2015-02-04

What do you think is wrong with it? These reads look fine, and FastQC agrees:

##FastQC        0.10.1
>>Basic Statistics      pass
#Measure        Value
Filename        temp.fq
File type       Conventional base calls
Encoding        Illumina 1.5
Total Sequences 2
Filtered Sequences      0
Sequence length 49
%GC     53

It repeats the read name on the separator line (+) as well as on the header line (@), but that's fine.
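
If you want to convince yourself that the file really follows the four-line layout, a rough Python sketch like the one below can help. The function name and the toy record are made up for illustration, and it assumes unwrapped records (one sequence line and one quality line per read). It accepts a "+" line that is either bare or repeats the read name.

```python
def check_fastq_records(lines):
    """Return the number of records; raise ValueError on a malformed record.

    Assumes unwrapped FASTQ: header, sequence, separator, quality.
    Blank lines are ignored.
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"record {i // 4 + 1}: header does not start with '@'")
        if not sep.startswith("+"):
            raise ValueError(f"record {i // 4 + 1}: third line does not start with '+'")
        # The '+' line may be bare or may repeat the header text
        if sep != "+" and sep[1:] != header[1:]:
            raise ValueError(f"record {i // 4 + 1}: '+' line does not match header")
        if len(seq) != len(qual):
            raise ValueError(f"record {i // 4 + 1}: sequence/quality length mismatch")
    return len(lines) // 4


# Toy record (hypothetical, not from SRR772115)
toy = ["@read1 length=4", "ACGT", "+read1 length=4", "hhhh"]
print(check_fastq_records(toy))
```

To check a real file, pass an open file handle instead of the list, e.g. `check_fastq_records(open("SRR772115.fastq"))`, where the file name is whatever fastq-dump wrote for you.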