Question

Weird Fastq Sequences

12.3 years ago

Bioscientist ★ 1.7k

I downloaded some FASTQ files from the 1000 Genomes website, and the records look like this:

@VAB_BARB_20080515_2_Broad_3b_150_2276_6_37_F3
T21123313121322132222331223311312223
+
!'$'&(,&#%4,('%$*$,##+0#-+($)#$%$$&)

Why doesn't the second line show A, T, G and C? Or do they use 1, 2, 3 to represent the letters?

Also, this data comes from files named XXXX.fastq.gz, while the "normal" data comes from files named XXXX.recal.fastq.gz.

So this makes me ask: what does "recal" mean?

thx

fastq genome • 2.0k views

updated 12.3 years ago by Gww ★ 2.7k • written 12.3 years ago by Bioscientist ★ 1.7k

score 8 · Answer 1 · 2012-01-06

Those reads are in colorspace rather than basespace, which means the sequencing was done with Applied Biosystems SOLiD technology. The leading T is the last base of the sequencing primer, and each digit (0-3) encodes the transition between two adjacent bases rather than a single base. Aligners such as BioScope, SHRiMP and BWA can align reads in that format. More information about the di-base encoding can be found here.
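
To make the di-base encoding concrete, here is a minimal Python sketch of a naive colorspace-to-basespace translation. It relies on the standard SOLiD color table (0 = same base, e.g. AA; 1 = AC/CA/GT/TG; 2 = AG/GA/CT/TC; 3 = AT/TA/CG/GC), which is equivalent to XOR-ing 2-bit base codes. The function name `decode_colorspace` is my own, not part of any tool, and this is for illustration only: a single miscalled color corrupts every base after it, which is why the aligners above work in colorspace directly instead of converting first.

```python
# 2-bit codes under which a SOLiD color equals the XOR of two adjacent bases.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def decode_colorspace(read: str) -> str:
    """Translate a colorspace read (primer base + digits 0-3) to bases.

    The leading primer base is only the starting point and is not part of
    the returned sequence. A missing color (usually '.') makes every
    following base unknown, so the remainder is emitted as 'N'.
    """
    previous = BASE_TO_BITS[read[0].upper()]
    bases = []
    for position, color in enumerate(read[1:]):
        if color not in "0123":
            # Everything downstream of an unknown color is undetermined.
            bases.append("N" * (len(read) - 1 - position))
            break
        previous ^= int(color)  # apply the transition to get the next base
        bases.append(BITS_TO_BASE[previous])
    return "".join(bases)


if __name__ == "__main__":
    # The read from the question.
    print(decode_colorspace("T21123313121322132222331223311312223"))
```

For example, the first six colors of the read in the question, `T211233`, decode to `CACTAT`.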