Question

How Does Bowtie Handle Colorspace Data?

12.8 years ago

Bioinfo ▴ 330

It is not entirely intuitive (at least to me) how Bowtie handles colorspace data. Here is a reference in the manual: http://goo.gl/O53Bk

In short, it trims the T primer base and the adjacent color. One would assume that on a 35 bp read this would allow for a maximum alignment of only 33 bp in NT space, and on a 50 bp read, 48 bp.

I have seen other explanations, such as: "The ones that align are 48 nucleotides, 50 colors - the first is chopped off -> makes 49 colors what represents four nucleotide strings with length of 50. Then you leave out the two nucleotides that are only covered by one color (first and last), ending up with 48 nucleotides." (Excerpt from SeqAnswers forum: http://goo.gl/mGtaf)
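The arithmetic in that quoted explanation can be written out as a small sketch. It only restates the counting in the quote; it is not taken from Bowtie's source, and the function name and dictionary keys are made up for illustration.

```python
def colorspace_lengths(n_colors: int) -> dict:
    """Length bookkeeping for a read with n_colors colors after the primer base.

    Follows the counting in the quoted SeqAnswers explanation (an assumption,
    not something read out of Bowtie itself).
    """
    if n_colors < 3:
        raise ValueError("need at least 3 colors for this counting to make sense")
    aligned_colors = n_colors - 1        # the color next to the primer is chopped off
    spanned_nt = aligned_colors + 1      # k colors describe k + 1 adjacent nucleotides
    double_covered_nt = spanned_nt - 2   # drop the first and last nucleotide (one color each)
    return {
        "aligned_colors": aligned_colors,
        "spanned_nt": spanned_nt,
        "double_covered_nt": double_covered_nt,
    }


for n in (35, 50):
    print(n, colorspace_lengths(n))
```

Under this counting a read with N colors gives N-1 colors and N-2 nucleotides, which is where the 33 and 48 above come from.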

So, my question is: how many bases (both NT and CS) are used in a colorspace alignment for a read of length N in CS?

solid bowtie alignment short aligner • 4.6k views

updated 12.5 years ago by Istvan Albert 100k • written 12.8 years ago by Bioinfo ▴ 330

score 1 · Answer 1 · 2011-10-31

BioScope is better for this kind of thing as they include colour-space tags and qualities in their BAM files - and so you see this sort of thing in their SAM lines:

650_283_1979 pP1 chr20 10000065 150 50M = 9998966 -1149 AACATTCTAAAATAATATCAATTTCTTTCTCTCCTTGCCATTTTTACAAA IIIIIIIIII=IIIIIIIICBBBIIIIIIIIIIIIGIIIIFFIIIIIII; RG:Z:sys.S1 CS:Z:T30113022300033033321030022002222202013013000031100 CQ:Z:9;A7<<9:=;1->=;78;95/4/4:957<<;77:;53<:8616;8<:0<; MD:Z:50M NM:i:0

Or as a clustal-like alignment:

               1         2         3         4         5
      123456789012345678901234567890123456789012345678901
      T30113022300033033321030022002222202013013000031100
      -AACATTCTAAAATAATATCAATTTCTTTCTCTCCTTGCCATTTTTACAAA
REF:  CAACATTCTAAAATAATATCAATTTCTTTCTCTCCTTGCCATTTTTACAAA
R&C:  C10113022300033033321030022002222202013013000031100

REF = sequence in the genome
R&C = colour-space based on the base in position 1 being that from the genome

So there are 50 colour-space values after the initial primer base (line length = 51 with the T). Although they say to ignore the first colour after the primer, what they mean is: use it to decode the first base (T3 => A), but recognise that the T itself does not align with the reference. In this case the reference base is C, and going from C to A is colour 1 (C1 => A). Hence the 3 is wrong relative to the reference, but correct for the sequenced insert. So chuck the T and the artificial colour-space transition (T->A) that goes with it, as it will break when compared against the real reference base of C.

So the original colour-space in the clustal alignment starts: T30... but in the genome-corrected R&C line, it starts: C10...
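To make the T3 => A versus C1 => A point concrete, here is a minimal decoding sketch. It assumes the standard SOLiD two-base code, where with A=0, C=1, G=2, T=3 the colour between two adjacent bases is the XOR of their codes. The names `BASES` and `decode` are mine, not BioScope's or Bowtie's.

```python
# Assumed SOLiD colour code: with A=0, C=1, G=2, T=3 the colour between two
# adjacent bases is the XOR of their codes (e.g. A/T -> 3, C/A -> 1).
BASES = "ACGT"


def decode(start_base: str, colours: str) -> str:
    """Decode colours into nucleotides, given the base that precedes the first colour.

    The start base itself is not included in the result.
    """
    out = []
    prev = BASES.index(start_base)
    for c in colours:
        prev ^= int(c)          # applying a colour to the previous base gives the next base
        out.append(BASES[prev])
    return "".join(out)


# CS tag from the SAM line above: primer base T followed by 50 colours.
CS = "T30113022300033033321030022002222202013013000031100"

read = decode(CS[0], CS[1:])
print(len(read), read)

# The first base of the insert is A either way, but the colour that gets you
# there depends on the preceding base: primer T needs 3, reference C needs 1.
print(decode("T", "3"), decode("C", "1"))
```

Decoding all 50 colours from the primer T reproduces the 50-base sequence in the SAM line, which is why the first colour is still useful for decoding even though it cannot be compared with the reference.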

Chris

score 1 · Answer 2 · 2011-11-01

Here is how I understand this matter:

When performing a color space alignment there are different approaches that one may take. The approach that bowtie uses is to encode the reference genome as color space transitions. For example, an A followed by a T will be red, and so on. While there is a single choice of color when going from letter space to color space, the reverse process allows four different decodings, one for each possible starting base.
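The asymmetry between encoding and decoding can be sketched as follows. This assumes the usual SOLiD two-base code (with A=0, C=1, G=2, T=3 the color is the XOR of the two adjacent bases, so A followed by T gives 3). The helper names and the toy sequence are made up for illustration and have nothing to do with how bowtie is implemented.

```python
BASES = "ACGT"  # assumed code: A=0, C=1, G=2, T=3; color = XOR of adjacent bases


def encode(seq: str) -> str:
    """Letter space -> color space: one color per adjacent pair of bases."""
    codes = [BASES.index(b) for b in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def decode(start_base: str, colors: str) -> str:
    """Color space -> letter space, for a chosen starting base (included in the result)."""
    prev = BASES.index(start_base)
    out = [start_base]
    for c in colors:
        prev ^= int(c)
        out.append(BASES[prev])
    return "".join(out)


colors = encode("ATGGC")   # toy sequence, not from any real genome
print(colors)

# One color string, four possible nucleotide strings depending on the first base.
candidates = [decode(b, colors) for b in BASES]
print(candidates)
```

All four candidates encode back to the same color string, so colors alone cannot tell them apart; a known starting base is needed to pick one.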

Reads are matched in colorspace, but before aligning a read the aligner needs to trim off the parts that cannot possibly match the transitions in the genome. The first character is the primer base, which is not part of the sequenced template, so it needs to be discarded. The second character is a color, but it encodes a transition from that fixed primer base, so it also needs to be trimmed away. The remaining colors represent transitions that may be found in the genome and are used in the alignment.
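Here is a rough sketch of that trimming step, using the read and reference shown in the other answer. A plain exact string search stands in for the real index lookup, so this is only an illustration of the idea and not how bowtie actually searches. `encode` and `trim_primer` are hypothetical helpers, and the color code (A=0, C=1, G=2, T=3, color = XOR of neighbours) is the standard SOLiD one as I understand it.

```python
BASES = "ACGT"  # assumed code: A=0, C=1, G=2, T=3; color = XOR of adjacent bases


def encode(seq: str) -> str:
    """Encode a nucleotide sequence as color transitions between adjacent bases."""
    codes = [BASES.index(b) for b in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def trim_primer(cs_read: str) -> str:
    """Drop the primer base and the color adjacent to it."""
    return cs_read[2:]


# Reference stretch and read taken from the clustal-like alignment in the other answer.
reference = "CAACATTCTAAAATAATATCAATTTCTTTCTCTCCTTGCCATTTTTACAAA"
cs_read = "T30113022300033033321030022002222202013013000031100"

ref_colors = encode(reference)          # 51 bases -> 50 colors
query = trim_primer(cs_read)            # 50 colors -> 49 colors

print(len(ref_colors), len(query))
print(ref_colors.find(query))           # position of the trimmed read in the reference colors
print(ref_colors.find(cs_read[1:]))     # untrimmed colors: the primer transition spoils the match
```

The trimmed colors are found in the color-encoded reference, while the untrimmed ones are not, because the first color describes the T to A step from the primer rather than anything in the genome.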

Finally, I think (and this would need to be verified separately) this does not mean that the final alignment will be one base shorter. Once the alignment is found, the software should be able to match the one base that was trimmed away; it just won't use it while locating the read.