It is not entirely intuitive (at least to me) how Bowtie handles colorspace data. Here is a reference in the manual: http://goo.gl/O53Bk
In short, it trims the T primer base and the adjacent color. One would assume that for a 35 bp read this would allow a maximum alignment of only 33 bp in NT space, and for a 50 bp read, 48 bp.
I have seen other explanations, such as: "The ones that align are 48 nucleotides, 50 colors - the first is chopped off -> makes 49 colors what represents four nucleotide strings with length of 50. Then you leave out the two nucleotides that are only covered by one color (first and last), ending up with 48 nucleotides." (Excerpt from SeqAnswers forum: http://goo.gl/mGtaf)
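To make the counting in that excerpt concrete, here is a small Python sketch that follows the quoted explanation step by step. It is not taken from Bowtie; the function name is made up for illustration, and it only reproduces the arithmetic as the forum post describes it.

```python
def lengths_per_quoted_explanation(n_colors):
    """Follow the quoted forum explanation for a read with n_colors colours.

    Hypothetical helper: it mirrors the arithmetic in the excerpt,
    not Bowtie's actual implementation.
    """
    # The first colour (the one next to the primer base) is chopped off.
    colors_after_trim = n_colors - 1
    # k colours are transitions between k + 1 adjacent nucleotides.
    nucleotides_spanned = colors_after_trim + 1
    # The first and last nucleotides are covered by only one colour each,
    # so the excerpt leaves them out.
    nucleotides_kept = nucleotides_spanned - 2
    return colors_after_trim, nucleotides_spanned, nucleotides_kept


for n in (35, 50):
    print(n, lengths_per_quoted_explanation(n))
```

Under that reading, a 50-colour read gives 49 colours, 50 nucleotides spanned and 48 kept, and a 35-colour read gives 34, 35 and 33.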
So, my question is: how many bases (both NT and CS) are used in a colorspace alignment for a read of length N in CS?
REF = sequence in the genome
R&C = colour-space based on base in position 1 being that from the genome
So there are 50 colour-space values after the initial primer base (line length = 51 with the T). Although they say to ignore the first colour after the primer, what they mean is: use it to decode the first base of the insert (T3 => A), but recognise that the T does not align with the reference. In this case the reference base at that position is C, and going from C to A is colour 1 (C1 => A). Hence the 3 is wrong for the reference but correct for the sequenced insert. So chuck the T and the artificial colour-space transition (T->A) that goes with it, as it will break when compared against the real reference base of C.
So the original colour-space in the clustal alignment starts: T30...
but in the genome-corrected R&C line, it starts: C10...
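The T30 to C10 change can be traced in a few lines of Python. This is only a sketch: it assumes the usual SOLiD two-base code (A=0, C=1, G=2, T=3, with the colour being the XOR of two adjacent bases), and recode_first_colour is a made-up helper name, not something from Bowtie or any other tool.

```python
BASES = "ACGT"  # assumed coding: A=0, C=1, G=2, T=3


def colour(a, b):
    """Colour for the transition from base a to base b (XOR of the codes)."""
    return BASES.index(a) ^ BASES.index(b)


def recode_first_colour(cs_read, ref_base):
    """Swap the primer base for the reference base and fix the first colour.

    Hypothetical helper for illustration only.
    """
    primer, first_colour, rest = cs_read[0], int(cs_read[1]), cs_read[2:]
    # Decode the first base of the insert from the primer and first colour.
    first_insert_base = BASES[BASES.index(primer) ^ first_colour]
    # Re-encode the transition using the base that precedes it in the genome.
    return f"{ref_base}{colour(ref_base, first_insert_base)}{rest}"


print(colour("T", "A"))                 # transition from the primer T to A
print(colour("C", "A"))                 # transition from the reference C to A
print(recode_first_colour("T30", "C"))  # start of the read, genome-corrected
```

The first colour depends on which base sits in front of the insert, which is why the value read off the sequencer (3, relative to T) differs from the value the genome would give (1, relative to C).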
When performing a color space alignment there are different approaches one may take. The approach Bowtie uses is to encode the reference genome as color space transitions. For example, an A followed by a T will be red, and so on. While there is a single choice of color when going from letter space to color space, the reverse process allows four different decodings, one for each possible starting base.
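As a rough illustration of that asymmetry, the sketch below encodes a made-up reference into colours and then decodes the colours back from each of the four possible starting bases. It assumes the usual SOLiD two-base code (A=0, C=1, G=2, T=3, colour = XOR of adjacent bases); the function names and the example sequence are mine, not Bowtie's.

```python
BASES = "ACGT"  # assumed coding: A=0, C=1, G=2, T=3


def encode(seq):
    """Colour string for the transitions between adjacent bases of seq."""
    return "".join(
        str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:])
    )


def decode(first_base, colors):
    """Rebuild a nucleotide string from a starting base and a colour string."""
    seq = first_base
    for c in colors:
        seq += BASES[BASES.index(seq[-1]) ^ int(c)]
    return seq


reference = "CATGA"                 # made-up example sequence
colors = encode(reference)          # one unique colour string
candidates = [decode(b, colors) for b in BASES]

print(colors)
for candidate in candidates:
    print(candidate)
```

Only one of the four decoded strings is the original sequence, so the colours alone are not enough to get back to letters; a known starting base is needed.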
Reads are matched in colorspace, but before alignment the parts that cannot possibly match the transitions in the genome need to be trimmed off. The first character is the primer base, not part of the sequenced template, so it is discarded. The second character is a color, but it encodes a transition from that fixed primer base, so it also needs to be trimmed away. The remaining colors represent transitions that may be found in the genome and are used in the alignment.
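A toy version of the trimming step might look like the following. It assumes the usual SOLiD two-base code (A=0, C=1, G=2, T=3, colour = XOR of adjacent bases) and uses a plain substring search on a made-up reference; Bowtie itself works on an index, so this only illustrates which part of the read can match, not how the search is done.

```python
BASES = "ACGT"  # assumed coding: A=0, C=1, G=2, T=3


def encode(seq):
    """Colour string for the transitions between adjacent bases of seq."""
    return "".join(
        str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:])
    )


def trim_primer_and_first_colour(cs_read):
    """Drop the primer base and the colour that depends on it (hypothetical helper)."""
    if len(cs_read) < 2 or cs_read[0] not in BASES:
        raise ValueError("expected a primer base followed by colours")
    return cs_read[2:]


reference = "GGCAACTGTT"            # made-up genome fragment
cs_reference = encode(reference)    # reference as colour transitions

cs_read = "T30121"                  # primer T + colours for the insert AACTG
trimmed = trim_primer_and_first_colour(cs_read)

# Exact substring search stands in for the real alignment step.
hit = cs_reference.find(trimmed)
print(trimmed, hit, reference[hit:hit + len(trimmed) + 1])
```

In this toy case the trimmed colours are found at transition 3 of the reference, and the nucleotides they span start with the first base of the insert.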
Finally, I think (and this would need to be verified separately) that this does not mean the final alignment will be one base shorter. Once the alignment is found, the software should be able to match the one base that was trimmed away; it just won't use it while locating the read.