Hi, this is reposted from Biology StackExchange, where I was advised to ask this question here.
I am interpreting CAGE data. I have noticed that in the SAM files I am analysing, there are many pairs of lines of the form (e.g.)
HWUSI-EAS1720_0023_FC63B9YAAXX:6:120:19103:21529#0/1 0 chrU 9824432 0 27M * 0 0 GTCTTGTACCGACGACAGATCTTTCAA KWXXWWWWWWWTWTWWYYYYY______ PQ:i:0XQ:f:0.097178 XP:f:0.190338
HWUSI-EAS1720_0023_FC63B9YAAXX:6:120:19103:21529#0/1 0 chrU 9824433 0 1S26M * 0 0 GTCTTGTACCGACGACAGATCTTTCAA KWXXWWWWWWWTWTWWYYYYY______ PQ:i:0XQ:f:0.00486 XP:f:0.00952
Note that the CIGAR string (6th column) is 27M—indicating 27 aligned bases—in the first line, and 1S26M—indicating one soft-clipped base and 26 aligned bases—in the second line, while the mapping position (4th column) is one base further along in the second line than in the first line. The actual reference DNA at position 9824432 of 'chrU' is GTCTTGTACCGACGACAGATCTTTCAA, i.e. an exact match to the read sequence (10th column) in both lines.
My question is: how should I interpret pairs of lines such as these? Note that XP, the conditional probability of the read coming from the given position (last column) differs between the two possible alignments, with XP for the first line almost exactly 20 times the XP for the second line.
I notice that these pairs seem to only appear when the first base of the read sequence is a guanine (or the last base is a cytosine, if the read is marked as reverse-complemented). So, are these pairs of lines perhaps because of CAGE's bias for nonspecific guanines at the 5' end of tags, and this information has been given to the alignment algorithm, indicating (for this example) that more likely this given read corresponds to transcription start at position 9824432, but maybe it is an erroneous read due to the guanine bias, off by one base?