Hi, this is reposted from Biology StackExchange, where I was advised to ask this question here.
I am interpreting CAGE data. I have noticed that in the SAM files I am analysing, there are many pairs of lines of the form (e.g.)
HWUSI-EAS1720_0023_FC63B9YAAXX:6:120:19103:21529#0/1 0 chrU 9824432 0 27M * 0 0 GTCTTGTACCGACGACAGATCTTTCAA KWXXWWWWWWWTWTWWYYYYY______ PQ:i:0XQ:f:0.097178 XP:f:0.190338 HWUSI-EAS1720_0023_FC63B9YAAXX:6:120:19103:21529#0/1 0 chrU 9824433 0 1S26M * 0 0 GTCTTGTACCGACGACAGATCTTTCAA KWXXWWWWWWWTWTWWYYYYY______ PQ:i:0XQ:f:0.00486 XP:f:0.00952
Note that the CIGAR string (6th column) is 27M - indicating 27 aligned bases - in the first line, and 1S26M - indicating one soft-clipped base and 26 aligned bases - in the second line, while the mapping position (4th column) is one base further along in the second line than in the first line. The actual reference DNA at position 9824432 of 'chrU' is GTCTTGTACCGACGACAGATCTTTCAA, i.e. an exact match to the read sequence (10th column) in both lines.
My question is: how should I interpret pairs of lines such as these? Note that XP, the conditional probability of the read coming from the given position (last column) differs between the two possible alignments, with XP for the first line almost exactly 20 times the XP for the second line.
I notice that these pairs seem to only appear when the first base of the read sequence is a guanine (or the last base is a cytosine, if the read is marked as reverse-complemented). So, are these pairs of lines perhaps because of CAGE's bias for nonspecific guanines at the 5' end of tags, and this information has been given to the alignment algorithm, indicating (for this example) that more likely this given read corresponds to transcription start at position 9824432, but maybe it is an erroneous read due to the guanine bias, off by one base?
Do you have more information on how the bam files were treated.? The G bias is corrected post-alignment so may be the bam file has been processed to correct for it.. but then i don't know why both the copies of the same read (corrected/uncorrected) are kept.
I don't have any more information, I'm afraid. I would think that both copies are kept because when looking at a CAGE read that starts with a G and which matches a genome sequence starting with a G, you can't be sure if the transcriptional start site is actually that G residue or one residue later, with the G having been added to the 5' of the read artefactually.
Yah, one could do that. Though i think better would be to remove only the G's which are not matched to reference, or use a probabilistic approach like in this paper. I think in this case you should check the NH tag to make sure these double mapping reads are marked as multi-mapping and you can count only one copy of the tag, while running featurecounts or so.
Reads that have mismatch "G" in the first base is likely due to CAGE-seq bias. In such case, simply trim the first base if there is a mismatch.