I downloaded mapped SAM/BAM files from modEncode CAGE-seq data.
I looked at SAM files and observed some inconsistency on the way how the first base (transcription start sites) is called. When there is a mismatch on first base, by either "N" or due to insertion of "G" on first base, that will shift TSS by one base. I have shown a example of where transcription start site (TSS) is correct and another example where TSS has shifted by one base due to mismatch on first base.
TSS is 1450200
chr3R 1450200 1450227 HWUSI-EAS1720_0021_FC63A8AAAXX:2:4:4359:14455#0/1 0 + 0 27M * 0 0 CTTTCCGTGCGGTTCGTAAAAATGACT caffffcaffffcffffcfdfff_efd PQ:i:16
chr3R 1450200 1450227 HWUSI-EAS1720_0021_FC63A8AAAXX:2:6:8037:15660#0/1 0 + 0 27M * 0 0 CTTTCCGTGCGGTTCGTAAAAATGACT hhhhhhhhhhfhhhhfeffhhgfdehg PQ:i:19
Incorrect TSS due to mismatch on first base
TSS is 1450199 instead of 1450200
Here,either "N" is inserted on first base or "G" added by CAGE protocol. Both of these result in TSS being different by one nucleotide.
chr3R 1450199 1450226 HWUSI-EAS1720_0021_FC63A8AAAXX:2:69:12174:14031#0/1 0 + 0 27M * 0 0 NCTTTCCGTGCGGTTCGTAAAAATGAC Geeedefadffffdfffdffffadfef PQ:i:1
chr3R 1450199 1450226 HWUSI-EAS1720_0021_FC63A8AAAXX:2:84:2284:6722#0/1 0 + 0 27M * 0 0 NCTTTCCGTGCGGTTCGTAAAAATGAC F]b``bffcfcggcggfd__febbbBB PQ:i:0
chr3R 1450199 1450226 HWUSI-EAS1720_0021_FC63A8AAAXX:2:100:6796:15301#0/1 0 + 0 27M * 0 0 GCTTTCCGTGCGGTTCGTAAAAATGAC Qfffcfffdffffbfffdccadd^Wb` PQ:i:0
How can i correct these in my SAM file ? Basically if there is a mismatch on first base, TSS info should be corrected. So on this case, if "N" or "G" is clipped, it's TSS should be 1450200.
I looked at CIGAR information, but it appears to be same "27M" on all. Any help is appreciated. Thank you !!!