Question

Mismatch on first base gives wrong TSS

1

Entering edit mode

8.4 years ago

Chirag Nepal ★ 2.4k

I downloaded mapped SAM/BAM files from modEncode CAGE-seq data.

I looked at SAM files and observed some inconsistency on the way how the first base (transcription start sites) is called. When there is a mismatch on first base, by either "N" or due to insertion of "G" on first base, that will shift TSS by one base. I have shown a example of where transcription start site (TSS) is correct and another example where TSS has shifted by one base due to mismatch on first base.

Correct mapping

TSS is 1450200

chr3R 1450200 1450227 HWUSI-EAS1720_0021_FC63A8AAAXX:2:4:4359:14455#0/1 0 + 0 27M * 0 0 CTTTCCGTGCGGTTCGTAAAAATGACT caffffcaffffcffffcfdfff_efd PQ:i:16

chr3R 1450200 1450227 HWUSI-EAS1720_0021_FC63A8AAAXX:2:6:8037:15660#0/1 0 + 0 27M * 0 0 CTTTCCGTGCGGTTCGTAAAAATGACT hhhhhhhhhhfhhhhfeffhhgfdehg PQ:i:19

Incorrect TSS due to mismatch on first base

TSS is 1450199 instead of 1450200

Here,either "N" is inserted on first base or "G" added by CAGE protocol. Both of these result in TSS being different by one nucleotide.

chr3R 1450199 1450226 HWUSI-EAS1720_0021_FC63A8AAAXX:2:69:12174:14031#0/1 0 + 0 27M * 0 0 NCTTTCCGTGCGGTTCGTAAAAATGAC Geeedefadffffdfffdffffadfef PQ:i:1

chr3R 1450199 1450226 HWUSI-EAS1720_0021_FC63A8AAAXX:2:84:2284:6722#0/1 0 + 0 27M * 0 0 NCTTTCCGTGCGGTTCGTAAAAATGAC F]b``bffcfcggcggfd__febbbBB PQ:i:0

chr3R 1450199 1450226 HWUSI-EAS1720_0021_FC63A8AAAXX:2:100:6796:15301#0/1 0 + 0 27M * 0 0 GCTTTCCGTGCGGTTCGTAAAAATGAC Qfffcfffdffffbfffdccadd^Wb` PQ:i:0

How can i correct these in my SAM file ? Basically if there is a mismatch on first base, TSS info should be corrected. So on this case, if "N" or "G" is clipped, it's TSS should be 1450200.

I looked at CIGAR information, but it appears to be same "27M" on all. Any help is appreciated. Thank you !!!

first-base mismatch clip-mismatch CIGAR CAGE • 2.7k views

ADD COMMENT • link updated 8.2 years ago by Charles Plessy ★ 2.9k • written 8.4 years ago by Chirag Nepal ★ 2.4k

0

Entering edit mode

It sounds like the simplest approach would be to re-map using local alignments.

ADD REPLY • link 8.4 years ago by Brian Bushnell 20k

0

Entering edit mode

Could you please elaborate it. What you meant by local alignment ? How is that different from mapping by bwa/bowtie2 ? I am thinking of remapping using bowtie2, but differences in first base due to mismatch will again show up, isn't it.

ADD REPLY • link 8.4 years ago by Chirag Nepal ★ 2.4k

0

Entering edit mode

Local alignment: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#end-to-end-alignment-versus-local-alignment

ADD REPLY • link 8.4 years ago by Brian Bushnell 20k

0

Entering edit mode

What is the reference nucleotide at position 1450199? To me, it looks like a proper alignment with all matches. What makes you think the first position is a mismatch with N or G?

Maybe you can add correction by looking at the distribution of 5' ends and inferring the most prevalent position or a narrow region? Adding info about N positions to that correction is another approach.

Also, your sequence around TSS (CTTTCCGG) looks strange to me. What are the species and gene you are studying?

ADD REPLY • link 8.4 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

Reference nucleotide at 1450199 is also C. The insertion of "G" at first base is known bias of CAGE (though it does not occur in all reads), and it is filtered out to get correct information about TSS. It seems modEncode did not correct this.

TSS (CTTTCCGG), is perfectly valid. The first nucleotide (TSS) of Ribosomal protein genes is anchored by "C", and surrounded by pyrimidine sequences, which is known for long time. This was an example of ribosomal protein from drosophila melaogaster.

ADD REPLY • link 8.4 years ago by Chirag Nepal ★ 2.4k

score 3 · Accepted Answer · 2017-04-22

Hi Chirag,

on my side, I usually work on the CAGE data after it is transformed to a BED format, and during that transformation I apply a naive correction for the extra Gs. My workflow is paired-end, so it does not directly apply to the modENCODE data, but for the sake of the example, here is an extract from the source code of the pairedBamToBed12 tool that we are using.

void SimpleGCorrection(const BamAlignment& bam1, const BamAlignment& bam2, const string strand,
                       unsigned int& alignmentStart, unsigned int& alignmentEnd,
                       vector<int>& blockLengths, vector<int>& blockStarts) {
    string md;
    if ( (strand == "+") & (FirstBase(bam1) == "G") )  {
        bam1.GetTag("MD", md);
        md = md.substr(0,2);
        if (md == "0A" || md == "0C" || md == "0T")
            CutOneLeft(alignmentStart, blockLengths, blockStarts);
    }
    if ( (strand == "-") & (LastBase(bam2) == "C") ) {
        bam2.GetTag("MD", md);
        md = md.substr(md.length() -2, 2);
        if (md == "A0" || md == "G0" || md == "T0")
            CutOneRight(alignmentEnd, blockLengths);
    }
}

And here is the disclaimer that I added in our documentation:

NOTE: CAGE methods sometimes add an extra G at the beginning of the cDNAs (see http://population-transcriptomics.org/nanoCAGE/#extra-G). This leads to 1-base shifts of some TSS peaks. From version 1.2, pairedBamToBed12 provides an experimental option, -extraG to shift the start or end (according to the strand) of the output of one base when a G mismatch is detected on the first base of Read1. This is a very naive implementation and a more detailed description of the problem may be found in the supplemental material of the FANTOM3 main article. Thus, the -extraG option available here is not entierly satisfactory and may be removed in the future. A better approach for instance would be to post-process the BAM file instead of implementing a correction here.

I hope it helps.