Question: Tophat Produces Non-Standard Bam With Same Read Appearing With Distinct Sequences?
6.4 years ago by
user790
United States
user790 wrote:

I recently mapped reads with Tophat to the genome and a GTF file of junctions. I did not restrict the mapping to uniquely mapping reads, so multi-mapping reads are allowed.

The BAM file produced by Tophat contained a read with the same ID multiple times, which is expected. However, the read appeared in those multiple places with distinct sequences each time. The following read maps to 12 places, and here are four of those alignments:

HWI-ST333:3:1215:13855:84627#ATCACG    272    chr1    21047875    0    28M    *    0    0    AGAGATTTATACGATCTGAAGAGACACC    e^bhggfgfff^fdcfggfgccaSJ^Z^    AS:i:-12    XM:i:2    XO:i:0    XG:i:0    MD:Z:1A22A3    NM:i:2    NH:i:12    CC:Z:=    CP:i:44000382    HI:i:0
HWI-ST333:3:1215:13855:84627#ATCACG    272    chr1    44000382    0    28M    *    0    0    AGAGATTTATACGATCTGAAGAGACACC    e^bhggfgfff^fdcfggfgccaSJ^Z^    AS:i:-12    XM:i:2    XO:i:0    XG:i:0    MD:Z:1A22A3    NM:i:2    NH:i:12    CC:Z:=    CP:i:173433260    HI:i:1
HWI-ST333:3:1215:13855:84627#ATCACG    272    chr1    173433260    0    28M    *    0    0    AGAGATTTATACGATCTGAAGAGACACC    e^bhggfgfff^fdcfggfgccaSJ^Z^    AS:i:-12    XM:i:2    XO:i:0    XG:i:0    MD:Z:1A22A3    NM:i:2    NH:i:12    CC:Z:chr10    CP:i:83790182    HI:i:2
HWI-ST333:3:1215:13855:84627#ATCACG    256    chr10    83790182    0    28M    *    0    0    GGTGTCTCTTCAGATCGTATAAATCTCT    ^Z^JSaccgfggfcdf^fffgfgghb^e    AS:i:-12    XM:i:2    XO:i:0    XG:i:0    MD:Z:3T22T1    NM:i:2    NH:i:12    CC:Z:chr13    CP:i:23895518    HI:i:3

The first three occurrences of the read have the same sequence, but the fourth has a different SEQ field. According to the SAM format, the SEQ field of a multi-mapping read can be '*' after the first listed alignment to save space, but I cannot see how the very same read can appear with different sequences, as happens here. Is this a violation of the BAM/SAM format? Is it a Tophat error? Thanks.

tophat rna-seq mapping bowtie sam • 1.9k views
6.4 years ago by
Istvan Albert ♦♦ 79k
University Park, USA
Istvan Albert ♦♦ 79k wrote:

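The first three hits are on the reverse strand (FLAG 272 has the 0x10 reverse-strand bit set), whereas the last is on the forward strand (FLAG 256 does not). SAM stores SEQ as it reads on the forward reference strand, so the sequence is reverse complemented, and the quality string reversed, for the reverse-strand hits.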
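To check this on the records quoted in the question, here is a minimal Python sketch. The helper names (`reverse_complement`, `is_reverse`, `original_read`) are made up for illustration and are not part of Tophat or any SAM library; the FLAG, SEQ and QUAL values are copied from the first and fourth alignments above. It undoes the strand flip for each record and compares the results.

```python
# Complement table for DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# SAM FLAG bit 0x10: SEQ is stored reverse complemented.
REVERSE_FLAG = 0x10


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_reverse(flag):
    """True if the alignment is on the reverse strand."""
    return bool(flag & REVERSE_FLAG)


def original_read(flag, seq, qual):
    """Recover the read as sequenced: undo the reverse complement of SEQ
    and the reversal of QUAL when the reverse-strand bit is set."""
    if is_reverse(flag):
        return reverse_complement(seq), qual[::-1]
    return seq, qual


# (FLAG, SEQ, QUAL) from the first and fourth alignments in the question.
first = (272, "AGAGATTTATACGATCTGAAGAGACACC", "e^bhggfgfff^fdcfggfgccaSJ^Z^")
fourth = (256, "GGTGTCTCTTCAGATCGTATAAATCTCT", "^Z^JSaccgfggfcdf^fffgfgghb^e")

# Both records should describe the same underlying read.
print(original_read(*first) == original_read(*fourth))
```

The comparison comes out equal, so the differing SEQ fields are the same read seen from opposite strands, not a format violation.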


If your BAMs are sorted, what is the best way to get the alignments of a multi-mapping read ordered by their quality (as judged by Tophat)?

written 6.4 years ago by user790

You should ask the above as a new question; adding it as a comment to an answer will not help get it answered.

written 6.4 years ago by Istvan Albert ♦♦ 79k