I am new to use spliced-aligner like tophat2, I used Tophat2 v2.0.13 to do alignment on paired-end 50bp data. The command I used is straightforward
tophat2 -o ./MCF-7.debug.tophat2 -G $GTF -p 16 --fusion-search --fusion-min-dist 200000 ${bowtie2_index} ${fastqfiles}
However, when I am trying to get the unique and primary alignment from the result, I found something weird.
1. I found that a multiple alignment with two primary alignment . It seems contradictory because there should be only one primary alignment in a multiple alignment as sam format specification says:
Multiple mapping: The correct placement of a read may be ambiguous, e.g. due to repeats. In this case, there may be multiple read alignments for the same read. One of these alignments is considered primary.
See below:
samtools view -f 64 MCF-7.debug.tophat2.bam | grep --color "HWI-EAS418:2:98:923:1852"
---------------------------------------------------------
HWI-EAS418:2:98:923:1852 339 chr13 60240952 3 29M = 60240894 -79557 CCCTTCTGTCAACTGCACTTTCTGATTTT @7;>BBB@ABBACBAB>ABABBBBCABBB AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0 XF:Z:1 chr13-chr17 60240952 29M60320430F21M CCCTTCTGTCAACTGCACTTTCTGATTTTCAGGCAGTGTCTTCCATAAAT @7;>BBB@ABBACBAB>ABABBBBCABBBCC@BCBBBABCBBBCCCCCCB XP:Z:chr13 60240894 50M NH:i:2 CC:Z:= CP:i:60240952 HI:i:0
HWI-EAS418:2:98:923:1852 83 chr13 60240952 3 29M = 60240894 -50 CCCTTCTGTCAACTGCACTTTCTGATTTT @7;>BBB@ABBACBAB>ABABBBBCABBB AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0 XF:Z:1 chr13-chr17 60240952 29M58007535F21m CCCTTCTGTCAACTGCACTTTCTGATTTTCAGGCAGTGTCTTCCATAAAT @7;>BBB@ABBACBAB>ABABBBBCABBBCC@BCBBBABCBBBCCCCCCB XP:Z:chr13 60240894 50M NH:i:2 HI:i:1
HWI-EAS418:2:98:923:1852 83 chr17 58007515 3 21M chr13 60240894 -50 ATTTATGGAAGACACTGCCTG BCCCCCCBBBCBABBBCB@CC AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0 XF:Z:2 chr13-chr17 60240952 29M58007535F21m CCCTTCTGTCAACTGCACTTTCTGATTTTCAGGCAGTGTCTTCCATAAAT @7;>BBB@ABBACBAB>ABABBBBCABBBCC@BCBBBABCBBBCCCCCCB XP:Z:chr13 60240894 50 NH:i:2 HI:i:1
HWI-EAS418:2:98:923:1852 339 chr17 60320430 3 21M chr13 60240894 -79557 CAGGCAGTGTCTTCCATAAAT CC@BCBBBABCBBBCCCCCCB AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0 XF:Z:2 chr13-chr17 60240952 29M60320430F21M CCCTTCTGTCAACTGCACTTTCTGATTTTCAGGCAGTGTCTTCCATAAAT @7;>BBB@ABBACBAB>ABABBBBCABBBCC@BCBBBABCBBBCCCCCCB XP:Z:chr13 60240894 50 NH:i:2 CC:Z:= CP:i:60240952 HI:i:0
----------------------------------------------------------------
So how to evaluate these alignments?
2. The NH tag Meaning?
I found NH:i:2
are set in these alignments showed above, it seems that it is the number of PRIMARY alignment . But when I check the second in the pair, the NH tag is set to NH:i:2
when there are only ONE PRIMARY alignment, see as below:
samtools view -F 64 MCF-7.debug.tophat2.bam | grep --color "HWI-EAS418:2:98:923:1852"
------------------------------------
HWI-EAS418:2:98:923:1852 419 chr13 60240894 3 50M = 60240952 79557 GCTCCTTGGCGACTGGAGTCCTTGTTGAGTTGCAATTGATATTGTAGTGT BCCBBBBBBBB?>B?B?B@BBAAB>BB?B@BBB>@?BB@B?BBB=8@9B> AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XP:Z:chr13-chr17 60240952 29M60320429F21M NH:i:2 CC:Z:= CP:i:60240894 HI:i:0
HWI-EAS418:2:98:923:1852 163 chr13 60240894 3 50M = 60240952 50 GCTCCTTGGCGACTGGAGTCCTTGTTGAGTTGCAATTGATATTGTAGTGT BCCBBBBBBBB?>B?B?B@BBAAB>BB?B@BBB>@?BB@B?BBB=8@9B> AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU XP:Z:chr13-chr17 60240952 29M58007534F21m NH:i:2 HI:i:1
----------------------------------------------------------
Thanks in advance
This is not a bug, it appears in other aligner such as gsnap. see that:
Ah, you appear to have duplicate read names associated with different sequences (or you have paired end reads and are running things incorrectly).