High number of secondary alignments with HISAT2
0
1
Entering edit mode
6.2 years ago
JJ ▴ 670

Hi all,

So I am working with a public dataset and I am a bit worried about the high number of secondary alignment I get. Here is the samtools flagstat output. I usually get a much lower number. What could be the cause of such a high number of secondary alignments? What number is still acceptable? Do I need to dump this dataset? This is human mRNA. I used HISAT2 with default settings.

53638643 + 0 in total (QC-passed reads + QC-failed reads)
27516323 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
50786919 + 0 mapped (94.68% : N/A)
26122320 + 0 paired in sequencing
13061160 + 0 read1
13061160 + 0 read2
21125408 + 0 properly paired (80.87% : N/A)
22335136 + 0 with itself and mate mapped
935460 + 0 singletons (3.58% : N/A)
213596 + 0 with mate mapped to a different chr
131241 + 0 with mate mapped to a different chr (mapQ>=5)
RNA-Seq • 3.6k views
ADD COMMENT
0
Entering edit mode

What is the reference where you mapped these RNASeq reads on?

ADD REPLY
0
Entering edit mode

Current Gencode release - human primary assembly and corresponding gtf were used to build the index. I also mapped it using the standard index provided by HISAT2 - almost the same result.

ADD REPLY
1
Entering edit mode

Anyway, with the default --score-min, --mp and gap opening / extending penalties you allow for quite some mismatches / gaps in your read mapping. This means that you likely get secondary alignments on the second copies of those genes, or on similar regions. Try to increase the penalties or to lower the accepted alignment score. If only the number of reads with secondary alignments goes down, then you're good, if also the "mapped" goes down then you're being too strict. This might be a workaround to test this.

ADD REPLY
0
Entering edit mode

So I set the --score-min to --score-min L,0,-0.6 and I get now this:

44779604 + 0 in total (QC-passed reads + QC-failed reads)
18657284 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
44049017 + 0 mapped (98.37% : N/A)
26122320 + 0 paired in sequencing
13061160 + 0 read1
13061160 + 0 read2
23705620 + 0 properly paired (90.75% : N/A)
25214764 + 0 with itself and mate mapped
176969 + 0 singletons (0.68% : N/A)
274478 + 0 with mate mapped to a different chr
212262 + 0 with mate mapped to a different chr (mapQ>=5)

Better. Are there any guidelines for this parameter? Thanks a lot, Is there any documentation on this parameter apart from what is stated in the manual?

ADD REPLY
2
Entering edit mode

I think the documentation is made on discussions on this forum :) What I can only suggest you is to spend some time on this parameter every time you map reads. Multiply the read length by the third number (-0.6) and then subtract the second number (0): this way you get the maximum penalty. For reads of 125 nt, if I set it to L,2,-0.5 I will get -((125*0.5)+2) = -77. If my mismatch penalty is --mp 7,7 I will get at most 11 mismatches. If my gap penalty is --rdg 7,5 I will get a gap of most length 14 (77-7 opening, 70/5 for extension).

ADD REPLY

Login before adding your answer.

Traffic: 2386 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6