Question

High number of secondary alignments with HISAT2

1

Entering edit mode

6.2 years ago

JJ ▴ 670

Hi all,

So I am working with a public dataset and I am a bit worried about the high number of secondary alignment I get. Here is the samtools flagstat output. I usually get a much lower number. What could be the cause of such a high number of secondary alignments? What number is still acceptable? Do I need to dump this dataset? This is human mRNA. I used HISAT2 with default settings.

53638643 + 0 in total (QC-passed reads + QC-failed reads)
27516323 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
50786919 + 0 mapped (94.68% : N/A)
26122320 + 0 paired in sequencing
13061160 + 0 read1
13061160 + 0 read2
21125408 + 0 properly paired (80.87% : N/A)
22335136 + 0 with itself and mate mapped
935460 + 0 singletons (3.58% : N/A)
213596 + 0 with mate mapped to a different chr
131241 + 0 with mate mapped to a different chr (mapQ>=5)

RNA-Seq • 3.6k views

ADD COMMENT • link updated 6.1 years ago by Biostar 20 • written 6.2 years ago by JJ ▴ 670

0

Entering edit mode

What is the reference where you mapped these RNASeq reads on?

ADD REPLY • link 6.2 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Current Gencode release - human primary assembly and corresponding gtf were used to build the index. I also mapped it using the standard index provided by HISAT2 - almost the same result.

ADD REPLY • link 6.2 years ago by JJ ▴ 670

1

Entering edit mode

Anyway, with the default --score-min, --mp and gap opening / extending penalties you allow for quite some mismatches / gaps in your read mapping. This means that you likely get secondary alignments on the second copies of those genes, or on similar regions. Try to increase the penalties or to lower the accepted alignment score. If only the number of reads with secondary alignments goes down, then you're good, if also the "mapped" goes down then you're being too strict. This might be a workaround to test this.

ADD REPLY • link 6.2 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

So I set the --score-min to --score-min L,0,-0.6 and I get now this:

44779604 + 0 in total (QC-passed reads + QC-failed reads)
18657284 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
44049017 + 0 mapped (98.37% : N/A)
26122320 + 0 paired in sequencing
13061160 + 0 read1
13061160 + 0 read2
23705620 + 0 properly paired (90.75% : N/A)
25214764 + 0 with itself and mate mapped
176969 + 0 singletons (0.68% : N/A)
274478 + 0 with mate mapped to a different chr
212262 + 0 with mate mapped to a different chr (mapQ>=5)

Better. Are there any guidelines for this parameter? Thanks a lot, Is there any documentation on this parameter apart from what is stated in the manual?

ADD REPLY • link 6.2 years ago by JJ ▴ 670

2

Entering edit mode

I think the documentation is made on discussions on this forum :) What I can only suggest you is to spend some time on this parameter every time you map reads. Multiply the read length by the third number (-0.6) and then subtract the second number (0): this way you get the maximum penalty. For reads of 125 nt, if I set it to L,2,-0.5 I will get -((125*0.5)+2) = -77. If my mismatch penalty is --mp 7,7 I will get at most 11 mismatches. If my gap penalty is --rdg 7,5 I will get a gap of most length 14 (77-7 opening, 70/5 for extension).

ADD REPLY • link 6.2 years ago by Matteo Schiavinato ★ 3.6k