Question

Confusing RNA-seq Alignment Stats (HISAT2 & Qualimap)

4.5 years ago

JJ ▴ 680

I am confused about the alignment stats I am getting and I really hope someone can explain them to me!

I ran HISAT2 with default parameters against the available grch38_tra index. The summary HISAT2 reports looks fine to me. Below is an example with an overall alignment rate of ~82%:

    5389593 (28.98%) aligned concordantly 0 times
    11974983 (64.39%) aligned concordantly exactly 1 time
    1233844 (6.63%) aligned concordantly >1 times
    ----
    5389593 pairs aligned concordantly 0 times; of these:
      1021332 (18.95%) aligned discordantly 1 time
    ----
    4368261 pairs aligned 0 times concordantly or discordantly; of these:
      8736522 mates make up the pairs; of these:
        6676246 (76.42%) aligned 0 times
        1714031 (19.62%) aligned exactly 1 time
        346245 (3.96%) aligned >1 times
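The overall alignment rate can be recomputed from this summary as a sanity check. This is a minimal sketch that uses only the counts printed above; the variable names are mine, and it assumes the overall rate is the fraction of mates that aligned at least once.

```python
# Counts copied from the HISAT2 summary above.
concordant_0 = 5_389_593       # pairs aligned concordantly 0 times
concordant_1 = 11_974_983      # pairs aligned concordantly exactly 1 time
concordant_multi = 1_233_844   # pairs aligned concordantly >1 times
mates_unaligned = 6_676_246    # single mates that aligned 0 times

total_pairs = concordant_0 + concordant_1 + concordant_multi
total_mates = 2 * total_pairs

# Assumption: overall rate = share of mates with at least one alignment.
overall_rate = 1 - mates_unaligned / total_mates

print(f"total pairs: {total_pairs:,}")
print(f"overall alignment rate: {overall_rate:.2%}")
```

For these counts the rate comes out at about 82%.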

This makes sense to me, but the Qualimap results confuse me:

    Number of mapped reads (left/right): 15,693,967 / 14,826,627
    Number of aligned pairs (without duplicates): 13,208,827
    Total number of alignments: 42,737,940
    Number of secondary alignments: 12,217,346
    Number of non-unique alignments: 15,018,799
    Aligned to genes: 10,778,652
    Ambiguous alignments: 1,313,140
    No feature assigned: 15,611,447
    Missing chromosome in annotation: 15,902
    Not aligned: 6,676,246
    Strand specificity estimation (fwd/rev): 0.03 / 0.97

What really threw me was the Total number of alignments: 42,737,940.

The mapped reads add up to 15,693,967 + 14,826,627 = 30,520,594 reads.

This matches the HISAT2 results: 11974983*2 + 1233844*2 + 1021332*2 + 1714031 + 346245 = 30,520,594 reads.

42,737,940 - 30,520,594 = 12,217,346 secondary alignments. This seems like a lot, and now I am worried something has gone wrong...
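The reconciliation above can be checked in a few lines. This is a minimal sketch using only the numbers quoted in this post; the variable names are mine.

```python
# HISAT2 summary counts, copied from the post (pairs unless noted).
concordant_1 = 11_974_983
concordant_multi = 1_233_844
discordant_1 = 1_021_332
mate_1 = 1_714_031       # single mates aligned exactly 1 time
mate_multi = 346_245     # single mates aligned >1 times

# Qualimap counts, copied from the post.
mapped_left, mapped_right = 15_693_967, 14_826_627
total_alignments = 42_737_940
secondary_alignments = 12_217_346

# Each aligned pair contributes two mapped mates.
mapped_mates_hisat2 = (
    2 * (concordant_1 + concordant_multi + discordant_1) + mate_1 + mate_multi
)
mapped_mates_qualimap = mapped_left + mapped_right
primary_alignments = total_alignments - secondary_alignments

print(f"{mapped_mates_hisat2:,} {mapped_mates_qualimap:,} {primary_alignments:,}")
```

All three values agree at 30,520,594, so the total is one primary alignment per mapped mate plus the secondary alignments.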

But HISAT2 says 1233844 (6.63%) aligned concordantly >1 times and 346245 (3.96%) aligned >1 times - this doesn't seem so bad.

How do these numbers fit together? Does this mean that a small number of reads map very often? As far as I know, HISAT2 allows a maximum of k=5 distinct alignments in default mode. Does it mean that most of the 1233844*2 + 346245 mates map around 5 times (and possibly more often if I had allowed a higher k)?
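One way to look at this directly is to tabulate how many alignments each multi-mapped read reports. The sketch below is only an illustration: it assumes SAM-formatted text lines (for example from `samtools view`) and that the aligner writes an `NH:i` tag with the number of reported alignments. The function name `nh_distribution` and the toy records are made up for this example.

```python
from collections import Counter

def nh_distribution(sam_lines):
    """Return (Counter of NH values over primary mapped records, secondary count)."""
    nh_counts = Counter()
    secondary = 0
    for line in sam_lines:
        if line.startswith("@"):           # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:                     # read unmapped
            continue
        if flag & 0x100:                   # secondary alignment
            secondary += 1
            continue
        if flag & 0x800:                   # supplementary alignment
            continue
        # Optional tags start at column 12; NH may be absent (recorded as None).
        nh = next((int(f[5:]) for f in fields[11:] if f.startswith("NH:i:")), None)
        nh_counts[nh] += 1
    return nh_counts, secondary

# Toy records, not taken from the data discussed in this thread.
toy = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t60\t100M\t*\t0\t0\t*\t*\tNH:i:1",
    "r2\t0\tchr1\t500\t1\t100M\t*\t0\t0\t*\t*\tNH:i:3",
    "r2\t256\tchr2\t700\t1\t100M\t*\t0\t0\t*\t*\tNH:i:3",
    "r2\t256\tchr3\t900\t1\t100M\t*\t0\t0\t*\t*\tNH:i:3",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
nh_counts, secondary = nh_distribution(toy)
print(dict(nh_counts), secondary)
```

If most multi-mapped reads in a real file sit at the highest NH value, that would suggest they are hitting the k limit.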

Is this how the Number of secondary alignments and the Number of non-unique alignments relate to each other? The Number of non-unique alignments would then be the secondary alignments plus the multi-mappers set as primary: 1233844*2 + 346245 + 12,217,346 = 15,031,279, which is close to the reported Number of non-unique alignments: 15,018,799.
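The same estimate as a minimal sketch, again using only the counts quoted in this post. It rests on my assumption that the multi-mapped primaries are exactly the concordant >1 pairs plus the single mates aligned >1 times, which ignores any discordant pairs that map more than once.

```python
concordant_multi_pairs = 1_233_844   # pairs aligned concordantly >1 times
multi_mates = 346_245                # single mates aligned >1 times
secondary_alignments = 12_217_346    # from Qualimap
qualimap_non_unique = 15_018_799     # from Qualimap

# Assumed primary alignments belonging to multi-mapped mates.
multi_primary = 2 * concordant_multi_pairs + multi_mates

estimate = multi_primary + secondary_alignments
gap = estimate - qualimap_non_unique
secondary_per_multi = secondary_alignments / multi_primary

print(f"estimated non-unique alignments: {estimate:,}")
print(f"difference from Qualimap: {gap:,}")
print(f"secondary alignments per multi-mapped mate: {secondary_per_multi:.2f}")
```

The estimate differs from the Qualimap figure by 12,480 alignments, under 0.1%, and averages a little over four secondary alignments per multi-mapped mate, so this is a rough accounting only.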

Is this something to worry about? I see this with most of my samples. What cutoff/threshold for multi-mapping do you use as a quality control for your samples? Thanks for your input!

RNA-seq • 2.0k views

ADD COMMENT • link updated 4.5 years ago by Jianyu ▴ 580 • written 4.5 years ago by JJ ▴ 680

An explanation of the HISAT2 stats can be found here:

A: Evaluation of HISAT2 Alignment Result

ADD REPLY • link 4.5 years ago by lakhujanivijay 5.8k

Thanks for the link. I understand the HISAT2 results; my question was more about the Qualimap results and whether the Total number of alignments / Number of secondary alignments is too high. Having said that, I also get similar results for human ENCODE samples.

ADD REPLY • link 4.5 years ago by JJ ▴ 680

How long are your reads?

ADD REPLY • link 4.5 years ago by shunyip ▴ 250

The reads are 100 bp long and paired-end.

ADD REPLY • link 4.5 years ago by JJ ▴ 680

score 2 · Answer 1 · 2019-10-24

I don't see any big problem with your mapping result. It is quite normal for a fraction of reads to map to multiple places in the genome.

A cutoff/threshold is very hard to determine (for me) because it largely depends on the type of experiment you have done. For example, if some genes are located in regions with very low sequence complexity, reads from them will easily map to multiple regions. Short reads also map to multiple places more easily, which I think is why @shunyip asked about the length of your reads. Moreover, the genome, mutations, and cell line all have a great impact on what kind of reads you get.

So I don't think you need to worry about your mapping results. It may be better to focus on the other QC results from Qualimap.