Question

The comparison between HISAT2 and Tophat2

5

Entering edit mode

8.2 years ago

Peng Huang ▴ 50

Hi, everyone!

I just used HISAT to analyze a human HCC RNA-seq dataset, and I compared those alignment summaries with those of Tophat2, and found some interesting difference:

HISAT2 with almost default parameters except --no-discordant, using genome_snp_tran index provided in HISAT2 website

64562561 reads; of these:
  64562561 (100.00%) were paired; of these:
    6437600 (9.97%) aligned concordantly 0 times
    49287413 (76.34%) aligned concordantly exactly 1 time
    8837548 (13.69%) aligned concordantly >1 times
    ----
    6437600 pairs aligned 0 times concordantly or discordantly; of these:
      12875200 mates make up the pairs; of these:
        6859176 (53.27%) aligned 0 times
        4898558 (38.05%) aligned exactly 1 time
        1117466 (8.68%) aligned >1 times
94.69% overall alignment rate

Tophat2 with almost default parameters except also --no-discordant, using Grch38.primary_assembly.genome.fa and gencode.v24.primary_assembly.annotation.gtf.

Left reads:
          Input     :  64562561
           Mapped   :  59801300 (92.6% of input)
            of these:   2319944 ( 3.9%) have multiple alignments (7643 have >20)
Right reads:
          Input     :  64562561
           Mapped   :  59163077 (91.6% of input)
            of these:   2298674 ( 3.9%) have multiple alignments (8178 have >20)
92.1% overall read mapping rate.

Aligned pairs:  55979777
     of these:   1965301 ( 3.5%) have multiple alignments
                  223752 ( 0.4%) are discordant alignments
86.4% concordant pair alignment rate.

It seems HISAT2 got higher overall mapping rate and concordant pair alignment rate, but with lower unique concordant pair alignment rate.

And my questions are:

Is it important or necessary to discard discordant pair alignment for PE?
And how to explain the higher multiple alignments rate? because Tophat2 mapping reads to transcriptome before genome?
Will the high multiple alignment rate affect the accuracy of abundance estimation of transcripts and genes?

RNA-Seq Tophat2 HISAT2 • 14k views

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 8.2 years ago by Peng Huang ▴ 50

2

Entering edit mode

When Hisat was released there were concerns that HiSat was mapping majority of the reads to pseudogenes (given that pseudogenes show very little/no expression). I am not sure whats the status now, maybe you can check for youself.

STAR versus HiSAT: many reads mapping to pseudogenes with HiSAT. Does anybody else see this? @nextgenseek pic.twitter.com/AfOepI2du0
— Nicolas Robine (@notSoJunkDNA) March 17, 2015

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 8.2 years ago by poisonAlien ★ 3.2k

1

Entering edit mode

I was also doing a similar comparison using the same annotation files, etc. to keep consistency. In 6 samples the average hisat2 mapping rate was 90.55%, while the tophat2 average rate was 90.9% - a slight difference. However, continuing throughout the pipeline to HTSeq-count, the count files are a little different - Here is a comparison of two of them from the same sample:

Hisat2:

__no_feature    9254816
__ambiguous 11767193
__too_low_aQual 2972408
__not_aligned   3125815
__alignment_not_unique  2374833

Tophat2:

__no_feature    9979285
__ambiguous 13367773
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  16843889

Furthermore, when looking at the results of DESeq between the two methods (organized by padj), the log2 fold changes appear to be relatively the same, although the p-value and padj values for hisat2 DESeq are much lower than the tophat2 DESeq values. Spearman's correlation of the ranks of the first 300 genes hisat2 vs. tophat2 = .7870; tophat2 vs. hisat2 = .9036 Here the headers from DESeq run:

Hisat2:

          baseMean  log2FoldChange  lfcSE      stat      pvalue      padj
HES5    3003.654134 2.067316051 0.189525161 10.90787124 1.06E-27    2.26E-23
HES5.1  1737.351871 2.593695374 0.242507668 10.69531284 1.07E-26    1.14E-22
ESR10   467.6392656 1.836371222 0.232848016 7.886565886 3.11E-15    2.21E-11
MMRN2   129.2806191 2.026534357 0.262351706 7.724494668 1.12E-14    5.99E-11
UNNAMED 3631.038066 1.770765181 0.233826029 7.573002851 3.65E-14    1.56E-10

Tophat2:

          baseMean  log2FoldChange  lfcSE      stat      pvalue      padj
HES5.1  1829.39526  2.623640327 0.23401902  11.21122687 3.59E-29    6.17E-25
HES5    3212.256203 2.071498117 0.193685757 10.69514946 1.07E-26    9.20E-23
ESR10   411.4004291 1.713855178 0.212834406 8.052528754 8.11E-16    4.64E-12
MMRN2   132.5164544 2.024153506 0.26188893  7.729053336 1.08E-14    3.72E-11
UNNAMED 3769.184504 1.810110457 0.234047101 7.733958037 1.04E-14    3.72E-11

So I too would like explanations on why using these two different methods produced somewhat different results.

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 7.8 years ago by rrcutler ▴ 120

0

Entering edit mode

Hi,

What was the length of your read pairs? Did you read and quality-trim them before alignment?

I am testing HISAT2 with SRA files: SRR1303996 and SRR1303997. Prior to alignment, I have used Trim_Galore! to trim adapter and low-quality bases.

HISAT returns the following stats:

10116026 reads; of these:
  10116026 (100.00%) were paired; of these:
    2370051 (23.43%) aligned concordantly 0 times
    6122963 (60.53%) aligned concordantly exactly 1 time
    1623012 (16.04%) aligned concordantly >1 times
    ----
    2370051 pairs aligned concordantly 0 times; of these:
      366430 (15.46%) aligned discordantly 1 time
    ----
    2003621 pairs aligned 0 times concordantly or discordantly; of these:
      4007242 mates make up the pairs; of these:
        1824624 (45.53%) aligned 0 times
        1351025 (33.71%) aligned exactly 1 time
        831593 (20.75%) aligned >1 times
90.98% overall alignment rate

I am worried about the relatively lower "(60.53%) aligned concordantly exactly 1 time." compared to you 76.34%.

In comparison, tophat2 returned me :

Left reads:
          Input     :   9624707
           Mapped   :   5732691 (59.6% of input)
            of these:    117271 ( 2.0%) have multiple alignments (713 have >20)
Right reads:
          Input     :   9624707
           Mapped   :   5732691 (59.6% of input)
            of these:    117271 ( 2.0%) have multiple alignments (713 have >20)
59.6% overall read mapping rate.

Aligned pairs:   5732691
     of these:    117271 ( 2.0%) have multiple alignments
                   22586 ( 0.4%) are discordant alignments
59.3% concordant pair alignment rate.

My question is whether HISAT2 should be given raw reads or trimmed reads? HISAT does have a soft-clipping penalty setting which may encourage the use of raw reads (?)

Best wishes, Kevin

ADD REPLY • link 7.8 years ago by k.rue-albrecht ▴ 30

0

Entering edit mode

This is a good question that I would also like to know. I don't think I have ever come across anything about trimming reads before mapping them though.

ADD REPLY • link 7.8 years ago by rrcutler ▴ 120

0

Entering edit mode

Well, trimming is certainly a bigger deal for DNA-sequencing and variant calling, but there are some questions out there for RNA-seq:

STAR seems to happily align uniquely 83.37% of the trimmed reads (TrimGalore).

I am interested in HISAT2 now (for alignment "against a population", including genetic variation), but I am not sure whether trimming reads before alignment is desirable or not, considering the stats that I posted above.

I have to admit that I am drifting away from the original question here.

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 7.8 years ago by k.rue-albrecht ▴ 30