Question: The comparison between HISAT2 and Tophat2
Peng Huang • 50 wrote:
Hi, everyone!
I just used HISAT2 to analyze a human HCC RNA-seq dataset, compared its alignment summary with that of Tophat2, and found some interesting differences:
HISAT2 with almost default parameters except --no-discordant, using the genome_snp_tran index provided on the HISAT2 website:
64562561 reads; of these:
64562561 (100.00%) were paired; of these:
6437600 (9.97%) aligned concordantly 0 times
49287413 (76.34%) aligned concordantly exactly 1 time
8837548 (13.69%) aligned concordantly >1 times
----
6437600 pairs aligned 0 times concordantly or discordantly; of these:
12875200 mates make up the pairs; of these:
6859176 (53.27%) aligned 0 times
4898558 (38.05%) aligned exactly 1 time
1117466 (8.68%) aligned >1 times
94.69% overall alignment rate
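As a quick sanity check on how the HISAT2 summary above fits together, here is a small sketch that recomputes the headline percentages from the raw counts. The variable names are mine, and the formula for the overall rate (one minus the fraction of mates that never aligned) is my reading of the summary rather than something taken from the HISAT2 source.

```python
# Counts copied from the HISAT2 summary above.
total_pairs = 64_562_561
concordant_unique = 49_287_413   # pairs aligned concordantly exactly 1 time
concordant_multi = 8_837_548     # pairs aligned concordantly >1 times
mates_unaligned = 6_859_176      # individual mates (not pairs) aligned 0 times

# Each pair contributes two mates.
total_mates = 2 * total_pairs

# Fraction of input pairs with at least one concordant alignment.
concordant_rate = (concordant_unique + concordant_multi) / total_pairs

# Assumed definition: every mate that aligned in any way counts as aligned.
overall_rate = 1 - mates_unaligned / total_mates

print(f"concordant pair rate:   {concordant_rate:.2%}")
print(f"overall alignment rate: {overall_rate:.2%}")
```

Note that the overall rate counts mates that aligned on their own, which is why it is higher than the concordant pair rate.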
Tophat2 with almost default parameters, again except --no-discordant, using Grch38.primary_assembly.genome.fa and gencode.v24.primary_assembly.annotation.gtf:
Left reads:
Input : 64562561
Mapped : 59801300 (92.6% of input)
of these: 2319944 ( 3.9%) have multiple alignments (7643 have >20)
Right reads:
Input : 64562561
Mapped : 59163077 (91.6% of input)
of these: 2298674 ( 3.9%) have multiple alignments (8178 have >20)
92.1% overall read mapping rate.
Aligned pairs: 55979777
of these: 1965301 ( 3.5%) have multiple alignments
223752 ( 0.4%) are discordant alignments
86.4% concordant pair alignment rate.
It seems HISAT2 achieved a higher overall mapping rate and a higher concordant pair alignment rate, but a lower unique concordant pair alignment rate.
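To put the two summaries on the same footing, the sketch below recomputes the pair-level rates from the counts posted above. One caveat, which is my assumption from reading the two reports: HISAT2 gives its percentages relative to all input pairs, while the Tophat2 "3.5% have multiple alignments" is relative to aligned pairs, so the printed percentages are not directly comparable. The variable names are mine.

```python
total_pairs = 64_562_561

# HISAT2 summary (counts of pairs)
hisat2_concordant_unique = 49_287_413
hisat2_concordant_multi = 8_837_548

# Tophat2 summary (counts of pairs)
tophat2_aligned_pairs = 55_979_777
tophat2_multi_pairs = 1_965_301
tophat2_discordant_pairs = 223_752

hisat2_concordant_pairs = hisat2_concordant_unique + hisat2_concordant_multi

# Concordant pair rate, relative to all input pairs.
hisat2_concordant = hisat2_concordant_pairs / total_pairs
tophat2_concordant = (tophat2_aligned_pairs - tophat2_discordant_pairs) / total_pairs

# Share of multi-mapping pairs, relative to the pairs that aligned.
hisat2_multi_share = hisat2_multi_pairs_share = hisat2_concordant_multi / hisat2_concordant_pairs
tophat2_multi_share = tophat2_multi_pairs / tophat2_aligned_pairs

print(f"HISAT2  concordant: {hisat2_concordant:.1%}, multi among aligned: {hisat2_multi_share:.1%}")
print(f"Tophat2 concordant: {tophat2_concordant:.1%}, multi among aligned: {tophat2_multi_share:.1%}")
```

The Tophat2 multi-mapping count may include discordant pairs, so the last ratio is only a rough comparison with the HISAT2 concordant figures.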
And my questions are:
- Is it important or necessary to discard discordant pair alignments for paired-end (PE) data?
- How can the higher multiple-alignment rate be explained? Is it because Tophat2 maps reads to the transcriptome before the genome?
- Will the high multiple-alignment rate affect the accuracy of abundance estimation for transcripts and genes?
modified 2.4 years ago by _r_am ♦ 32k • written 4.9 years ago by Peng Huang • 50
When HISAT was released, there were concerns that it was mapping the majority of the reads to pseudogenes (given that pseudogenes show very little or no expression). I am not sure what the status is now; maybe you can check for yourself.
I was also doing a similar comparison using the same annotation files, etc. to keep consistency. In 6 samples the average hisat2 mapping rate was 90.55%, while the tophat2 average rate was 90.9% - a slight difference. However, continuing throughout the pipeline to HTSeq-count, the count files are a little different - Here is a comparison of two of them from the same sample:
Hisat2:
Tophat2:
Furthermore, when looking at the DESeq results from the two methods (ordered by padj), the log2 fold changes appear to be roughly the same, although the p-value and padj values from the hisat2 run are much lower than those from the tophat2 run. Spearman's correlation of the ranks of the first 300 genes: hisat2 vs. tophat2 = .7870; tophat2 vs. hisat2 = .9036. Here are the headers from the DESeq runs:
Hisat2:
Tophat2:
So I too would like explanations on why using these two different methods produced somewhat different results.
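For anyone wanting to reproduce the rank comparison mentioned above, here is a minimal sketch of a "top N" Spearman correlation in plain Python. It assumes the genes were ranked by padj and that the "first 300" were taken from whichever result is named first, which would explain why the two directions give different values. The gene names and padj values are made up for illustration, and `top_n_spearman` is a hypothetical helper, not part of DESeq.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def top_n_spearman(reference, other, n):
    """Pick the n genes with the lowest padj in `reference`, then correlate
    their padj ranks between `reference` and `other`."""
    top = sorted(reference, key=reference.get)[:n]
    shared = [g for g in top if g in other]
    return spearman([reference[g] for g in shared], [other[g] for g in shared])

# Hypothetical padj values, for illustration only.
hisat2_padj = {"geneA": 1e-9, "geneB": 1e-7, "geneC": 1e-5, "geneD": 1e-3, "geneE": 0.2}
tophat2_padj = {"geneA": 1e-6, "geneB": 1e-2, "geneC": 1e-4, "geneD": 0.5, "geneE": 1e-3}

rho_hisat2_first = top_n_spearman(hisat2_padj, tophat2_padj, 3)
rho_tophat2_first = top_n_spearman(tophat2_padj, hisat2_padj, 3)
print(rho_hisat2_first, rho_tophat2_first)
```

With this toy data the two directions disagree because each one selects a different set of top genes, so an asymmetric pair of values is not surprising by itself.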
Hi,
What was the length of your read pairs? Did you read and quality-trim them before alignment?
I am testing HISAT2 with SRA files: SRR1303996 and SRR1303997. Prior to alignment, I have used Trim_Galore! to trim adapter and low-quality bases.
HISAT returns the following stats:
I am worried about the relatively low "(60.53%) aligned concordantly exactly 1 time" compared to your 76.34%.
In comparison, tophat2 returned:
My question is whether HISAT2 should be given raw reads or trimmed reads? HISAT does have a soft-clipping penalty setting which may encourage the use of raw reads (?)
Best wishes, Kevin
This is a good question that I would also like to know the answer to. I don't think I have ever come across anything about trimming reads before mapping them, though.
Well, trimming is certainly a bigger deal for DNA-sequencing and variant calling, but there are some questions out there for RNA-seq:
STAR seems to happily align uniquely 83.37% of the trimmed reads (TrimGalore).
I am interested in HISAT2 now (for alignment "against a population", including genetic variation), but I am not sure whether trimming reads before alignment is desirable or not, considering the stats that I posted above.
I have to admit that I am drifting away from the original question here.