7.1 years ago
Peng Huang ▴ 50
I just used HISAT2 to analyze a human HCC RNA-seq dataset and compared its alignment summary with that of TopHat2. I found some interesting differences:
HISAT2 with near-default parameters, using the genome_snp_tran index provided on the HISAT2 website:
64562561 reads; of these:
  64562561 (100.00%) were paired; of these:
    6437600 (9.97%) aligned concordantly 0 times
    49287413 (76.34%) aligned concordantly exactly 1 time
    8837548 (13.69%) aligned concordantly >1 times
    ----
    6437600 pairs aligned 0 times concordantly or discordantly; of these:
      12875200 mates make up the pairs; of these:
        6859176 (53.27%) aligned 0 times
        4898558 (38.05%) aligned exactly 1 time
        1117466 (8.68%) aligned >1 times
94.69% overall alignment rate
TopHat2 with near-default parameters plus --no-discordant, using Grch38.primary_assembly.genome.fa and gencode.v24.primary_assembly.annotation.gtf:
Left reads:
  Input    : 64562561
  Mapped   : 59801300 (92.6% of input)
  of these: 2319944 ( 3.9%) have multiple alignments (7643 have >20)
Right reads:
  Input    : 64562561
  Mapped   : 59163077 (91.6% of input)
  of these: 2298674 ( 3.9%) have multiple alignments (8178 have >20)
92.1% overall read mapping rate.
Aligned pairs: 55979777
  of these: 1965301 ( 3.5%) have multiple alignments
             223752 ( 0.4%) are discordant alignments
86.4% concordant pair alignment rate.
It seems HISAT2 achieved a higher overall mapping rate and a higher concordant pair alignment rate, but a lower unique concordant pair alignment rate.
My questions are:
- Is it important or necessary to discard discordant pair alignments for paired-end data?
- How can the higher multiple-alignment rate be explained? Is it because TopHat2 maps reads to the transcriptome before the genome?
- Will the high multiple-alignment rate affect the accuracy of abundance estimates for transcripts and genes?
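The two aligners report their rates in different ways, so it can help to recompute comparable percentages from the raw counts. The following is a minimal Python sketch that uses only the numbers quoted in the two summaries above. The variable names and the `pct` helper are illustrative and not part of either tool. It assumes that TopHat2's "concordant pair alignment rate" is aligned pairs minus discordant pairs, divided by input pairs, which is consistent with the reported 86.4%.

```python
# Counts copied from the two alignment summaries quoted in this post.
TOTAL_PAIRS = 64_562_561

# HISAT2 summary
hisat2_conc_unique = 49_287_413      # aligned concordantly exactly 1 time
hisat2_conc_multi = 8_837_548        # aligned concordantly >1 times
hisat2_mates_unaligned = 6_859_176   # mates that aligned 0 times

# TopHat2 summary
tophat2_aligned_pairs = 55_979_777
tophat2_multi_pairs = 1_965_301
tophat2_discordant_pairs = 223_752


def pct(numerator, denominator):
    """Percentage rounded to two decimals (illustrative helper)."""
    return round(100.0 * numerator / denominator, 2)


# HISAT2: concordant pairs (unique + multi) over all input pairs.
hisat2_concordant = pct(hisat2_conc_unique + hisat2_conc_multi, TOTAL_PAIRS)
# HISAT2: overall rate is computed per mate, not per pair.
hisat2_overall = pct(2 * TOTAL_PAIRS - hisat2_mates_unaligned, 2 * TOTAL_PAIRS)
# Share of concordant pairs that are multi-mapped.
hisat2_multi_share = pct(hisat2_conc_multi,
                         hisat2_conc_unique + hisat2_conc_multi)

# TopHat2: assumed definition of the concordant pair alignment rate.
tophat2_concordant = pct(tophat2_aligned_pairs - tophat2_discordant_pairs,
                         TOTAL_PAIRS)
# Share of aligned pairs that are multi-mapped.
tophat2_multi_share = pct(tophat2_multi_pairs, tophat2_aligned_pairs)

print("HISAT2  concordant %:", hisat2_concordant)
print("HISAT2  overall %:", hisat2_overall)
print("HISAT2  multi-mapped share %:", hisat2_multi_share)
print("TopHat2 concordant %:", tophat2_concordant)
print("TopHat2 multi-mapped share %:", tophat2_multi_share)
```

On these counts, roughly 15% of the HISAT2 concordant pairs are multi-mapped, against about 3.5% of the TopHat2 aligned pairs, which is the gap the second question asks about.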
When HISAT was released, there were concerns that it was mapping the majority of reads to pseudogenes (given that pseudogenes show very little or no expression). I am not sure what the status is now; maybe you can check for yourself.
I was also doing a similar comparison, using the same annotation files etc. for consistency. Across 6 samples, the average HISAT2 mapping rate was 90.55%, while the average TopHat2 rate was 90.9%, a slight difference. However, further down the pipeline at HTSeq-count, the count files are a little different. Here is a comparison of two of them from the same sample:
Furthermore, when looking at the DESeq results from the two methods (ordered by padj), the log2 fold changes appear to be roughly the same, although the p-values and padj values from the HISAT2 run are much lower than those from the TopHat2 run. Spearman's correlation of the ranks of the first 300 genes: hisat2 vs. tophat2 = .7870; tophat2 vs. hisat2 = .9036. Here are the headers from the DESeq run:
So I too would like an explanation of why these two methods produced somewhat different results.
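The post above gives two different Spearman values for the same pair of methods. One possible reading, which is an assumption and not something the post confirms, is that the "first 300 genes" were chosen from one method's padj ranking each time, so the two comparisons cover different gene sets. The minimal Python sketch below shows how that makes the result asymmetric. The function names and the toy padj values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def top_n_spearman(reference, other, n):
    """Correlate padj over the n genes with the smallest padj in `reference`."""
    top = sorted(reference, key=reference.get)[:n]
    shared = [g for g in top if g in other]
    return spearman([reference[g] for g in shared],
                    [other[g] for g in shared])


# Hypothetical padj values per gene, for illustration only.
hisat2_padj = {"A": 0.001, "B": 0.002, "C": 0.003, "D": 0.5, "E": 0.6}
tophat2_padj = {"A": 0.01, "B": 0.03, "C": 0.02, "D": 0.004, "E": 0.9}

rho_hisat2_ref = top_n_spearman(hisat2_padj, tophat2_padj, 3)
rho_tophat2_ref = top_n_spearman(tophat2_padj, hisat2_padj, 3)
print(rho_hisat2_ref, rho_tophat2_ref)
```

With the toy data, the top three genes are A, B, C when HISAT2 is the reference and D, A, C when TopHat2 is, so the two correlations differ even though the underlying tables are the same. Computing the correlation over all genes shared by both result tables would give a single symmetric value.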
What was the length of your read pairs? Did you read and quality-trim them before alignment?
I am testing HISAT2 with SRA files SRR1303996 and SRR1303997. Prior to alignment, I used Trim_Galore! to trim adapters and low-quality bases.
HISAT returns the following stats:
I am worried about the relatively low "(60.53%) aligned concordantly exactly 1 time", compared to your 76.34%.
In comparison, tophat2 returned me :
My question is whether HISAT2 should be given raw reads or trimmed reads? HISAT does have a soft-clipping penalty setting which may encourage the use of raw reads (?)
Best wishes, Kevin
This is a good question, and I would also like to know the answer. I don't think I have ever come across anything about trimming reads before mapping them, though.
Well, trimming is certainly a bigger deal for DNA-sequencing and variant calling, but there are some questions out there for RNA-seq:
STAR seems to happily align 83.37% of the trimmed reads (TrimGalore) uniquely.
I am interested in HISAT2 now (for alignment "against a population", including genetic variation), but I am not sure whether trimming reads before alignment is desirable or not, considering the stats that I posted above.
I have to admit that I am drifting away from the original question here.