I just used HISAT2 to analyze a human HCC RNA-seq dataset and compared its alignment summary with that of TopHat2. I found some interesting differences.
HISAT2, with nearly default parameters; the only exception was using the genome_snp_tran index provided on the HISAT2 website:
64562561 reads; of these:
  64562561 (100.00%) were paired; of these:
    6437600 (9.97%) aligned concordantly 0 times
    49287413 (76.34%) aligned concordantly exactly 1 time
    8837548 (13.69%) aligned concordantly >1 times
    ----
    6437600 pairs aligned 0 times concordantly or discordantly; of these:
      12875200 mates make up the pairs; of these:
        6859176 (53.27%) aligned 0 times
        4898558 (38.05%) aligned exactly 1 time
        1117466 (8.68%) aligned >1 times
94.69% overall alignment rate
TopHat2, also with nearly default parameters; the exceptions were --no-discordant and the use of Grch38.primary_assembly.genome.fa and gencode.v24.primary_assembly.annotation.gtf:
Left reads:
    Input     : 64562561
    Mapped    : 59801300 (92.6% of input)
    of these  :  2319944 ( 3.9%) have multiple alignments (7643 have >20)
Right reads:
    Input     : 64562561
    Mapped    : 59163077 (91.6% of input)
    of these  :  2298674 ( 3.9%) have multiple alignments (8178 have >20)
92.1% overall read mapping rate.
Aligned pairs : 55979777
    of these  :  1965301 ( 3.5%) have multiple alignments
                  223752 ( 0.4%) are discordant alignments
86.4% concordant pair alignment rate.
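It seems HISAT2 achieved a higher overall mapping rate (94.69% vs. 92.1%) and a higher concordant pair alignment rate, but a lower rate of uniquely aligned concordant pairs, because many more of its pairs aligned more than once (13.69% of input pairs for HISAT2 vs. 3.5% of aligned pairs for TopHat2).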
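The two tools report their percentages against different denominators (HISAT2 against input pairs, TopHat2 partly against aligned pairs), so a quick recalculation on a common basis may make the comparison easier to read. This is only a rough sketch using the counts copied from the summaries above. The variable names are mine, and the TopHat2 "unique pairs" figure is an approximation that assumes multi-mapping and uniquely mapping pairs can simply be subtracted from the aligned-pair total (it still includes the discordant pairs).

```python
# Counts copied from the two alignment summaries above.
TOTAL_PAIRS = 64_562_561

# HISAT2 summary
hisat2_conc_unique = 49_287_413      # aligned concordantly exactly 1 time
hisat2_conc_multi = 8_837_548        # aligned concordantly >1 times
hisat2_mates_unaligned = 6_859_176   # mates aligned 0 times

# TopHat2 summary
tophat2_aligned_pairs = 55_979_777
tophat2_multi_pairs = 1_965_301
tophat2_discordant_pairs = 223_752


def pct(numerator, denominator):
    """Return a percentage rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)


# All rates below are expressed relative to the number of input pairs,
# except the HISAT2 overall rate, which is per mate (2 mates per pair).
hisat2 = {
    "concordant": pct(hisat2_conc_unique + hisat2_conc_multi, TOTAL_PAIRS),
    "unique_concordant": pct(hisat2_conc_unique, TOTAL_PAIRS),
    "multi_concordant": pct(hisat2_conc_multi, TOTAL_PAIRS),
    "overall": pct(2 * TOTAL_PAIRS - hisat2_mates_unaligned, 2 * TOTAL_PAIRS),
}

tophat2 = {
    "concordant": pct(tophat2_aligned_pairs - tophat2_discordant_pairs, TOTAL_PAIRS),
    # Assumption: unique pairs = aligned pairs minus multi-mapping pairs.
    "unique_pairs_approx": pct(tophat2_aligned_pairs - tophat2_multi_pairs, TOTAL_PAIRS),
    "multi_pairs": pct(tophat2_multi_pairs, TOTAL_PAIRS),
}

for name, rates in (("HISAT2", hisat2), ("TopHat2", tophat2)):
    print(name)
    for label, value in rates.items():
        print(f"  {label}: {value}%")
```

On this common basis the multi-mapping share is still much higher for HISAT2 than for TopHat2, so the difference is not just an artifact of the different denominators.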
My questions are:
- Is it important or necessary to discard discordant pair alignments for paired-end (PE) data?
- How can HISAT2's higher multiple-alignment rate be explained? Is it because TopHat2 maps reads to the transcriptome before the genome?
- Will the high multiple-alignment rate affect the accuracy of abundance estimates for transcripts and genes?