I am analyzing a transcriptomic dataset comprising 66 RNA-seq samples (paired-end, 150 bp, average depth 60M reads). After adapter removal with Cutadapt and trimming of low-quality reads with Trimmomatic, I aligned the reads with the newest version of HISAT2 using default options. Here is an example command:
hisat2 -p 2 --dta -x /data/huangp/HCC_RNA-seq/genome_snp_tran -q -1 data/huangp/HCC/WGC066460R_paired_1.fastq -2 /data/huangp/HCC/WGC066460R_paired_2.fastq -S data/huangp/HCC/WGC066460R_discordant_enable.sam >WGC066460R_summary_discordant_enable.txt &
The alignment summaries show that all 66 samples had a very high overall alignment rate (average > 97.4%), but 20 of them had a lower concordantly aligned pair rate than the other 46 samples (average 71.78% vs. 89.43%). The summaries for one sample from each group are shown below.
[huangp@localhost HCC]$ tail -n 15 ./WGC066520R_summary_discordant_enable.txt
74127014 reads; of these:
  74127014 (100.00%) were paired; of these:
    21829528 (29.45%) aligned concordantly 0 times
    37429446 (50.49%) aligned concordantly exactly 1 time
    14868040 (20.06%) aligned concordantly >1 times
    ----
    21829528 pairs aligned concordantly 0 times; of these:
      16038791 (73.47%) aligned discordantly 1 time
    ----
    5790737 pairs aligned 0 times concordantly or discordantly; of these:
      11581474 mates make up the pairs; of these:
        2944609 (25.43%) aligned 0 times
        3136582 (27.08%) aligned exactly 1 time
        5500283 (47.49%) aligned >1 times
98.01% overall alignment rate
[huangp@localhost HCC]$ tail -n 15 ./WGC066460R_summary_discordant_enable.txt
71399828 reads; of these:
  71399828 (100.00%) were paired; of these:
    6859529 (9.61%) aligned concordantly 0 times
    51649822 (72.34%) aligned concordantly exactly 1 time
    12890477 (18.05%) aligned concordantly >1 times
    ----
    6859529 pairs aligned concordantly 0 times; of these:
      3168251 (46.19%) aligned discordantly 1 time
    ----
    3691278 pairs aligned 0 times concordantly or discordantly; of these:
      7382556 mates make up the pairs; of these:
        4285338 (58.05%) aligned 0 times
        1810517 (24.52%) aligned exactly 1 time
        1286701 (17.43%) aligned >1 times
97.00% overall alignment rate
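To compare the two groups across all 66 samples without reading each summary by eye, the concordant pair rate can be pulled out of the summary files with a few lines of Python. This is only a sketch: the function name `concordant_rate` is made up for illustration, and it assumes every file contains a paired-end summary worded like the two shown above.

```python
import re
import sys


def concordant_rate(summary_text):
    """Return the fraction of pairs that aligned concordantly at least once."""
    paired = re.search(r"(\d+) \([\d.]+%\) were paired", summary_text)
    once = re.search(
        r"(\d+) \([\d.]+%\) aligned concordantly exactly 1 time", summary_text
    )
    multi = re.search(
        r"(\d+) \([\d.]+%\) aligned concordantly >1 times", summary_text
    )
    if not (paired and once and multi):
        raise ValueError("text does not look like a paired-end alignment summary")
    concordant = int(once.group(1)) + int(multi.group(1))
    return concordant / int(paired.group(1))


if __name__ == "__main__":
    # Summary file paths are given on the command line (hypothetical usage).
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, f"{100 * concordant_rate(handle.read()):.2f}%", sep="\t")
```

It could be run as, for example, `python concordant_rate.py *_summary_discordant_enable.txt` (the script name is hypothetical), and the resulting table sorted to see whether the 20 low samples separate cleanly from the rest.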
The sequencing company told me that the libraries for these 20 samples were constructed at the same time but sequenced on different flow cells of the same machine (the sequencing platform was Illumina X10).
At first, I thought the maximum fragment length constraint (-X/--maxins) might be too strict, so I raised it from the default 500 bp to 800 bp and even 1000 bp, but very few additional pairs aligned concordantly. Now I wonder whether the fragment length calculation in HISAT2 includes the intron length for exon-spanning reads, since many introns can be very long. But if that were the cause, why would the same options give a high concordantly aligned pair rate for the other 46 samples?
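One way to see why the pairs fail the concordance test is to tally the non-concordant pairs in the SAM file by mate placement. The sketch below uses only fields defined in the SAM specification (FLAG, RNAME, RNEXT). It assumes that HISAT2 sets the "properly paired" bit (0x2) for concordant pairs, and the category names are invented for illustration. It does not answer how HISAT2 measures fragment length across introns.

```python
import sys
from collections import Counter


def classify_record(sam_line):
    """Classify one SAM alignment line; return None for records to skip."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname, rnext = fields[2], fields[6]
    # Count each pair once: keep only primary records of the first mate
    # (0x1 paired, 0x40 first in pair, 0x100 secondary, 0x800 supplementary).
    if not flag & 0x1 or not flag & 0x40 or flag & 0x900:
        return None
    if flag & 0x4 or flag & 0x8:
        return "mate_or_read_unmapped"
    if flag & 0x2:
        return "proper_pair"
    if rnext not in ("=", rname):
        return "different_reference"
    # 0x10: this read is reverse-complemented; 0x20: its mate is.
    if bool(flag & 0x10) == bool(flag & 0x20):
        return "same_strand"
    return "same_reference_other"


def tally(lines):
    """Count pair categories over an iterable of SAM lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        label = classify_record(line)
        if label is not None:
            counts[label] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for label, n in tally(handle).most_common():
            print(label, n, sep="\t")
```

Running it on one low and one high sample and comparing the proportions would show whether the extra discordant pairs are mostly mates on different chromosomes, mates on the same strand, or mates on the same chromosome with an unexpected distance or orientation. For the last group, the TLEN column could then be inspected directly.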
Any advice will be appreciated!
I'm facing the same problem now. Have you found the answer?