Hi all,
I am just starting out with processing and analyzing RNA-seq data. I trim the reads with Trimmomatic and align them to a reference genome using HISAT2. I trimmed the data with two different settings:
one that removes the first 10 bases of each read (HEADCROP:10), and one without this setting.
However, I found that the alignment rates for paired-end reads differ greatly depending on how I trim the data, for example:
with HEADCROP:
43823976 reads; of these:
  43823976 (100.00%) were paired; of these:
    20376984 (46.50%) aligned concordantly 0 times
    22619436 (51.61%) aligned concordantly exactly 1 time
    827556 (1.89%) aligned concordantly >1 times
    ----
    20376984 pairs aligned concordantly 0 times; of these:
      9403592 (46.15%) aligned discordantly 1 time
    ----
    10973392 pairs aligned 0 times concordantly or discordantly; of these:
      21946784 mates make up the pairs; of these:
        12208010 (55.63%) aligned 0 times
        8593679 (39.16%) aligned exactly 1 time
        1145095 (5.22%) aligned >1 times
86.07% overall alignment rate
without HEADCROP:
43953809 reads; of these:
  43953809 (100.00%) were paired; of these:
    7868205 (17.90%) aligned concordantly 0 times
    34738253 (79.03%) aligned concordantly exactly 1 time
    1347351 (3.07%) aligned concordantly >1 times
    ----
    7868205 pairs aligned concordantly 0 times; of these:
      342967 (4.36%) aligned discordantly 1 time
    ----
    7525238 pairs aligned 0 times concordantly or discordantly; of these:
      15050476 mates make up the pairs; of these:
        12437657 (82.64%) aligned 0 times
        2487014 (16.52%) aligned exactly 1 time
        125805 (0.84%) aligned >1 times
85.85% overall alignment rate
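To make the comparison easier to read, here is a small Python sketch that recomputes the headline percentages from the counts in the two summaries. The dictionary keys and the `breakdown` function are illustrative names, not part of HISAT2. The formula for the overall rate (aligned mates divided by total mates) is an assumption, but it reproduces the reported 86.07% and 85.85%.

```python
# Counts copied from the two HISAT2 summaries in this post.
# Key names are made up for this sketch.
summaries = {
    "with HEADCROP": {
        "pairs": 43823976,
        "conc_1": 22619436,      # aligned concordantly exactly 1 time
        "conc_multi": 827556,    # aligned concordantly >1 times
        "disc_1": 9403592,       # aligned discordantly 1 time
        "mate_1": 8593679,       # single mates aligned exactly 1 time
        "mate_multi": 1145095,   # single mates aligned >1 times
    },
    "without HEADCROP": {
        "pairs": 43953809,
        "conc_1": 34738253,
        "conc_multi": 1347351,
        "disc_1": 342967,
        "mate_1": 2487014,
        "mate_multi": 125805,
    },
}


def breakdown(s):
    """Return concordant, discordant and overall percentages for one summary."""
    concordant_pairs = s["conc_1"] + s["conc_multi"]
    # Assumed formula: every mate of an aligned pair counts once,
    # plus mates that aligned only on their own.
    aligned_mates = 2 * (concordant_pairs + s["disc_1"]) + s["mate_1"] + s["mate_multi"]
    total_mates = 2 * s["pairs"]
    return {
        "concordant_pct": round(100 * concordant_pairs / s["pairs"], 2),
        "discordant_pct": round(100 * s["disc_1"] / s["pairs"], 2),
        "overall_pct": round(100 * aligned_mates / total_mates, 2),
    }


for name, counts in summaries.items():
    print(name, breakdown(counts))
```

Seen this way, the overall rate stays at about 86% in both runs. What changes is how the pairs align: with HEADCROP, roughly a fifth of all pairs align discordantly, compared with under 1% without it.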
Why are the "aligned concordantly 0 times" and "aligned concordantly exactly 1 time" rates so different between the two runs?
Will the low rate of "aligned concordantly exactly 1 time" be a problem for downstream analysis (e.g. counting reads per gene with featureCounts)?
I would appreciate your help with this. Thank you very much for your time.
Best,
Yi-Ting Fang
Why did you remove 10 nt from the start of the read? This was clearly unnecessary as HISAT2 will soft-clip alignments when needed - unless otherwise specified. Your alignment rate is very good in the first place. It could be that you trimmed your reads too short and now they multi-map.
Because I found that the base composition (the percentage of A/T/C/G) changes drastically at the beginning of the reads, and I expected that removing 10 nt from the start might improve mapping accuracy. Given how HISAT2 works, is this trimming redundant, or even harmful? Thank you.
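For anyone who wants to look at the same signal, here is a minimal Python sketch of how the per-position base composition at the start of reads could be computed. The function names, the toy reads and the 4-lines-per-record, uncompressed FASTQ layout are assumptions for illustration. This is not how any particular QC tool implements it.

```python
from collections import Counter


def base_composition(reads, n_positions=15):
    """Return one {base: percent} dict per position for the first n_positions."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions].upper()):
            counts[i][base] += 1
    profile = []
    for c in counts:
        total = sum(c.values())
        # Positions that no read reaches get an empty dict.
        profile.append({b: 100 * c[b] / total for b in "ACGTN"} if total else {})
    return profile


def read_fastq_sequences(path, limit=100000):
    """Yield up to `limit` sequences from an uncompressed 4-line-per-record FASTQ."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i // 4 >= limit:
                break
            if i % 4 == 1:  # the second line of each record is the sequence
                yield line.strip()


# Toy reads, made up for illustration only.
toy_reads = ["ACGTAC", "ACGTTT", "TCGAAC", "GCGTAC"]
profile = base_composition(toy_reads, n_positions=3)
for position, percentages in enumerate(profile, start=1):
    print(position, percentages)
```

On real data, `read_fastq_sequences("sample_R1.fastq")` (a hypothetical file name) could be passed in place of `toy_reads` to compare the first positions with the rest of the read.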