Different alignment rates for Hisat2 and STAR, Hisat2 has lower alignment rate and STAR have many multi aligned reads
0
5
Entering edit mode
3.6 years ago
SMILE ▴ 160

Hello everyone!

I have paired-end mouse RNA seq data mapped with Hisat2 and STAR. The alignment rate of one sample is not satisfactory for both methods. I have searched all the possible reasons, but still didn't find a good reason and solution.

Before doing alignment, I trimmed off low quality reads with Trimmomatic, checked the quality of the data with fastQC, and also removed rRNA using sortmeRNA.

One of the samples showed poor alignment about 40% while other samples showed good alignment rates around 70%-90%.

Below please find the command I used for Hisat2 and its results.

hisat2 -p 4 -x  genome --dta-cufflinks -1 MM_1.clean.fastq -2 MM_2.clean.fastq -S MM.Hisat2.sam > MM.summary.txt 2>&1


the MM.summary.txt is shown below:

49089655 reads; of these:

49089655 (100.00%) were paired;

of these:

34997503 (71.29%) aligned concordantly 0 times

12377732 (25.21%) aligned concordantly exactly 1 time

1714420 (3.49%) aligned concordantly >1 times

----
34997503 pairs aligned concordantly 0 times; of these:

3067380 (8.76%) aligned discordantly 1 time

----
31930123 pairs aligned 0 times concordantly or discordantly; of these:

63860246 mates make up the pairs; of these:

59407644 (93.03%) aligned 0 times

2711550 (4.25%) aligned exactly 1 time

1741052 (2.73%) aligned >1 times


39.49% overall alignment rate

Bellow are the command line for STAR by defaut parameters and its result:

STAR --runThreadN 12 --genomeDir ../refmm10/Indices/ --sjdbGTFfile ../genes.gtf --sjdbOverhang 100 --readFilesIn MM_1.clean.fastq MM_2.clean.fastq --outFileNamePrefix MM-STAR


the Log.final.out is shown below:

Number of input reads | 49089655

Average input read length |   179

Uniquely mapped reads number |   13176109

Uniquely mapped reads % |   26.84%

Average mapped length |   178.14

Number of splices: Total |   3753462

Number of splices: Annotated (sjdb) |   3601755

Number of splices: GT/AG |   3709167

Number of splices: GC/AG |   22195

Number of splices: AT/AC |   2175

Number of splices: Non-canonical |   19925

Mismatch rate per base, % |   0.49%

Deletion rate per base |   0.01%

Deletion average length |   1.28

Insertion rate per base |   0.00%

Insertion average length |   1.27

Number of reads mapped to multiple loci |   2476495

% of reads mapped to multiple loci |   5.04%

Number of reads mapped to too many loci |   340273

% of reads mapped to too many loci |   0.69%

% of reads unmapped: too many mismatches |   0.00%

% of reads unmapped: too short |   64.58%

% of reads unmapped: other |   2.84%

Number of chimeric reads |   0

% of chimeric reads |   0.00%


After the change of some parameters the results of STAR seems better but still have many multi aligned reads

Below are the altered command line and its results:

STAR --runThreadN 12 --genomeDir ../refmm10/Indices/ --sjdbGTFfile ../genes.gtf --sjdbOverhang 100 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 0 --outFilterMismatchNmax 2 --readFilesIn MM_1.clean.fastq MM_2.clean.fastq --outFileNamePrefix MM-STAR


The Log.final.out is shown below:

Number of input reads | 49089655

Average input read length |   179

Uniquely mapped reads number |   28696596

Uniquely mapped reads % |   58.46%

Average mapped length |   101.60

Number of splices: Total |   4038784

Number of splices: Annotated (sjdb) |   3754328

Number of splices: GT/AG |   3845904

Number of splices: GC/AG |   24571

Number of splices: AT/AC |   4309

Number of splices: Non-canonical |   164000

Mismatch rate per base, % |   0.60%

Deletion rate per base |   0.01%

Deletion average length |   1.28

Insertion rate per base |   0.00%

Insertion average length |   1.42

Number of reads mapped to multiple loci |   18015809

% of reads mapped to multiple loci |   36.70%

Number of reads mapped to too many loci |   981031

% of reads mapped to too many loci |   2.00%

% of reads unmapped: too many mismatches |   0.00%

% of reads unmapped: too short |   0.00%

% of reads unmapped: other |   2.84%

Number of chimeric reads |   0

% of chimeric reads |   0.00%


Can anyone help me find out the reason of Hisat2 poor alignment rate and STAR'S high rate of % of reads mapped to multiple loci ?

THANS A LOT!

RNA-Seq • 8.2k views
1
Entering edit mode

It's possible that you have contamination from another organism, or other non-genomic contamination. Since you ran FastQC, posting the results would be helpful. Otherwise, I suggest you perform adapter-trimming and also BLAST a couple thousand unaligned reads to see what they are.

0
Entering edit mode

Thank you for your suggestion,Brian. The quality of the reads is quite good(first picture). The adapter content is also very low (second picture). What looks a little strange is the Per sequence GC content (third oicture) and Sequence Duplication Levels (fourth picture). Yes, as you said, I will BLAST some unaligned reads to see what they are. I will post the result after I have done the blast.

Below, please find the figures of the FastQC result.

0
Entering edit mode

It is better to stick with one aligner for a project. NGS aligners have many nuances built in (keep in mind that even if you do not explicitly change some options they all have/use default values, which can affect the final result) and trying to compare results from two aligners may prove a futile exercise.

BTW: Are you sure SortMeRNA has done its job well? Did you have a lot of rRNA contamination?

1
Entering edit mode

After several attempts, I finally get a better alignment with the data without any filteration (it seems after filteration, I get poorer alignment)...Uniquely mapped reads from 58.46% to 63.83% and multi mappers from 36.70% to 31.10%...Not big improvement, but it seems better. Other samples' uniquely mapped reads are around 80% and multi mappers are around 15%. Without rRNA contamination and other species' contamination. I suppose the multi mappers are caused by "too deep sequencing", since there are more than 50 million 101 * 101 bp pair-ends reads per sample, around 11 billion of bases sequenced per sample. But it still cannot explain why this sample has more multi mappers than others...

0
Entering edit mode

The reason why I use different aligners is that I want to select a best pipeline for my data to do the downstream differential expressed genes (DEG) analysis. I really see a significant difference of DEG results between different aligners. I am quite disappointed by this, I hope they can get consistent downstream resluts. If the first alignment step has so much influence on the downstream analysis, how much confidence can I have in the final results if I just blindly choose one... It confused me.

For the SortMeRNA, the command line and the last lines of log file can be seen below. I used Hisat2 to align the data to the mouse rRNA, the result is none of the reads aligned to the mouse rRNA. Maybe Hisat2 is not a good tool for this test? I then use BWA to do the test. I will post it if here if there are some intersting results.

This is the command I use:

./sortmerna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-db:./rRNA_databases/silva-arc-16s-id95.fasta,./index/silva-arc-16s-db:./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-db:./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-db:./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s:./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-db:./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s-db --reads ../data/"x"-interleaved.fq --sam --num_alignments 1 --fastx --aligned ../data/"x"_rRNA --other ../data/"\$x"_non_rRNA --log -v --paired_in

This is the last lines of the log file:

Results:

Total reads passing E-value threshold = 3849425 (3.46%)

Total reads failing E-value threshold = 107434081 (96.54%)

By database:

./rRNA_databases/silva-bac-16s-id90.fasta       0.22%

./rRNA_databases/silva-bac-23s-id98.fasta       0.30%

./rRNA_databases/silva-arc-16s-id95.fasta       0.02%

./rRNA_databases/silva-arc-23s-id98.fasta       0.03%

./rRNA_databases/silva-euk-18s-id95.fasta       1.03%

./rRNA_databases/silva-euk-28s-id98.fasta       1.76%

./rRNA_databases/rfam-5s-database-id98.fasta        0.09%

./rRNA_databases/rfam-5.8s-database-id98.fasta      0.00%

0
Entering edit mode

STAR is an aligner many like so you may want to take that into consideration. If you are up for it then I will suggest that you try bbmap.sh from BBMap suite. @Brian is the author of that program. You would want to take into consideration ambig= option when considering what to do with the multi-mappers (they may not necessarily be rRNA). It should produce results as good or better than any other splice-aware aligner out there.

0
Entering edit mode

Thank you for this advice genomax! I will try this tool out and see what will happen. It really confuses me that this sample has so poor alignment rate and so many multi-mappers, when everything else seems good. I will post the results to see what BBMap can do to help me analyze this problem.

0
Entering edit mode

Your samples probably differ significantly in rRNA contamination. Do you have sorted files of rRNA reads for two samples? Do these files differ significantly in size? What is the size of kept file and discarded file for each sample?

0
Entering edit mode

I used SortMeRNA to filter the rRNA, the command line and the last lines of log file can be seen on the last reply above. After SOrtMeRNA the sizes of kept files of non_rRNA_1.fq are around 13 G, no big difference between poorest sample and the best alignmet samples. The sizes of discarded files of rRNA.fq are 1.1G to 2.0G, not so much difference? What do you mean by "sorted files of rRNA reads for two samples" in your first question?