Question: Different alignment rates for Hisat2 and STAR, Hisat2 has lower alignment rate and STAR have many multi aligned reads
3
gravatar for Claire
29 days ago by
Claire40
Germany
Claire40 wrote:

Hello everyone!

I have paired-end mouse RNA seq data mapped with Hisat2 and STAR. The alignment rate of one sample is not satisfactory for both methods. I have searched all the possible reasons, but still didn't find a good reason and solution.

Please help me analyze this problem.

Before doing alignment, I trimmed off low quality reads with Trimmomatic, checked the quality of the data with fastQC, and also removed rRNA using sortmeRNA.

One of the samples showed poor alignment about 40% while other samples showed good alignment rates around 70%-90%.

Below please find the command I used for Hisat2 and its results.

hisat2 -p 4 -x  genome --dta-cufflinks -1 MM_1.clean.fastq -2 MM_2.clean.fastq -S MM.Hisat2.sam > MM.summary.txt 2>&1

the MM.summary.txt is shown below:

49089655 reads; of these:

49089655 (100.00%) were paired;

of these:

34997503 (71.29%) aligned concordantly 0 times

12377732 (25.21%) aligned concordantly exactly 1 time

1714420 (3.49%) aligned concordantly >1 times

----
34997503 pairs aligned concordantly 0 times; of these:

  3067380 (8.76%) aligned discordantly 1 time

----
31930123 pairs aligned 0 times concordantly or discordantly; of these:

  63860246 mates make up the pairs; of these:

    59407644 (93.03%) aligned 0 times

    2711550 (4.25%) aligned exactly 1 time

    1741052 (2.73%) aligned >1 times

39.49% overall alignment rate

Bellow are the command line for STAR by defaut parameters and its result:

STAR --runThreadN 12 --genomeDir ../refmm10/Indices/ --sjdbGTFfile ../genes.gtf --sjdbOverhang 100 --readFilesIn MM_1.clean.fastq MM_2.clean.fastq --outFileNamePrefix MM-STAR

the Log.final.out is shown below:

Number of input reads | 49089655

                  Average input read length |   179

                                UNIQUE READS:

               Uniquely mapped reads number |   13176109

                    Uniquely mapped reads % |   26.84%

                      Average mapped length |   178.14

                   Number of splices: Total |   3753462

        Number of splices: Annotated (sjdb) |   3601755

                   Number of splices: GT/AG |   3709167

                   Number of splices: GC/AG |   22195

                   Number of splices: AT/AC |   2175

           Number of splices: Non-canonical |   19925

                  Mismatch rate per base, % |   0.49%

                     Deletion rate per base |   0.01%

                    Deletion average length |   1.28

                    Insertion rate per base |   0.00%

                   Insertion average length |   1.27

                         MULTI-MAPPING READS:

    Number of reads mapped to multiple loci |   2476495

         % of reads mapped to multiple loci |   5.04%

    Number of reads mapped to too many loci |   340273

         % of reads mapped to too many loci |   0.69%

                              UNMAPPED READS:

   % of reads unmapped: too many mismatches |   0.00%

             % of reads unmapped: too short |   64.58%

                 % of reads unmapped: other |   2.84%

                              CHIMERIC READS:

                   Number of chimeric reads |   0

                        % of chimeric reads |   0.00%

After the change of some parameters the results of STAR seems better but still have many multi aligned reads

Below are the altered command line and its results:

STAR --runThreadN 12 --genomeDir ../refmm10/Indices/ --sjdbGTFfile ../genes.gtf --sjdbOverhang 100 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 0 --outFilterMismatchNmax 2 --readFilesIn MM_1.clean.fastq MM_2.clean.fastq --outFileNamePrefix MM-STAR

The Log.final.out is shown below:

Number of input reads | 49089655

                  Average input read length |   179

                                UNIQUE READS:

               Uniquely mapped reads number |   28696596

                    Uniquely mapped reads % |   58.46%

                      Average mapped length |   101.60

                   Number of splices: Total |   4038784

        Number of splices: Annotated (sjdb) |   3754328

                   Number of splices: GT/AG |   3845904

                   Number of splices: GC/AG |   24571

                   Number of splices: AT/AC |   4309

           Number of splices: Non-canonical |   164000

                  Mismatch rate per base, % |   0.60%

                     Deletion rate per base |   0.01%

                    Deletion average length |   1.28

                    Insertion rate per base |   0.00%

                   Insertion average length |   1.42

                         MULTI-MAPPING READS:

    Number of reads mapped to multiple loci |   18015809

         % of reads mapped to multiple loci |   36.70%

    Number of reads mapped to too many loci |   981031

         % of reads mapped to too many loci |   2.00%

                              UNMAPPED READS:

   % of reads unmapped: too many mismatches |   0.00%

             % of reads unmapped: too short |   0.00%

                 % of reads unmapped: other |   2.84%

                              CHIMERIC READS:

                   Number of chimeric reads |   0

                        % of chimeric reads |   0.00%

Can anyone help me find out the reason of Hisat2 poor alignment rate and STAR'S high rate of % of reads mapped to multiple loci ?

THANS A LOT!

rna-seq • 334 views
ADD COMMENTlink modified 29 days ago by Satyajeet Khare1.1k • written 29 days ago by Claire40
1

It's possible that you have contamination from another organism, or other non-genomic contamination. Since you ran FastQC, posting the results would be helpful. Otherwise, I suggest you perform adapter-trimming and also BLAST a couple thousand unaligned reads to see what they are.

ADD REPLYlink written 28 days ago by Brian Bushnell14k

Thank you for your suggestion,Brian. The quality of the reads is quite good(first picture). The adapter content is also very low (second picture). What looks a little strange is the Per sequence GC content (third oicture) and Sequence Duplication Levels (fourth picture). Yes, as you said, I will BLAST some unaligned reads to see what they are. I will post the result after I have done the blast.

Below, please find the figures of the FastQC result.

ADD REPLYlink modified 28 days ago • written 28 days ago by Claire40

It is better to stick with one aligner for a project. NGS aligners have many nuances built in (keep in mind that even if you do not explicitly change some options they all have/use default values, which can affect the final result) and trying to compare results from two aligners may prove a futile exercise.

BTW: Are you sure SortMeRNA has done its job well? Did you have a lot of rRNA contamination?

ADD REPLYlink modified 29 days ago • written 29 days ago by genomax34k
1

After several attempts, I finally get a better alignment with the data without any filteration (it seems after filteration, I get poorer alignment)...Uniquely mapped reads from 58.46% to 63.83% and multi mappers from 36.70% to 31.10%...Not big improvement, but it seems better. Other samples' uniquely mapped reads are around 80% and multi mappers are around 15%. Without rRNA contamination and other species' contamination. I suppose the multi mappers are caused by "too deep sequencing", since there are more than 50 million 101 * 101 bp pair-ends reads per sample, around 11 billion of bases sequenced per sample. But it still cannot explain why this sample has more multi mappers than others...

ADD REPLYlink written 23 days ago by Claire40

Thank you for your answer, genomax!

The reason why I use different aligners is that I want to select a best pipeline for my data to do the downstream differential expressed genes (DEG) analysis. I really see a significant difference of DEG results between different aligners. I am quite disappointed by this, I hope they can get consistent downstream resluts. If the first alignment step has so much influence on the downstream analysis, how much confidence can I have in the final results if I just blindly choose one... It confused me.

For the SortMeRNA, the command line and the last lines of log file can be seen below. I used Hisat2 to align the data to the mouse rRNA, the result is none of the reads aligned to the mouse rRNA. Maybe Hisat2 is not a good tool for this test? I then use BWA to do the test. I will post it if here if there are some intersting results.

This is the command I use:

./sortmerna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-db:./rRNA_databases/silva-arc-16s-id95.fasta,./index/silva-arc-16s-db:./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-db:./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-db:./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s:./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-db:./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s-db --reads ../data/"$x"-interleaved.fq --sam --num_alignments 1 --fastx --aligned ../data/"$x"_rRNA --other ../data/"$x"_non_rRNA --log -v --paired_in

This is the last lines of the log file:

Results:

Total reads = 111283506

Total reads passing E-value threshold = 3849425 (3.46%)

Total reads failing E-value threshold = 107434081 (96.54%)

Minimum read length = 101

Maximum read length = 101

Mean read length = 101

 By database:

./rRNA_databases/silva-bac-16s-id90.fasta       0.22%

./rRNA_databases/silva-bac-23s-id98.fasta       0.30%

./rRNA_databases/silva-arc-16s-id95.fasta       0.02%

./rRNA_databases/silva-arc-23s-id98.fasta       0.03%

./rRNA_databases/silva-euk-18s-id95.fasta       1.03%

./rRNA_databases/silva-euk-28s-id98.fasta       1.76%

./rRNA_databases/rfam-5s-database-id98.fasta        0.09%

./rRNA_databases/rfam-5.8s-database-id98.fasta      0.00%
ADD REPLYlink modified 28 days ago • written 28 days ago by Claire40

STAR is an aligner many like so you may want to take that into consideration. If you are up for it then I will suggest that you try bbmap.sh from BBMap suite. @Brian is the author of that program. You would want to take into consideration ambig= option when considering what to do with the multi-mappers (they may not necessarily be rRNA). It should produce results as good or better than any other splice-aware aligner out there.

ADD REPLYlink written 28 days ago by genomax34k

Thank you for this advice genomax! I will try this tool out and see what will happen. It really confuses me that this sample has so poor alignment rate and so many multi-mappers, when everything else seems good. I will post the results to see what BBMap can do to help me analyze this problem.

ADD REPLYlink written 28 days ago by Claire40

Your samples probably differ significantly in rRNA contamination. Do you have sorted files of rRNA reads for two samples? Do these files differ significantly in size? What is the size of kept file and discarded file for each sample?

ADD REPLYlink modified 29 days ago • written 29 days ago by Satyajeet Khare1.1k

Thank you for your answer, Satyajeet.

I used SortMeRNA to filter the rRNA, the command line and the last lines of log file can be seen on the last reply above. After SOrtMeRNA the sizes of kept files of non_rRNA_1.fq are around 13 G, no big difference between poorest sample and the best alignmet samples. The sizes of discarded files of rRNA.fq are 1.1G to 2.0G, not so much difference? What do you mean by "sorted files of rRNA reads for two samples" in your first question?

ADD REPLYlink written 28 days ago by Claire40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1542 users visited in the last hour