STAR aligner maps 0 reads to multiple loci, but 3.7% to too many loci - meaning?
1
1
Entering edit mode
18 months ago
Anand Rao ▴ 360

Hi forum experts:

I am a first time STAR user, and also brand new to RNA-Seq mapping. And I seek some help with understanding my STAR ailgner results logfile, please. I understand STAR’s default parameters are optimized for mammalian genomes, so I should mention here that I am analyzing a plant genome / transcriptome.

Question 1 Specifically, what I find hard to understand is the section under under MULTI-MAPPING READS (that I am summarizing into 1 line):

Number of reads mapped to multiple loci | 0 = 0.00% Vs. Number of reads mapped to too many loci | 639919 = 3.72%

That is not normal, is it?

Question 2 For maximum and minimum intron size (that I did not specify in my test run), can I use the GTF file contents to empirically infer the shortest and longest annotated intron and use those instead of the default values? Or is there a better / more correct rationale to set these values?

The analyses steps I followed thus far, are detailed below. Any help in A. clarifying my confusion would be appreciated, and B. pointing out any other (potential) red flags in the logfile would also be helpful.

THANKS!

1. Raw reads pre-processed using BBmap pipeline (syntax not included)

Step 1-1. rename.sh = Rename SRR data files for BBmap compatibility
Step 1-2. clumpify.sh = Remove optical duplicates
Step 1-3. filterbytile.sh = Remove poor tiles in Illumina reads
Step 1-4. bbduk.sh = Adapter and quality trimming
Step 1-5. bbduk.sh = Filter out artifacts and PhiX spike-ins
Step 1-6. bbmap.sh = Filter out rRNA
Step 1-7. bbmap.sh = Filter out human and common bacterial contaminants
Step 1-8. bbsplit.sh = Split mapping to host genome versus symbiont genome
Step 1-9. Prepare final processed file as input for STAR aligner


2. Genome indexing using STAR

STAR --runThreadN 36 --runMode genomeGenerate --genomeDir \
./STAR_Genome_Index_MtrunA17r5.0-ANR/ --runDirPerm All_RWX --genomeFastaFiles \
./MtrunA17r5.0-ANR_Aug2019/MtrunA17r5.0-20161119-ANR.fasta --genomeChrBinNbits 18 \
--sjdbGTFfile ./MtrunA17r5.0-ANR_Aug2019/MtrunA17r5.0-ANR-EGN-r1.6.gtf --sjdbOverhang 99 --genomeSAindexNbases 13


3. RNA-Seq read mapping using STAR

STAR \
--runDirPerm All_RWX \
--genomeDir ./STAR_Genome_Index_MtrunA17r5.0-ANR/ \
--outFileNamePrefix SRR1726611_BBmap_STARmapping/ \
--outTmpDir SRR1726611_BBmap_STARmapping_TEMP/ \
--outTmpKeep None \
--outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate \
--twopassMode Basic \


STAR ALIGNMENT LOGFILE CONTENTS Started job on | Oct 10 17:50:37 Started mapping on | Oct 10 17:52:31 Finished on | Oct 10 17:54:14 Mapping speed, Million of reads per hour | 601.97

                          Number of input reads |   17223013
Average input read length |   99
Uniquely mapped reads number |   16275665
Uniquely mapped reads % |   94.50%
Average mapped length |   99.45
Number of splices: Total |   5603911
Number of splices: Annotated (sjdb) |   5587909
Number of splices: GT/AG |   5546577
Number of splices: GC/AG |   44931
Number of splices: AT/AC |   4167
Number of splices: Non-canonical |   8236
Mismatch rate per base, % |   0.36%
Deletion rate per base |   0.00%
Deletion average length |   1.36
Insertion rate per base |   0.00%
Insertion average length |   1.11
Number of reads mapped to multiple loci |   0
% of reads mapped to multiple loci |   0.00%
Number of reads mapped to too many loci |   639919
% of reads mapped to too many loci |   3.72%
Number of reads unmapped: too many mismatches |   0
% of reads unmapped: too many mismatches |   0.00%
Number of reads unmapped: too short |   306711
% of reads unmapped: too short |   1.78%
Number of reads unmapped: other |   718
% of reads unmapped: other |   0.00%
Number of chimeric reads |   0
% of chimeric reads |   0.00%

STAR RNA-Seq mapping • 1.4k views
3
Entering edit mode

a non exhaustive, short comment: STAR dismisses reads mapped to "too many loci" while it keeps "mutlimappers". There's a parameter to control the "too many loci" threshold, I just don't know it by heart...

2
Entering edit mode

It's --outFilterMultimapNmax and it is set to 1. Increasing this parameter will report more multi mapping reads in your BAM file.

0
Entering edit mode

Looks like default is --outFilterMultimapNmax 10,

So should I increment it's value from 1 to 10, while checking for changes in logfile contents or some other output file (*.bam ?) ? And / Or is there a way to empirically determine what is best outFilterMultimapNmax value for a certain input file?

I'd like to remind you about the other questions in my original post well (intron min/max lengths, any other issues in the logfile), thank you !

0
Entering edit mode

Since you are mostly working with BBTools any reason why you chose to not use bbmap.sh for these alignments?

3
Entering edit mode
18 months ago
h.mon 32k

Just to put more clearly what has already been explained above by Carambakaracho and michael.ante :

Number of reads mapped to multiple loci | 0 = 0.00% Vs. Number of reads mapped to too many loci | 639919 = 3.72%

That is not normal, is it?

This is a direct consequence of the --outFilterMultimapNmax 1 setting you imposed to STAR.

With STAR default settings, "uniquely mapped reads" map to just one location, "reads mapped to multiple loci" are reads mapping to between 2 and 10 locations, and "reads mapped to too many loci" are reads mapping to more than 10 locations.

With your modified setting --outFilterMultimapNmax 1, reads mapping to 2 locations or more are already classified as "reads mapped to too many loci", and there won't be any "reads mapped to multiple loci".

I tend to leave STAR settings at its defaults, unless there is specific reason to modify them. For this particular file, the effect of the --outFilterMultimapNmax 1 is probably minimal, as you have only 3.72% of reads mapping more than once.

1
Entering edit mode

Thanks very much for your simple and elegant explanation. I played around with --outFilterMultimapNmax values from 1 - 20, just to see what happens with the categories of mapped reads - unique vs. multiple vs. too many. And in congruence with your explanation, when outFilterMultimapNmax is increased, the % of reads with too many matches decreases. As shown in the images below.

I do have 3 follow-up questions though.

1. What are STAR parameter(s) that ought to be empirically determined, rather than used at default? (how about min/max length of intron, a question in my original post above)

2. From my summary screenshots shown above, in your opinion, does it appear --outFilterMultimapNmax=10 is best, or some other value? And why?

3. Why is there even such a parameter as MultimapNmax in STAR aligner? Is it due to biological reason(s) or computational consideration(s) or both?

0
Entering edit mode

What downstream analyses you intend to perform?

Regarding intron size, yes, it would be reasonable to get an empirical distribution of their sizes from the annotation you have, then compare it with the intron sizes from the human genome annotation - as STAR has been likely optimized to the human genome. If the intron sizes of the plant species you study are much greater than the human genome intron sizes, you could tweak the STAR parameter controlling it.

The only STAR setting I ever tweaked was exactly --outFilterMultimapNmax. I increased it to a big number, as I was interested in mapping to very repetitive regions of the genome.

0
Entering edit mode

I have 100+ single end SE Illumina libraries across 5 genotypes, 12 time points, 4 replicates, that I want to run differential RNA-Seq time-series analyses on. And I'd like to learn to perform all analyses steps (including QC for these steps), without any shortcuts :)

Thanks for validating use of empirical intron lengths from already available GFF/GTF annotation info, to specify min and mx intron lengths for my STAR mapping runs.

I am interesting in learning about some different scenarios under which parameters are altered from default (like your example of large --outFilterMultimapNmax to allow mapping to repetitive regions of the genome, that are rarely genic). If you know of any link other than this 2016 paper, please let me know: Optimizing RNA-Seq mapping with STAR link (I've looked at the papers from Dobin et al.)

Thanks!