Hi forum experts:
I am a first time STAR user, and also brand new to RNA-Seq mapping. And I seek some help with understanding my STAR ailgner results logfile, please. I understand STAR’s default parameters are optimized for mammalian genomes, so I should mention here that I am analyzing a plant genome / transcriptome.
Question 1 Specifically, what I find hard to understand is the section under under MULTI-MAPPING READS (that I am summarizing into 1 line):
Number of reads mapped to multiple loci | 0 = 0.00% Vs. Number of reads mapped to too many loci | 639919 = 3.72%
That is not normal, is it?
Question 2 For maximum and minimum intron size (that I did not specify in my test run), can I use the GTF file contents to empirically infer the shortest and longest annotated intron and use those instead of the default values? Or is there a better / more correct rationale to set these values?
The analyses steps I followed thus far, are detailed below. Any help in A. clarifying my confusion would be appreciated, and B. pointing out any other (potential) red flags in the logfile would also be helpful.
1. Raw reads pre-processed using BBmap pipeline (syntax not included)
Step 1-1. rename.sh = Rename SRR data files for BBmap compatibility Step 1-2. clumpify.sh = Remove optical duplicates Step 1-3. filterbytile.sh = Remove poor tiles in Illumina reads Step 1-4. bbduk.sh = Adapter and quality trimming Step 1-5. bbduk.sh = Filter out artifacts and PhiX spike-ins Step 1-6. bbmap.sh = Filter out rRNA Step 1-7. bbmap.sh = Filter out human and common bacterial contaminants Step 1-8. bbsplit.sh = Split mapping to host genome versus symbiont genome Step 1-9. Prepare final processed file as input for STAR aligner
2. Genome indexing using STAR
STAR --runThreadN 36 --runMode genomeGenerate --genomeDir \ ./STAR_Genome_Index_MtrunA17r5.0-ANR/ --runDirPerm All_RWX --genomeFastaFiles \ ./MtrunA17r5.0-ANR_Aug2019/MtrunA17r5.0-20161119-ANR.fasta --genomeChrBinNbits 18 \ --sjdbGTFfile ./MtrunA17r5.0-ANR_Aug2019/MtrunA17r5.0-ANR-EGN-r1.6.gtf --sjdbOverhang 99 --genomeSAindexNbases 13
3. RNA-Seq read mapping using STAR
STAR \ --runMode alignReads \ --runDirPerm All_RWX \ --genomeDir ./STAR_Genome_Index_MtrunA17r5.0-ANR/ \ --readFilesIn /home/aksrao/Mtr_Nod_Time_Course/FASTQ/SRR1726611/SRR1726611_BBmap_RawReadsProcessedFinal.fastq.gz \ --readFilesCommand zcat \ --outFileNamePrefix SRR1726611_BBmap_STARmapping/ \ --outTmpDir SRR1726611_BBmap_STARmapping_TEMP/ \ --outTmpKeep None \ --outFilterMultimapNmax 1 \ --outReadsUnmapped Fastx \ --outSAMtype BAM SortedByCoordinate \ --twopassMode Basic \ --runThreadN 24
STAR ALIGNMENT LOGFILE CONTENTS Started job on | Oct 10 17:50:37 Started mapping on | Oct 10 17:52:31 Finished on | Oct 10 17:54:14 Mapping speed, Million of reads per hour | 601.97
Number of input reads | 17223013 Average input read length | 99 UNIQUE READS: Uniquely mapped reads number | 16275665 Uniquely mapped reads % | 94.50% Average mapped length | 99.45 Number of splices: Total | 5603911 Number of splices: Annotated (sjdb) | 5587909 Number of splices: GT/AG | 5546577 Number of splices: GC/AG | 44931 Number of splices: AT/AC | 4167 Number of splices: Non-canonical | 8236 Mismatch rate per base, % | 0.36% Deletion rate per base | 0.00% Deletion average length | 1.36 Insertion rate per base | 0.00% Insertion average length | 1.11 MULTI-MAPPING READS: Number of reads mapped to multiple loci | 0 % of reads mapped to multiple loci | 0.00% Number of reads mapped to too many loci | 639919 % of reads mapped to too many loci | 3.72% UNMAPPED READS: Number of reads unmapped: too many mismatches | 0 % of reads unmapped: too many mismatches | 0.00% Number of reads unmapped: too short | 306711 % of reads unmapped: too short | 1.78% Number of reads unmapped: other | 718 % of reads unmapped: other | 0.00% CHIMERIC READS: Number of chimeric reads | 0 % of chimeric reads | 0.00%