I am new to RNA-seq analysis and was hoping the community could help me understand a few things about it. I have 16 human samples (8 pre-treatment and 8 post-treatment), and I am trying to find genes that are differentially expressed between these two groups.
I aligned my data with STAR 2.5.2a using these parameters:
STAR --runThreadN 16 --runMode alignReads --genomeDir star-genome \
  --readFilesIn R1.fastq R2.fastq --outSAMtype BAM SortedByCoordinate \
  --twopassMode Basic --outSAMattrIHstart 0 --outReadsUnmapped Fastx \
  --quantMode GeneCounts TranscriptomeSAM --outWigType wiggle
I have read that with STAR an alignment rate of 80-90% is expected for human data. In my data set, 13 out of 16 samples have a lower rate of uniquely mapped reads (58-75%), and for these samples the percentage of reads mapped to multiple loci ranges from 18% to 35%. The library prep was TruSeq Stranded RNA-seq with rRNA depletion.
1) a. Is the higher multi-mapping due to insufficient rRNA depletion? The output for one of the samples is at the end of this post. For that sample I checked how many reads mapped to one of the rRNA loci, chrUn_GL000220:
GL000220.1 161802 479866 0
Is the number 479866 too high? I have read on forums that some people recommend proceeding with the analysis without worrying about rRNA, while others say filtering out rRNA reads is a good idea. For the output below, is it okay to ignore the rRNA reads (or the 30% multi-mapping) and move on with further analysis? Why?
b. What other reasons could there be for high multi-mapping?
c. Should I adjust some parameters in the STAR command to get a better alignment?
2) When deciding the next step based on the numbers (number of input reads, % uniquely mapped, % multi-mapped), at what point is it acceptable to proceed with differential expression analysis (DEA)? What kind of numbers give enough power for downstream analysis?
Number of input reads | 24316914
Average input read length | 150
UNIQUE READS:
Uniquely mapped reads number | 14992526
Uniquely mapped reads % | 61.65%
Average mapped length | 150.14
Number of splices: Total | 7431072
Number of splices: Annotated (sjdb) | 7422873
Number of splices: GT/AG | 7373882
Number of splices: GC/AG | 41453
Number of splices: AT/AC | 4380
Number of splices: Non-canonical | 11357
Mismatch rate per base, % | 0.60%
Deletion rate per base | 0.01%
Deletion average length | 1.55
Insertion rate per base | 0.00%
Insertion average length | 1.41
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 7311806
% of reads mapped to multiple loci | 30.07%
Number of reads mapped to too many loci | 59278
% of reads mapped to too many loci | 0.24%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 7.60%
% of reads unmapped: other | 0.44%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
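To put the GL000220.1 count in proportion, here is a rough Python sketch that reads the "key | value" lines of the log output above and expresses that count as a share of the input and multi-mapped reads. The helper names (parse_star_log, percent) are my own and not part of STAR. It also assumes the per-contig count is directly comparable to STAR's read counts, which may not hold if that count tallies alignment records (each mate and each multi-mapped location separately), so the shares should be read as a rough estimate only.

```python
def parse_star_log(text):
    """Collect 'key | value' lines from a STAR Log.final.out into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' have no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Excerpt of the log shown in the post; a real run would read Log.final.out instead.
log_text = """
Number of input reads | 24316914
UNIQUE READS:
Uniquely mapped reads number | 14992526
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 7311806
"""

stats = parse_star_log(log_text)
n_input = int(stats["Number of input reads"])
n_unique = int(stats["Uniquely mapped reads number"])
n_multi = int(stats["Number of reads mapped to multiple loci"])

# Count reported for GL000220.1 in the post (assumed comparable to read counts).
rrna_contig_count = 479866

pct_unique = percent(n_unique, n_input)
pct_multi = percent(n_multi, n_input)
pct_rrna_of_input = percent(rrna_contig_count, n_input)
pct_rrna_of_multi = percent(rrna_contig_count, n_multi)

print(f"uniquely mapped:          {pct_unique:.2f}% of input")
print(f"multi-mapped:             {pct_multi:.2f}% of input")
print(f"GL000220.1 vs input:      {pct_rrna_of_input:.2f}%")
print(f"GL000220.1 vs multi-mapped: {pct_rrna_of_multi:.2f}%")
```

The first two figures should reproduce the 61.65% and 30.07% in the log, which is a quick check that the parsing is right before trusting the other two.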