Question

STAR output - high multi-mapping, low alignment (human)

1

Entering edit mode

7.1 years ago

anikb ▴ 40

Hello,

I am new to RNA seq analysis and was hoping if the community would help me understand a few things about the analysis. I have 16 human samples (8 samples pre-treatment and 8 samples post treatment), and I am trying to compare genes deferentially expressed between these two groups.

I aligned my data using STAR 2.5.2a using these parameters:

STAR --runThreadN 16 --runMode alignReads --genomeDir star-genome \
  --readFilesIn R1.fastq R2.fastq --outSAMtype BAM SortedByCoordinate \
  --twopassMode Basic --outSAMattrIHstart 0 --outReadsUnmapped Fastx \
  --quantMode GeneCounts TranscriptomeSAM --outWigType wiggle

I read that- with STAR, an alignment of 80-90% is expected for human data. For my data sets, 13 out of 16 have lower alignment ranging from (58 - 75%), and for these samples the % of reads mapped to multiple loci range from (18 - 35%). The RNA-seq protocol used was Truseq stranded rna seq, rRNA depletion method.

1) a. Is the higher multi-mapping due to insufficient rRNA depletion? Below (end of the post) is the output for one of the samples, and for this I checked how many reads mapped to one of the rRNA locus chrUn_GL000220 - --GL000220.1 161802 479866 0 .. Is this number 479866 too high? I have read across forums that some people recommend proceeding with the analysis without worrying about rRNA and some say filtering out rRNA is a good idea. For my output below, is it okay to ignore the rRNA reads (or the 30% multi-mapping) and move on with the further analysis? Why? b. What other reasons could there be for high multi-mapping? c. Should I adjust some parameters in the STAR command to get a better alignment?

2) When it comes to deciding the next step based on numbers (# of input reads, % of uniquely mapped, % of multi-mapped), when is it fairly acceptable to proceed with DEA? What kind of numbers will give enough power for downstream analysis?

Number of input reads | 24316914
Average input read length | 150
UNIQUE READS:
Uniquely mapped reads number |  14992526
Uniquely mapped reads % |   61.65%
Average mapped length | 150.14
Number of splices: Total |  7431072
Number of splices: Annotated (sjdb) |   7422873      
Number of splices: GT/AG |  7373882
Number of splices: GC/AG |  41453
Number of splices: AT/AC |  4380
Number of splices: Non-canonical |  11357
Mismatch rate per base, % | 0.60%
Deletion rate per base |    0.01%
Deletion average length |   1.55
Insertion rate per base |   0.00%
Insertion average length |  1.41
MULTI-MAPPING READS:
Number of reads mapped to multiple loci |   7311806
% of reads mapped to multiple loci |    30.07%
Number of reads mapped to too many loci |   59278
% of reads mapped to too many loci |    0.24%
UNMAPPED READS:
% of reads unmapped: too many mismatches |  0.00%
% of reads unmapped: too short |    7.60%
% of reads unmapped: other |    0.44%
CHIMERIC READS:
Number of chimeric reads |  0
% of chimeric reads |   0.00%

Thanks!

rna-seq alignment • 13k views

ADD COMMENT • link updated 7.1 years ago by h.mon 35k • written 7.1 years ago by anikb ▴ 40

0

Entering edit mode

Hello bandita.adhikari,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY • link 7.1 years ago by h.mon 35k

0

Entering edit mode

Thank you very much! Since this is one of my first couple posts, I overlooked the formatting. Apologies. I will use the formatting bar from next time.

ADD REPLY • link 7.1 years ago by anikb ▴ 40

score 11 · Answer 1 · 2018-06-01

11

Entering edit mode

7.1 years ago

h.mon 35k

For me, the best - fast and sensitive - method for checking for rRNA contamination is bbduk (see How to identify 16s sequences from binning data(contigs)? thread). The number you got from mapping to the chrUn_GL000220 locus ( 479866 / 24316914 = 0.02; or 1.97%) is low and is not even close to explaining your multi-mapping rate of 30.07%, but honestly I don't know if this method is reliable (it may be).

I've seen low quality RNA generate high rRNA contamination. In those cases, the analysis was compromised, because the degraded RNA affected transcript distribution, and the samples with rRNA didn't cluster within treatment, but mostly between themselves. Regarding if it is appropriate to proceed with the analysis or not in your case, a PCA / MDS plot will help a lot to address this question. Where the samples with high rate of multi-mappers cluster? If they cluster at the expected places, then the regular rules of thumb apply: between 10-20 million reads for differential gene expression, 50-100 million reads for differential transcript expression.

I would stick with default STAR settings, but I am pretty conservative. If you feel adventurous, you can try to tweak the parameters and someone else can help, but not me.

edit: if you are using STAR gene counts, it doens't matter if you filter rRNA or not, as they are not included on the counts table.

ADD COMMENT • link 7.1 years ago by h.mon 35k

1

Entering edit mode

I see. Thank you, h.mon. This is extremely helpful!

ADD REPLY • link 7.1 years ago by anikb ▴ 40

0

Entering edit mode

if you are using STAR gene counts, it doens't matter if you filter rRNA or not, as they are not included on the counts table

Anything that is present in the GTF should be included.

Although defining rRNA is not trivial (in the context of RNA-seq): RNA-seq rRNA contamination

ADD REPLY • link 7.1 years ago by igor 13k

2

Entering edit mode

Sorry, I wasn't clear enough: rRNA counts are not included in the final counts because rRNA genes have multiple copies, hence they map to multiple genomic locations, and STAR (and HTSeq and featureCounts by default) discards multi-mappers counts.

ADD REPLY • link 7.1 years ago by h.mon 35k

0

Entering edit mode

So does this mean that STAR/HTSEQ/featurecounts discards 30.07% (reads mapped to multiple loci) of the data in this case?

ADD REPLY • link 7.1 years ago by BioinfGuru ★ 2.1k

0

Entering edit mode

Yes, exactly. I know featureCounts has a one parameter to include multi-mappers and one or two controlling the behaviour of multi-mappers counting, but I would advise against using these options. The best way to count multi-mappers is to use Salmon / kallisto / RSEM, which use an expectation-maximization algorithm to optimally distribute multi-mappers counts between multiple transcripts / genes.

ADD REPLY • link 7.1 years ago by h.mon 35k

0

Entering edit mode

Yes. Most rRNA will be discarded due to multi-mapping, but there are still some unique mappers.

ADD REPLY • link 7.1 years ago by igor 13k

score 3 · Answer 2 · 2018-06-01

61.65% uniquely mapped reads

30.07% of reads mapped to multiple loci

I don't see the problem... you have >90% mapping. Not sure if 60:30 ratio should be more like 80:10. Although I'd like to hear what others have to say. I use low mapping % as an indicator of contamination or failed alignment. I would be happy with this and not consider it any further. I dare say it would be a pretty impressive tool that would uniquely map everything.

It would be a very bad idea to downsize your data based on anything other than random selection - that's a fast track to introducing bias.. So no do not do that. Downstream analysis power comes mainly from read depth, coverage and no. of replicates. So you should be fine.

I'm definitely open to correction if anyone more experienced spots something wrong here.