Entering edit mode
8.2 years ago
aggregatibacter
▴
180
Hi,
I have Illumina paired-end RNA-Seq data (prepared with the TruSeq stranded kit) for human tissue biopsies.
After QC and alignment to the ENSEMBL genome and gtf (GRCh38 rel 84 from ensembl.org) using STAR (alignment perc. of 75% - 90%), I use featureCounts (in R) to count genes.
I get very different rates of successfully assigned fragments, ranging from 20% to about 60%, with the majority about 45%. Is this normal? Or could something be wrong with (some of) my libraries? Do I need to account for the different assignment proportions?
Many thanks!
Take a look in a genome browser, try to figure out if the reads fall along non coding regions, introns, or perhaps you can track peaks in unexpected regions.
Are you using the correct "stranded" option for TruSeq which should be "reverse stranded"?
Yes, I use strandSpecific=-1. With forward stranded, I got less than 5% in all samples.
Did you use
-s 2
withfeatureCounts
?Yes, or -1 (which seems to be the same thing).
You may have a lot of multi-mapped reads. Those would not be assigned a gene. What is you alignment rate for unique alignments only?
I have 75% - 90% of uniquely mapped reads. % of reads mapped to multiple loci ranges from 15% to about 5% - could this be the reason for the observed low assignment rate? Would you suggest to (i) accept this and go on, or (ii) try to count the multiple loci reads, too (which is possible in featurecounts, if I am not mistaken).
That number of uniquely mapped reads is fine, nothing to worry about. Most likely obvious, but you did align to hg38 and count for the same genome? Could you share the featureCounts command/code you used?
Sure:
I used the gft I gave to STAR when generating its genome -
I also get this warning, but that should be rather normal because of the multi-mapping? I donĀ“t have STAR sort my SAMs - is that ok?
Seems nothing to worry about. Could you tell me, for example for this sample, how many reads you had (before mapping!). It is possible that some reads map each many many times, and as such inflate the 26620478, and lower the percentage. In fact, you want to compare the 'Successfully assigned fragments' to the 'uniquely mapped' reads.
I feel you may be right, it may all be full of multi-mapping reads.
Different from the defaults, I use several options the STAR manual presents as ENCODE standard options (and I therefore found them possibly useful)
--outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --alignIntronMin 20 --alignIntronMax 10000 --alignMatesGapMax 1000000
Possibly, those will add to further inflation?
See below for the logfile for the sample I pasted before (and many thanks for you efforts, btw!)
Based on this, 12283289 were assigned out of 16959018 uniquely mapping reads. That's already better than the fraction you had earlier :)
Many thanks for your insightful comments. I did a little reading on multi-mapping reads. From what I gather, I would rather avoid to count all of them in an RNA Seq diff exp analysis, would you agree?
I guess most software by default ignores multimapping reads so you don't have to worry about then. But you could argue that ignoring those reads is a loss of information. You cannot just combine the reads with the uniquely mapped, but I find the method described in this paper an interesting approach.
You could choose to map them to a random best site (that way you would not lose that information).
ambig=random
is what BBMap uses for this purpose.