Question: Reason for low mapping rate for murine RNA-seq data with splice-aware aligner?
4.2 years ago, khandarius wrote:

Hello! I have 86 bp single-end reads from an Illumina NextSeq 500. Library preparation was carried out with the TruSeq stranded total RNA (Ribo-Zero) kit on RNA extracted from mouse embryos. I've mapped the reads to the mm10 reference genome (chr1-19, chrX, chrY, chrM) with the subjunc junction-mapping aligner from the Rsubread software package (default settings). The mapping rate is only ~50% with either raw or quality-trimmed reads. I'd be glad to hear your ideas as to why.


You are welcome to inspect the FastQC report of my raw reads yourself.


The report is what one would expect from Illumina sequencing, I think. The slightly overrepresented sequences (1.4% in total) are small nuclear RNAs according to a BLAST search. I tried fastq_quality_trimmer from the FASTX-Toolkit to trim 3' bases (quality threshold set to 20). According to the FastQC report, some of the bases in the middle of the reads have poorer quality (<20 at the lower whiskers of the boxplots). Could this be affecting the mapping? Should I use more stringent trimming, or even filter reads based on overall sequence quality?


Thanks in advance.


UPDATE: I got an 80% unique mapping rate with STAR 2.4. It seems the mismatch rate of my samples is a bit on the high side (2.4% per base according to the STAR output). According to samstat, 23% of my uniquely mapped reads have at least 4 mismatches. Rsubread and TopHat2 seem to be more conservative regarding mismatches. I've yet to try BBMap.
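
For anyone who wants to check a figure like "23% of mapped reads have at least 4 mismatches" without samstat, here is a minimal Python sketch. It assumes the SAM records carry the standard `NM:i` tag (edit distance, which counts indel bases as well as mismatches, so it is only an approximation of a pure mismatch count); not every aligner emits that tag by default. The function names are made up for this example.

```python
def edit_distance(sam_line):
    """Return the NM:i value of one SAM alignment line, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def fraction_with_min_mismatches(sam_lines, threshold=4):
    """Fraction of mapped reads whose NM value is >= threshold."""
    total = 0
    hits = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        if int(fields[1]) & 0x4:  # FLAG bit 0x4 means the read is unmapped
            continue
        nm = edit_distance(line)
        if nm is None:  # no NM tag, so the read cannot be counted
            continue
        total += 1
        if nm >= threshold:
            hits += 1
    return hits / total if total else 0.0
```

Usage would be something like `fraction_with_min_mismatches(open("mapped.sam"))`, where `mapped.sam` is a placeholder file name.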

Reply written 4.1 years ago by khandarius
4.2 years ago, Brian Bushnell (Walnut Creek, USA) wrote:

I suggest you adapter-trim the reads, but do not quality-trim them (especially not to Q20, which is too high), then try mapping with BBMap, which is very tolerant of read errors. Include the flags "maxindel=100k intronlen=10", like this: bbmap.sh in=reads.fq ref=mm10.fa out=mapped.sam outu=unmapped.fq maxindel=100k intronlen=10

You should get over 95% of reads mapping.  If you don't, try BLASTing some of the reads in the unmapped file; they could be contamination from another organism with similar GC, like human.
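
To pick reads for BLAST, a random sample of the unmapped file is likely more representative than the first few records. Below is a minimal Python sketch that reservoir-samples reads from a FASTQ file and formats them as FASTA for pasting into BLAST. It assumes plain four-line FASTQ records; the function names are made up for this example.

```python
import random


def read_fastq(lines):
    """Yield (name, sequence) pairs from an iterable of FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        if not header.strip():  # skip stray blank lines
            continue
        seq = next(it, "").rstrip("\n")
        next(it, None)  # '+' separator line
        next(it, None)  # quality line
        yield header.rstrip("\n")[1:], seq  # drop the leading '@'


def sample_reads(records, k, seed=0):
    """Reservoir-sample up to k records uniformly, in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

For example, `to_fasta(sample_reads(read_fastq(open("unmapped.fq")), 20))` would give 20 randomly chosen unmapped reads in FASTA format.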

P.S.  It's also worthwhile trimming the last 1bp of all the reads, regardless of quality, before doing adapter-trimming or mapping.  That base is extremely inaccurate on NextSeq.
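
If your trimming tool has no option for this, removing the last base is simple to script. This is a minimal Python sketch, assuming plain four-line FASTQ records (no wrapped sequence lines); `trim_last_base` is a name made up for this example.

```python
def trim_last_base(lines, n=1):
    """Yield FASTQ lines with the last n bases (and their qualities) removed from each read."""
    for i, line in enumerate(lines):
        text = line.rstrip("\n")
        # Lines 2 and 4 of each record hold the sequence and the quality string.
        if i % 4 in (1, 3) and n > 0:
            text = text[:-n]
        yield text + "\n"
```

For example, `out.writelines(trim_last_base(open("reads.fq")))` would write the trimmed reads to an already opened output file `out`.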


Adapter trimming affected only a minuscule proportion of the reads (<0.1%). I'll check out BBMap. I BLASTed the first 10 and last 10 sequences from the FASTQ files: some sequences didn't match anything in the database, but those that did matched mouse sequences. I'll try mapping my reads against different genomes to find out whether there is contamination.

Reply written 4.2 years ago by khandarius
4.2 years ago, Gary (Taiwan/Taichung/China Medical University Hospital) wrote:

We had a similar experience with some RNA-Seq samples, and eventually found that many of the reads were contamination from yeast. I hope your samples don't have the same problem.


I'll try mapping my reads to the yeast genome to see if this is the case. I checked for rRNA contamination; at least that's not it.

Reply written 4.2 years ago by khandarius