Reason for low mapping rate for murine RNA-seq data with splice-aware aligner?
8.9 years ago
khandarius • 0

Hello! I have 86 bp single-end reads from an Illumina NextSeq 500. Library preparation was carried out with the TruSeq stranded total RNA (Ribo-Zero) kit on RNA extracted from mouse embryos. I've mapped the reads to the mm10 reference genome (chr1-19, chrX, chrY, chrM) with the subjunc junction-mapping aligner from the Rsubread software package (default settings). The mapping rate is only ~50% with either raw or quality-trimmed reads. I'd be glad to hear your ideas as to why.

The FastQC report for my raw reads is here, if you want to take a look: https://drive.google.com/file/d/0B0NZ5u2nKR2qeG14Q25WSXFXNjQ/view?usp=sharing

The report is what one would expect from Illumina sequencing, I think. The slightly over-represented sequences (1.4% in total) are small nuclear RNAs according to a BLAST search. I tried fastq_quality_trimmer from the FASTX-Toolkit to trim 3' bases (quality threshold set to 20). According to the FastQC report, some of the bases in the middle of the reads are of poorer quality (<20, lower whiskers of the boxplots). Could this be affecting the mapping? Should I use more stringent trimming, or even filtering based on overall sequence quality?

Thanks in advance.

alignment RNA-Seq • 4.8k views

UPDATE: I got 80% unique mapping rate with STAR 2.4. It seems the mismatch rate of my samples is a bit on the high side (2.4% per base according to STAR output). According to samstat 23% of my uniquely mapped reads have at least 4 mismatches. Rsubread and TopHat2 are more conservative regarding mismatches, it would seem. I've yet to try BBMap.
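A rough cross-check of that samstat figure can be done straight from the SAM file. This is only a sketch: it assumes the aligner was asked to write the standard NM:i tag, it counts all mapped records rather than only uniquely mapped ones, and NM is edit distance (mismatches plus inserted and deleted bases), so it slightly overstates pure mismatches. The function name count_high_nm is made up for illustration.

```shell
# Hypothetical helper: count mapped SAM records whose NM:i (edit distance)
# is at least MIN. Prints "<records_with_NM>=MIN><TAB><mapped_records>".
# Usage: count_high_nm 4 < mapped.sam
count_high_nm() {
    awk -F'\t' -v min="$1" '
        /^@/      { next }              # skip SAM header lines
        $3 == "*" { next }              # skip unmapped records (no reference name)
        {
            total++
            # optional tags start at column 12; find NM:i:<value>
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^NM:i:/) {
                    if (substr($i, 6) + 0 >= min) high++
                    break
                }
            }
        }
        END { printf "%d\t%d\n", high, total }
    '
}
```

Dividing the first number by the second gives the fraction of mapped reads at or above the threshold.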

8.9 years ago

I suggest you adapter-trim the reads, but do not quality-trim them (especially not to Q20, which is too high), then try mapping with BBMap, which is very tolerant of read errors. Include the flags maxindel=100k intronlen=10, like this:

bbmap.sh in=reads.fq ref=mm10.fa out=mapped.sam outu=unmapped.fq maxindel=100k intronlen=10

You should get over 95% of reads mapping. If you don't, try BLASTing some of the reads in the unmapped file; they could be contamination from another organism with similar GC, like human.
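To BLAST a sample that is spread across the whole unmapped file rather than taken only from its start or end, something like the following could work. It is a sketch that assumes strict 4-line FASTQ records; the function name fastq_sample_to_fasta and the step size are illustrative, not part of any toolkit.

```shell
# Hypothetical helper: emit every STEP-th read of a 4-line-per-record FASTQ
# as FASTA (starting with the first read), ready to paste into BLAST.
# Usage: fastq_sample_to_fasta 10000 < unmapped.fq > sample.fa
fastq_sample_to_fasta() {
    awk -v step="${1:-1000}" '
        NR % 4 == 1 {                          # FASTQ header line
            rec  = (NR + 3) / 4                # 1-based record number
            keep = ((rec - 1) % step == 0)     # keep records 1, 1+step, 1+2*step, ...
            if (keep) print ">" substr($0, 2)  # swap leading "@" for ">"
        }
        NR % 4 == 2 && keep { print }          # sequence line of a kept record
    '
}
```

Choosing the step as roughly (number of unmapped reads) / (number of reads you want to BLAST) gives an even sample without needing a random number generator.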

P.S. It's also worthwhile trimming the last 1 bp of all the reads, regardless of quality, before doing adapter-trimming or mapping. That base is extremely inaccurate on NextSeq.
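If no trimming tool is at hand, the last-base trim can be sketched in plain shell. This assumes strict 4-line FASTQ records with no line wrapping; the function name trim_last_base is made up for illustration.

```shell
# Hypothetical helper: drop the final base (and its quality score) from every
# read of a 4-line-per-record FASTQ.
# Usage: trim_last_base < reads.fq > reads.trimmed.fq
trim_last_base() {
    awk '
        NR % 2 == 0 {                               # lines 2 and 4: sequence and quality
            print substr($0, 1, length($0) - 1)     # remove the last character
            next
        }
        { print }                                   # header and "+" lines unchanged
    '
}
```

Sequence and quality lines are shortened together, so the two stay the same length and the output remains valid FASTQ.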


Adapter trimming affected only a minuscule proportion of the reads (<0.1%). I'll check out BBMap. I BLASTed the first 10 and last 10 sequences from the FASTQ files: some sequences didn't match anything in the database, but those that did matched mouse sequences. I'll try mapping my reads against different genomes to find out whether there is contamination.

8.9 years ago
Gary ▴ 480

We had a similar experience with some RNA-Seq samples, and eventually found that many of the sequences were contaminating reads from yeast. I hope your samples don't have the same problem we had.


I'll try mapping my reads to the yeast genome to see if this is the case. I checked for rRNA contamination; at least that's not it.
