Low mapping rate due to rRNA
22 months ago
jgarces ▴ 50

Hi there,

I have a series of RNA-seq samples and, when I align them with STAR (version 2.7.9a), only a small percentage of the reads align...

FastQC shows a high number of overrepresented sequences corresponding to rRNA (determined manually by BLAST) and a weird "Per base sequence content". I have two questions:

  • Summing the counts of the rRNA sequences gives no more than 3M reads, so I would not expect them to affect the overall alignment. [image: duplicated reads]
  • Why is the "Per base sequence content" so "unstable"? Could it be the cause of the low mapping rate? [image: seq content]

Is there any way to improve these mapping rates? Thanks in advance!

EDIT

The protocol used for rRNA depletion is Illumina Stranded Total RNA Prep with Ribo-Zero Plus...

and the STAR command used:

base_name=${filename%_R?_001.cutadapt.fastq.gz}

STAR --runThreadN 20 \
        --outFilterMismatchNmax 3 \
        --alignEndsType Local \
        --outFilterMultimapNmax 10 \
        --outMultimapperOrder Random \
        --genomeDir $reference \
        --readFilesIn ${base_name}_R1_001.cutadapt.fastq.gz ${base_name}_R2_001.cutadapt.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix ${base_name}.star. \
        --outSAMtype BAM SortedByCoordinate \
        --outBAMsortingThreadN 10 \
        --genomeLoad LoadAndKeep \
        --outSAMunmapped Within \
        --limitBAMsortRAM 40000000000
RNA-seq rRNA STAR • 1.3k views
You may want to provide the STAR command you are running so we can double-check it, plus more detail from the STAR results log so we can see whether rRNA contamination could be contributing to this. Also, how was rRNA depleted in your RNA-seq? Was it a ribosomal depletion kit, poly-dT priming, etc.?
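If it helps, here is a minimal sketch for pulling the most relevant lines out of STAR's Log.final.out. The function name star_summary is made up for illustration; the labels it matches are the ones STAR writes in that file. A large "% of reads mapped to too many loci" or "% of reads unmapped: too short" value would be worth reporting, since with --outFilterMultimapNmax 10 any read hitting more than 10 loci is counted as "too many loci" rather than as mapped.

```shell
# Sketch: print read counts and mapping/unmapping percentages from a
# STAR Log.final.out file. star_summary is a hypothetical helper name.
star_summary() {
    awk -F'|' '
        /Number of input reads|Uniquely mapped reads %|% of reads mapped to multiple loci|% of reads mapped to too many loci|% of reads unmapped/ {
            # Strip the padding STAR puts around the label and the value.
            gsub(/^[ \t]+|[ \t]+$/, "", $1)
            gsub(/^[ \t]+|[ \t]+$/, "", $2)
            print $1 ": " $2
        }' "$1"
}
```

With the --outFileNamePrefix shown in the question, the call would be something like star_summary "${base_name}.star.Log.final.out".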

If the above has been checked and confirmed you can get a more accurate idea of rRNA contamination by filtering your reads with a program such as BBDuk.
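A rough sketch of what such a BBDuk run could look like, assuming you have a FASTA file of rRNA sequences (rRNA.fa below is a placeholder name, not a file BBDuk ships) and the read file naming from the question. The wrapper rrna_filter and the output file names are invented for illustration, and k=31 is just an example k-mer length. By default the function only prints the command; set RUN=1 to actually execute it.

```shell
# Sketch: split read pairs into rRNA-matching (outm/outm2) and
# non-matching (out/out2) files with BBDuk, and write per-reference
# hit statistics. rrna_filter is a hypothetical wrapper.
# Note: base and ref are plain (global) shell variables here.
rrna_filter() {
    base=$1   # sample prefix, e.g. "$base_name"
    ref=$2    # ASSUMPTION: FASTA of rRNA sequences that you provide
    set -- bbduk.sh \
        "in=${base}_R1_001.cutadapt.fastq.gz" \
        "in2=${base}_R2_001.cutadapt.fastq.gz" \
        "ref=${ref}" \
        "k=31" \
        "outm=${base}.rRNA_R1.fastq.gz" \
        "outm2=${base}.rRNA_R2.fastq.gz" \
        "out=${base}.clean_R1.fastq.gz" \
        "out2=${base}.clean_R2.fastq.gz" \
        "stats=${base}.bbduk_stats.txt"
    if [ "${RUN:-0}" = 1 ]; then
        "$@"          # run BBDuk for real
    else
        echo "$*"     # dry run: just show the command
    fi
}
```

For example, RUN=1 rrna_filter "$base_name" rRNA.fa. The fraction of reads written to the outm files gives a direct estimate of rRNA content, and the "clean" files can then be aligned with STAR to see whether the mapping rate recovers.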

The per base sequence content looks typical of RNA-seq, so I wouldn't worry about that part.

Thanks for the reply. I've just updated the post (good suggestion).

Also, try running the FastQ Screen tool to check whether it is actually rRNA contamination. ^_^

22 months ago
Trivas ★ 1.7k

One thing to keep in mind is that the overrepresented sequences listed in a FastQC report are only a subset of the overrepresented sequences actually present. The software looks at the first 100k reads, and only the sequences found in those first 100k are tracked through the rest of your FASTQ file. See here. In other words, summing up the "counts" in your FastQC report does not tell the full story, especially when a few of the listed sequences differ from each other by a single nucleotide. Bottom line: your data likely contain many more than 3 million rRNA reads.
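As a quick sanity check, you can count how many reads in the whole file contain a given sequence, independently of FastQC. This is only a sketch: count_seq_reads is a made-up helper, it assumes a standard 4-line-per-record gzipped FASTQ, and it does exact forward-strand substring matching, so reads with a mismatch or in reverse-complement orientation are not counted. Treat the result as a lower bound.

```shell
# Sketch: count reads in a gzipped FASTQ whose sequence line contains
# an exact copy of a query sequence. count_seq_reads is a hypothetical
# helper; it assumes 4 lines per record (no wrapped sequences).
count_seq_reads() {
    query=$1   # e.g. one overrepresented sequence copied from the FastQC report
    fastq=$2   # gzipped FASTQ file
    gzip -dc < "$fastq" | awk -v s="$query" '
        NR % 4 == 2 {                 # line 2 of each record is the sequence
            total++
            if (index($0, s) > 0) hits++
        }
        END {
            pct = (total > 0) ? 100 * hits / total : 0
            printf "%d of %d reads (%.2f%%)\n", hits, total, pct
        }'
}
```

Running it with a shorter core of the rRNA sequence (rather than the full overrepresented sequence) will pick up more of the near-identical variants.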

That's absolutely true, thanks for the reminder!

So I guess that if I align my samples against a reference containing ribosomal sequences, I could "detect" whether these unmapped reads correspond to rRNA. Do you know where I can download a reference with rRNA regions, please?

I followed this guide: Creating ribosomal RNA reference sequence. So far it's worked well enough for me to monitor how my custom rRNA depletion protocol is working.
