Question

Low mapping rate due to rRNA

0

22 months ago

jgarces ▴ 50

Hi there,

I have a series of RNA-seq samples and, when I align them with STAR (version 2.7.9a), only a small percentage of the reads is aligned...

FastQC shows a high number of overrepresented sequences related to rRNA (determined manually by BLAST) and a weird "Per base sequence content". I have two questions:

Summing the counts of the rRNA reads gives no more than 3M reads... so I would not expect them to affect the overall alignment.

Why is the "Per base sequence content" so "unstable"? Could it be the cause of the low mapping rate?

Is there any way to improve these mapping rates? Thanks in advance!

EDIT

The protocol used for rRNA depletion is Illumina Stranded Total RNA Prep with Ribo-Zero Plus...

and the STAR command used:

base_name=${filename%_R?_001.cutadapt.fastq.gz}

STAR --runThreadN 20 \
        --outFilterMismatchNmax 3 \
        --alignEndsType Local \
        --outFilterMultimapNmax 10 \
        --outMultimapperOrder Random \
        --genomeDir $reference \
        --readFilesIn ${base_name}_R1_001.cutadapt.fastq.gz ${base_name}_R2_001.cutadapt.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix ${base_name}.star. \
        --outSAMtype BAM SortedByCoordinate \
        --outBAMsortingThreadN 10 \
        --genomeLoad LoadAndKeep \
        --outSAMunmapped Within \
        --limitBAMsortRAM 40000000000

RNA-seq rRNA STAR • 1.3k views

ADD COMMENT • link updated 22 months ago by Trivas ★ 1.7k • written 22 months ago by jgarces ▴ 50

1

You may want to provide the STAR command you are running so we can double-check it, and more detail from the STAR results log so we can check whether rRNA contamination could be contributing to this. Also, how was rRNA depleted in your RNA-seq? Was it a ribosomal depletion kit, poly-dT priming, etc.?

If the above has been checked and confirmed you can get a more accurate idea of rRNA contamination by filtering your reads with a program such as BBDuk.

The per base sequence content looks typical of RNA-seq, so I wouldn't worry about that part.
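For the "more detail from the STAR results log" part, the relevant numbers are in the `Log.final.out` file that STAR writes next to the BAM (with the prefix from the question it would be `${base_name}.star.Log.final.out`). A minimal sketch for pulling out just the mapping percentages follows; `star_summary` is a hypothetical helper name, not part of STAR. A large "mapped to too many loci" share would be consistent with repetitive sequences such as rRNA, because reads exceeding `--outFilterMultimapNmax` are counted as unmapped.

```shell
# Hypothetical helper (not part of STAR): print the mapping-related
# lines of a STAR Log.final.out file in a compact "label: value" form.
star_summary() {
    # $1: path to a STAR Log.final.out file
    grep -E 'Number of input reads|Uniquely mapped reads %|% of reads mapped to multiple loci|% of reads mapped to too many loci|% of reads unmapped' "$1" \
        | sed -E 's/^[[:space:]]+//; s/[[:space:]]*\|[[:space:]]*/: /'
}

# Usage (the sample name is a placeholder):
# star_summary sample.star.Log.final.out
```

If most of the unmapped reads fall under "too short" instead, that points to reads that only align over a small fraction of their length, which is a different problem from multi-mapping.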

ADD REPLY • link 22 months ago by rpolicastro 13k

0

Thanks for the reply. I've just updated the post (good suggestion).

ADD REPLY • link 22 months ago by jgarces ▴ 50

1

Also, try running FastQ Screen to check whether it is actually rRNA contamination. ^_^

ADD REPLY • link 22 months ago by kanika.151 ▴ 130

score 2 · Accepted Answer · 2022-06-07

2

22 months ago

Trivas ★ 1.7k

One thing to keep in mind is that the overrepresented sequences in a FastQC report are only a subset of all the overrepresented sequences in the file. FastQC looks at the first 100k reads, and only the sequences seen in those first 100k reads are tracked through the rest of your FASTQ file. See here. In other words, summing up the "counts" in your FastQC report does not tell the full story, especially when a few of the sequences differ from each other by a single nucleotide. Bottom line: your data likely contain many more than 3 million rRNA reads.
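One rough way to sanity-check this without relying on the FastQC counts is to scan the whole FASTQ for a candidate sequence yourself. The sketch below assumes standard 4-line FASTQ records (no wrapped sequence lines); `count_seq_reads` is a hypothetical helper name. It only finds exact matches in the forward orientation, so using a short core substring of the overrepresented sequence catches the near-identical variants better than the full-length sequence does.

```shell
# Hypothetical helper: count reads in a gzipped FASTQ that contain a given sequence.
# Assumes 4-line FASTQ records; exact, forward-strand matches only.
count_seq_reads() {
    # $1: gzipped FASTQ file
    # $2: sequence to search for (e.g. part of an overrepresented sequence from FastQC)
    gzip -dc "$1" | awk -v seq="$2" '
        NR % 4 == 2 {                       # sequence lines: 2nd line of every record
            total++
            if (index($0, seq) > 0) hits++  # substring match anywhere in the read
        }
        END {
            printf "%d of %d reads (%.2f%%) contain the sequence\n",
                   hits, total, (total ? 100 * hits / total : 0)
        }'
}

# Usage (file name and sequence are placeholders):
# count_seq_reads sample_R1_001.cutadapt.fastq.gz ACGTACGTACGTACGTACGT
```

This still undercounts, since it checks one sequence at a time and ignores the reverse complement, but it scans every read rather than only the sequences first seen in the first 100k.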

ADD COMMENT • link 22 months ago by Trivas ★ 1.7k

0

That's absolutely true, thanks for the reminder!

So I guess that if I align my samples against a reference containing ribosomal sequences, I could "detect" whether these unmapped reads correspond to rRNA. Do you know where I can download a reference with rRNA regions, please?

ADD REPLY • link 22 months ago by jgarces ▴ 50

0

I followed this guide: Creating ribosomal RNA reference sequence. So far it has worked well enough for me to monitor how well my custom rRNA depletion protocol is performing.

ADD REPLY • link 22 months ago by Trivas ★ 1.7k