Could high duplication be from ribosomal RNA in RNA-seq samples?
5.3 years ago
mmrcksn ▴ 50


I have some paired-end RNA-seq data. My samples were pretty low concentration (~1ng) total RNA from an isolated cell type. For library prep, we did a poly-A capture to select mRNA.

The FastQC reports show very high duplication (in the worst samples, only about 2% of reads would remain after deduplication).
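For anyone who wants to sanity-check a number like that outside of FastQC, here is a rough sketch. It counts exact duplicate sequences in the first reads of a FASTQ file and prints the percentage that would remain after collapsing them. The function name `dedup_percent` and the default of 100000 reads are my own assumptions, and this is not FastQC's exact method (FastQC samples and truncates reads), so expect the figures to differ somewhat. It also assumes standard 4-line FASTQ records.

```shell
# Hypothetical helper (not part of FastQC): percent of reads left after
# collapsing exact duplicate sequences.
# Usage: dedup_percent sample.fastq [max_reads]
dedup_percent() {
    fastq="$1"
    max_reads="${2:-100000}"   # assumed default; adjust as needed
    awk -v max="$max_reads" '
        NR % 4 == 2 {                      # sequence line of each 4-line record
            total++
            if (!(seen[$0]++)) unique++    # first time this sequence is seen
            if (total >= max) exit         # stop after max reads (END still runs)
        }
        END {
            if (total == 0) { print "no reads found"; exit 1 }
            printf "%.1f\n", 100 * unique / total
        }
    ' "$fastq"
}
```

Something like `dedup_percent sample.fastq` would then print a single percentage for that file.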

I ran this command to look at some of the most dominant sequences in my fastqs:

grep -A 1 '^@K00179' sample.fastq | head -1000000 | grep -v '^@' | grep -v '^-' | sort | uniq -w 30 -c | sort -n -r | head -100 >> domseqs.100
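The pipeline above depends on the instrument-specific header prefix (@K00179) and on `uniq -w`, which is a GNU extension. A more portable sketch of the same idea picks sequence lines by position and counts read prefixes with awk. The function name `top_prefixes` and its defaults are assumptions for illustration; 333333 reads is roughly what `head -1000000` keeps from the `grep -A 1` output (header, sequence and a `--` separator per read). Unlike `uniq -w 30 -c`, it prints only the prefix, not a full example read, and it assumes 4-line FASTQ records.

```shell
# Hypothetical helper: most common read prefixes in a FASTQ file.
# Usage: top_prefixes sample.fastq [prefix_len] [n_top] [max_reads]
top_prefixes() {
    fastq="$1"
    prefix_len="${2:-30}"      # same width as `uniq -w 30`
    n_top="${3:-100}"          # same as the final `head -100`
    max_reads="${4:-333333}"   # assumed default, see note above
    awk -v len="$prefix_len" -v max="$max_reads" '
        NR % 4 == 2 {                         # sequence line of each record
            count[substr($0, 1, len)]++       # tally by leading bases
            if (++n >= max) exit              # stop early (END still runs)
        }
        END {
            for (p in count) print count[p] "\t" p
        }
    ' "$fastq" | sort -k1,1nr -k2,2 | head -n "$n_top"
}
```

For example, `top_prefixes sample.fastq > domseqs.100` would write count and prefix pairs, most frequent first, ready to paste into BLAST.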

and found many sequences that, when I searched with BLAST, match with stuff like this:

"Mus musculus clone contig 6 chromocenter region genomic sequence"

These chromocenter sequences also seem to match with rRNA, as further in the results there are things like: "Mus musculus 45S pre-ribosomal RNA (Rn45s), ribosomal RNA", "Mus musculus 28S ribosomal RNA (Rn28s1), ribosomal RNA"

Is it possible that even with polyA capture, rRNA slipped in? What exactly does it mean that I have a bunch of these "contig # chromocenter region genomic sequence" in my RNA-seq data?

RNA-Seq rRNA sequencing duplication • 1.9k views

Yes. Poly-A capture only enriches for poly-A sequences: it doesn't get rid of all rRNAs and other non-poly-A RNAs. As you probably know, > 90% of the transcriptome is composed of rRNAs, so if you end up with 20% rRNAs or so after poly-A enrichment or ribodepletion, it's already quite an improvement.


Thanks for your reply! I am pretty new to all this so I just wanted to make sure.

