Hello,
I have some paired-end RNA-seq data. My samples were low-input (~1 ng of total RNA) from an isolated cell type. For library prep, we did a poly-A capture to select mRNA.
The FastQC reports show very high duplication (in some samples only 2% of reads would remain after deduplication).
I ran this command to look at some of the most dominant sequences in my FASTQ files:
grep -A 1 '@K00179' sample.fastq | head -1000000 | grep -v '^@' | grep -v '^-' | sort | uniq -w 30 -c | sort -n -r | head -100 >> domseqs.100
and found many sequences that, when I searched with BLAST, match with stuff like this:
"Mus musculus clone contig 6 chromocenter region genomic sequence"
These chromocenter sequences also seem to match with rRNA, as further in the results there are things like: "Mus musculus 45S pre-ribosomal RNA (Rn45s), ribosomal RNA", "Mus musculus 28S ribosomal RNA (Rn28s1), ribosomal RNA"
Is it possible that even with polyA capture, rRNA slipped in? What exactly does it mean that I have a bunch of these "contig # chromocenter region genomic sequence" in my RNA-seq data?
Yes, poly-A capture only enriches for polyadenylated transcripts; it does not get rid of all rRNAs and other non-poly-A RNAs. As you probably know, > 90% of the transcriptome is composed of rRNAs, so if you end up with 20% rRNAs or so after poly-A enrichment or ribodepletion, it's already quite an improvement.
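If you want to re-check the dominant sequences, here is a rough sketch of a slightly more robust variant of the command in the question. It picks sequence lines by their position in each 4-line FASTQ record rather than by the instrument ID in the header, and it counts on the first 30 bases only. The function name top_prefixes and its default values are my own placeholders, and it assumes an uncompressed FASTQ with unwrapped sequences (exactly 4 lines per read).

```shell
# Hypothetical helper: count the most frequent read prefixes in a FASTQ file.
# Usage: top_prefixes FILE [N_READS] [PREFIX_LEN] [TOP]
# Assumes exactly 4 lines per record (no wrapped sequence lines).
top_prefixes() (
  fastq=$1
  n_reads=${2:-250000}    # number of reads to sample from the top of the file
  prefix_len=${3:-30}     # number of leading bases to compare
  top=${4:-100}           # number of distinct prefixes to report

  head -n "$((n_reads * 4))" "$fastq" \
    | awk -v w="$prefix_len" 'NR % 4 == 2 { print substr($0, 1, w) }' \
    | sort \
    | uniq -c \
    | sort -k1,1nr \
    | head -n "$top"
)
```

For example, top_prefixes sample.fastq 250000 30 100 > domseqs.100 writes the 100 most common 30-base prefixes with their counts. Unlike uniq -w 30, which prints the first full read of each group, this prints only the prefix itself, so trim or extend prefix_len depending on how much sequence you want to paste into BLAST.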
Thanks for your reply! I am pretty new to all this so I just wanted to make sure.