Reference genome for RNA-Seq reads decontamination
3.5 years ago

We are a small group of undergrads, mostly sophomores, from a small HBCU. We are learning bioinformatics and genomics, which are not part of our regular syllabus, by trying to teach one another :) And because of SARS-CoV-2, we have a little more time on our hands.

Currently we are trying to learn the theory and practice of decontaminating RNA-Seq reads, so that we can map the cleaned reads to the genome of our plant species of interest. We tried FastQ_Screen, and for one test case the majority of reads were reported as unknown, but quite a few mapped to mouse and rat.

Syntax was:

fastq_screen --conf fastq_screen.conf --force --quiet --subset 0 $FASTQ_Input

The config file pointed to the following reference sequences, each indexed for use by the underlying Bowtie2:

## Adapters - sequence derived from the FastQC contaminants file found at:
## Ecoli- sequence available from EMBL accession U00096.2
## Vectors - Sequence taken from the UniVec database
## Lambda
## Mitochondrion
## PhiX - sequence available from Refseq accession NC_001422.1
## rRNA
## Human - sequences available from ##
## Mouse - sequence available from ##
## Rat

Our questions are these:

Question 1. When decontaminating, is it essential to include the genome of our species of interest in addition to the references being checked against (adapters, PhiX, rat, mouse, human, bacterial, etc.)? If a read maps BEST to our genome of interest, then it shouldn't matter if it also maps to those other references: it would be considered a true read and not contamination, right? Or is the answer "complicated" or "it depends"? :)

Question 2. I want to decontaminate my RNA-Seq reads and ultimately map them to plant genome P, and I suspect contamination from mouse and rat (genomes M and R, respectively). Since these are all eukaryotes, a small but non-zero fraction of sequence would be shared among all 3 genomes, right? If so, do I need to run FastQ_Screen on this modified genome collection instead:

a. M - P (mouse, but without genomic regions also found in my plant species)
b. R - P (rat, but without genomic regions also found in my plant species)
c. P (full genome of plant species of interest)

If that is indeed the case, what is the bioinformatic protocol for generating the subtracted genome sequence M - P, given the M and P genome assemblies?

Thanks, in advance, and we wish you all to be safe from SARS-CoV-2!!

FastQ_Screen RNA-Seq • 1.5k views

You should definitely keep your genome of interest. Otherwise, you don't know if all those contaminants are just homologous reads (which basically leads to your Question 2).
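As a rough sketch of what "keeping your genome of interest" could look like in practice: the snippet below just builds the `DATABASE` lines for a FastQ Screen config, assuming the usual tab-separated layout (label, then Bowtie2 index basename). The `Plant` label and every path here are hypothetical placeholders, not anything taken from the original config.

```python
def database_line(name, index_basename):
    """Format one FastQ Screen DATABASE entry (tab-separated fields)."""
    return f"DATABASE\t{name}\t{index_basename}"


# Hypothetical Bowtie2 index basenames; replace with your own locations.
references = [
    ("Plant", "/data/indexes/plant/plant_genome"),  # genome of interest
    ("Mouse", "/data/indexes/mouse/mouse_genome"),
    ("Rat", "/data/indexes/rat/rat_genome"),
]

config_text = "\n".join(database_line(n, p) for n, p in references) + "\n"
print(config_text, end="")
```

With the plant genome listed next to mouse and rat, reads that hit both show up in the "multiple genomes" columns instead of looking like pure contaminants.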

Are you sure you want to decontaminate? If you are working with a plant species, you probably wouldn't have mouse or rat contamination. Most people never decontaminate. If you are just learning RNA-seq, this may not be something that you should worry about yet.


Thanks a lot for your quick reply.

I suppose that judging a read that maps to both P and M, for example, as a real read from P rather than from M is in effect the same as this protocol:

  1. computing M - P, from P and M,
  2. mapping to P versus M - P,
  3. choosing only those reads that map to P, but not to M - P?
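The three steps above can be sketched in a few lines, purely as a toy illustration. It assumes you already have, from some genome-to-genome alignment not shown here, a list of intervals in M that are also found in P; "subtracting" is done by masking those intervals with N. The function names, the example sequence and the read IDs are all hypothetical.

```python
def mask_intervals(sequence, intervals):
    """Return sequence with each half-open [start, end) interval set to 'N'."""
    bases = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so stray coordinates cannot fail.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)


def reads_to_keep(hits_p, hits_m_minus_p):
    """Step 3: reads that map to P but not to the masked genome M - P."""
    return set(hits_p) - set(hits_m_minus_p)


# Step 1: toy mouse contig, with two regions assumed to be shared with P.
mouse_contig = "ACGTACGTACGT"
shared_with_plant = [(2, 5), (8, 10)]
masked = mask_intervals(mouse_contig, shared_with_plant)
print(masked)  # ACNNNCGTNNGT

# Steps 2 and 3: read IDs that aligned to each reference (made-up IDs).
kept = reads_to_keep({"r1", "r2", "r3"}, {"r3", "r4"})
print(sorted(kept))  # ['r1', 'r2']
```

Note that read `r3`, which hits both P and a mouse-specific region, is dropped here; whether that is the right call for such reads is a judgment the sketch does not settle.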

About your recommendation to skip decontamination altogether: if you look at the screenshot of the FastQ_Screen results for our test file (below), there seems to be quite a bit of human, mouse and rat contamination, right?

Unless ALL of this is mapping to genomic regions that are also found in the reference genome P of our plant species of interest, we think decontamination would improve downstream results, yes?

We don't know if this is the case, since we have not included plant genome P in this step so far.


Your thoughts, please? Thanks, in advance.


All I see is that you have no notable contamination. I vote for not overcomplicating things. Align your reads first to your reference genome and if this looks bad (like many reads unmapped) then start troubleshooting. Most likely your mapping will be fine and you do not even need to bother with decontamination. Why do you think that you even have contamination?


"All I see is that you have no notable contamination."

To clarify, none of those achieve even 1% mapping rate.
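To make that check concrete, here is a minimal sketch that turns per-genome mapped read counts into percentages and flags anything at or above a cutoff. The 1% cutoff simply mirrors the remark above and is not an official threshold; the counts are invented for illustration and are not taken from the screenshot.

```python
def flag_contaminants(total_reads, mapped_counts, threshold_pct=1.0):
    """Return {genome: percent mapped} for genomes at or above the cutoff."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    flagged = {}
    for genome, count in mapped_counts.items():
        pct = 100.0 * count / total_reads
        if pct >= threshold_pct:
            flagged[genome] = round(pct, 2)
    return flagged


# Invented counts for one library of 100,000 reads.
clean_library = {"Human": 450, "Mouse": 300, "Rat": 200}
suspect_library = {"Human": 450, "Mouse": 300, "Rat": 2500}

print(flag_contaminants(100_000, clean_library))    # {}
print(flag_contaminants(100_000, suspect_library))  # {'Rat': 2.5}
```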


We have ~50 RNA-Seq libraries, and we want a protocol that we do not have to revisit and revise, with steps that stay the same across all 50 libraries.

Our mentor looked at this dataset and said only that a few of the 50 libraries have contamination (he didn't give more details). He told us to come up with a pipeline that properly processes even the worst library in our dataset, but without adversely affecting the good ones (through excessive filtering) at the time of genome mapping to obtain gene counts.
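One way to keep the screening step identical across libraries is to generate the same command for every file, as in this small sketch. It reuses only the flags from the command shown earlier in the thread; the `library_NN.fastq.gz` file names are hypothetical, and nothing is executed here.

```python
from pathlib import Path


def screen_command(fastq_path, conf="fastq_screen.conf"):
    """Build the argument list for one FastQ Screen run (not executed here)."""
    return [
        "fastq_screen",
        "--conf", conf,
        "--force",
        "--quiet",
        "--subset", "0",
        str(fastq_path),
    ]


# Hypothetical file names for the ~50 libraries.
libraries = [Path(f"library_{i:02d}.fastq.gz") for i in range(1, 51)]
commands = [screen_command(path) for path in libraries]

print(len(commands))
print(" ".join(commands[0]))
```

Each list could then be handed to `subprocess.run(cmd, check=True)` on a machine where FastQ Screen is installed.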

In your comment above, I suppose you are indicating that the bottom 3 lines (Human, Mouse, Rat) are quite typical for a plant RNA-Seq library of reasonably OK quality, and that we should not be concerned about these relatively low counts, correct?

Regardless, can you please explain what these column headers mean?

  • One hit / one genome
  • Multiple hits / one genome
  • One hit / multiple genomes
  • Multiple hits / multiple genomes

Thank you!


There is a video explaining the FastQ Screen results:

They show examples of good and contaminated data in the video.
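For the four column headers asked about above, here is a small sketch of how I read them: "hits" refers to how many times a read aligns within the genome in question, and "genomes" to whether the same read also aligns to any other screened genome. This is my interpretation and not FastQ Screen's own code; the function and the example hit counts are hypothetical.

```python
def classify_read(hits_per_genome, genome):
    """Name the column a read falls into for `genome`, or None if no hit.

    hits_per_genome maps genome name -> number of alignments of this read.
    """
    own_hits = hits_per_genome.get(genome, 0)
    if own_hits == 0:
        return None
    hits_elsewhere = any(
        count > 0 for name, count in hits_per_genome.items() if name != genome
    )
    hit_part = "One hit" if own_hits == 1 else "Multiple hits"
    genome_part = "multiple genomes" if hits_elsewhere else "one genome"
    return f"{hit_part} / {genome_part}"


print(classify_read({"Mouse": 1}, "Mouse"))            # One hit / one genome
print(classify_read({"Mouse": 3}, "Mouse"))            # Multiple hits / one genome
print(classify_read({"Mouse": 1, "Rat": 2}, "Mouse"))  # One hit / multiple genomes
print(classify_read({"Mouse": 1, "Rat": 2}, "Rat"))    # Multiple hits / multiple genomes
```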


Thank you so much for this YouTube link; it was easy for us to understand. Along with this related video, we understand enough to proceed with the downstream processing steps now. Thank you igor, thank you ATpoint.

