We are a small group of undergrads, mostly sophomores, from a small HBCU. Bioinformatics and genomics are not part of our regular syllabus, so we are learning them by trying to teach one another :) And because of SARS-CoV-2, we have a little more time on our hands.
Currently we are trying to understand the theory and practice of decontaminating RNA-Seq reads, so that we can map the cleaned reads to the genome of our plant species of interest. We tried FastQ_Screen, and for one test case the majority of reads mapped to none of the references (unknown), but quite a few mapped to mouse and rat.
fastq_screen --conf fastq_screen.conf --force --quiet --subset 0 $FASTQ_Input
The config file pointed to the following reference sequences, each indexed for use by the underlying Bowtie2:
## Adapters - sequence derived from the FastQC contaminats file found at: www.bioinformatics.babraham.ac.uk/projects/fastqc
## Ecoli- sequence available from EMBL accession U00096.2
## Vectors - Sequence taken from the UniVec database
## Lambda
## Mitochondrion
## PhiX - sequence available from Refseq accession NC_001422.1
## rRNA
## Human - sequences available from
## ftp://ftp.ensembl.org/pub/current/fasta/homo_sapiens/dna/
## Mouse - sequence available from
## ftp://ftp.ensembl.org/pub/current/fasta/mus_musculus/dna/
## Rat
Our questions are these:
Question 1. When decontaminating, is it essential to include the genome of our species of interest, in addition to the references being checked against (adapters, PhiX, rat, mouse, human, bacterial, etc.)? If a read maps BEST to our genome of interest, then it shouldn't matter if it also maps to those other references; it would be considered a true read and not contamination, right? Or is the answer to that "complicated" or "it depends"? :)
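To make the "maps BEST" rule we have in mind concrete, here is a minimal Python sketch. It assumes we already have, for each read, one best alignment score per reference (for example something like the AS:i tag that Bowtie2 writes, where a higher score is better). The function name, the labels, and the example scores are all made up for illustration; this is not how FastQ_Screen itself decides.

```python
from typing import Dict, Optional


def assign_read(scores: Dict[str, Optional[int]], target: str = "plant") -> str:
    """Classify one read from its best alignment score per reference.

    `scores` maps a reference name to the read's best alignment score
    against that reference, or None if the read did not align there.
    Higher scores are better. All names here are hypothetical.
    """
    # Keep only the references the read actually aligned to.
    hits = {ref: s for ref, s in scores.items() if s is not None}
    if not hits:
        return "no_hit"

    best = max(hits.values())
    winners = [ref for ref, s in hits.items() if s == best]

    if winners == [target]:
        return "keep"         # target genome is the unique best hit
    if target in winners:
        return "ambiguous"    # target ties with at least one other reference
    return "contaminant"      # some other reference beats the target


# Made-up scores for four reads.
examples = {
    "read1": {"plant": -2, "mouse": -30, "rat": None},
    "read2": {"plant": -10, "mouse": -10, "rat": None},
    "read3": {"plant": -40, "mouse": -3, "rat": -5},
    "read4": {"plant": None, "mouse": None, "rat": None},
}

for name, read_scores in examples.items():
    print(name, assign_read(read_scores))
```

The "ambiguous" case (a tie between the plant and another reference) is exactly where we are unsure what the right call is.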
Question 2. I want to decontaminate my RNA-Seq reads and ultimately map them to plant genome P. I suspect contamination from the mouse and rat genomes, M and R respectively. Since these are all eukaryotes, a small but non-zero fraction of sequence would be shared among all 3 genomes, right? If so, do I need to run FastQ_Screen on this modified genome collection instead:
a. M - P (mouse, but without genomic regions also found in my plant species)
b. R - P (rat, but without genomic regions also found in my plant species)
c. P (full genome of plant species of interest)
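To show what we mean by "M - P", here is a toy Python sketch of just the set arithmetic. It assumes we somehow already had a list of mouse coordinates that are similar to the plant genome (for example from a genome-to-genome alignment, which we have not run). The function name and the coordinates are hypothetical, and this is not meant as the actual protocol.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based


def subtract_intervals(length: int, shared: List[Interval]) -> List[Interval]:
    """Return the parts of [0, length) not covered by any `shared` interval.

    `length` is the length of one mouse sequence; `shared` holds the
    regions of it that also occur in the plant genome (assumed known).
    """
    kept: List[Interval] = []
    pos = 0  # everything before `pos` has already been handled
    for start, end in sorted(shared):
        # Clamp each interval to the sequence bounds.
        start = min(max(start, 0), length)
        end = min(max(end, 0), length)
        if start > pos:
            kept.append((pos, start))  # gap before this shared region
        pos = max(pos, end)
    if pos < length:
        kept.append((pos, length))     # tail after the last shared region
    return kept


# Toy example: a 100 bp "mouse" sequence with two overlapping shared
# regions and one that runs past the end of the sequence.
print(subtract_intervals(100, [(10, 20), (15, 30), (90, 120)]))
```

The regions returned would be the "M - P" part of that sequence; the shared regions would be removed or masked.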
If that is indeed the case, what is the bioinformatic protocol to generate the subtracted M - P genome sequence, given the M and P genome assemblies?
Thanks, in advance, and we wish you all to be safe from SARS-CoV-2!!