12 weeks ago
Abdul
Hi,
I am working with an Illumina paired-end (PE, 150 bp) RNA-Seq dataset. Is there a way to identify contamination in the data (such as mitochondrial DNA or any other type of contamination) and remove it?
Can it be removed before or after alignment, or by filtering the reads?
Best Regards,
Abdul
Contaminants would be removed post alignment, because you can't tell what a sequence represents until you align it to something. However, without knowing your specific purpose or what you're trying to achieve, it's hard to recommend a specific strategy. You can filter your BAM files to remove certain alignment targets (e.g. mitochondrial DNA), or you can generate counts on features and remove the features you want to ignore (i.e. rows in your count table representing mitochondrial genes, etc.). Both approaches come with caveats for your analysis.
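As a rough illustration of the count-table route, here is a minimal Python sketch. It assumes human-style gene symbols, where mitochondrial genes start with "MT-"; other organisms and annotations name them differently. The function names and the toy counts are hypothetical and not taken from any pipeline in this thread. It first reports what fraction of each sample's counts falls on mitochondrial genes, then drops those rows.

```python
def is_mito_gene(gene_name):
    # Assumption: human-style symbols, mitochondrial genes prefixed "MT-".
    return gene_name.upper().startswith("MT-")


def mito_fraction(counts):
    """Per-sample fraction of counts assigned to mitochondrial genes.

    counts maps gene name -> list of counts, one entry per sample.
    """
    n_samples = len(next(iter(counts.values())))
    total = [0.0] * n_samples
    mito = [0.0] * n_samples
    for gene, row in counts.items():
        for i, value in enumerate(row):
            total[i] += value
            if is_mito_gene(gene):
                mito[i] += value
    # Samples with no counts at all get a fraction of 0.0.
    return [m / t if t else 0.0 for m, t in zip(mito, total)]


def drop_mito_genes(counts):
    """Return a copy of the table without mitochondrial gene rows."""
    return {g: row for g, row in counts.items() if not is_mito_gene(g)}


# Hypothetical toy table: two samples, three genes.
toy_counts = {
    "ACTB": [900, 400],
    "MT-CO1": [100, 600],
    "GAPDH": [0, 0],
}

fractions = mito_fraction(toy_counts)
filtered = drop_mito_genes(toy_counts)
print(fractions)
print(sorted(filtered))
```

Looking at the per-sample fraction before dropping rows is one way to see whether the mitochondrial signal is large enough to matter for the analysis.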
seidel Thank you for the feedback. I am working on the gene counts file, filtered to include only protein coding genes and lncRNAs. I was quickly going through the script and assume that FASTQ files were assessed using FastQC > aligned using bowtie2 with inclusion of chrM in the reference > filtered + trimmed using fastp > quantified using rsem to obtain gene counts > filtered to include protein coding genes and lncRNAs.
seidel edited my reply.
You could use bbsplit.sh to bin the reads so that the contaminating reads can be separated. See: Extracting contaminated reads from the sequenced data
GenoMax Thank you for the inputs.
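For the bbsplit.sh suggestion above, a hedged Python sketch of how the command line might be assembled for paired-end reads. The `in1=`, `in2=`, `ref=`, `basename=`, `outu1=` and `outu2=` options follow the usage shown in the BBTools documentation, but check them against the help text of your installed version. All file names and the helper function are hypothetical. The sketch only builds and prints the command; it does not run it.

```python
import shlex


def build_bbsplit_command(reads_1, reads_2, references,
                          binned_pattern, clean_1, clean_2):
    """Assemble a bbsplit.sh argument list for paired-end reads.

    references: FASTA files of suspected contaminants, one bin per file.
    binned_pattern: output name containing '%', which bbsplit.sh
    replaces with the reference name.
    clean_1 / clean_2: reads that matched none of the references.
    """
    if "%" not in binned_pattern:
        raise ValueError("binned_pattern needs a '%' placeholder")
    return [
        "bbsplit.sh",
        f"in1={reads_1}",
        f"in2={reads_2}",
        "ref=" + ",".join(references),
        f"basename={binned_pattern}",
        f"outu1={clean_1}",
        f"outu2={clean_2}",
    ]


# Hypothetical file names, for illustration only.
command = build_bbsplit_command(
    "sample_R1.fq.gz", "sample_R2.fq.gz",
    ["chrM.fa", "rRNA.fa"],
    "binned_%.fq.gz",
    "clean_R1.fq.gz", "clean_R2.fq.gz",
)
print(shlex.join(command))
# To actually run it, pass the list to subprocess.run(command, check=True).
```

Note that this bins reads before alignment, so it would sit ahead of the bowtie2 step in the pipeline described earlier in the thread.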