Bacterial RefSeq to remove contaminants
6 weeks ago

Hi, community. I have been working on a transcriptome for my species of interest, which has an available genome. To expand my transcriptomic database, I decided, after assembling a genome-guided transcriptome, to assemble a de novo transcriptome using the reads that did not map (around 10% of my data). However, I suspected that I had contamination. Indeed, I mapped my dataset of non-aligning reads to several sequences (human, fungal, viral and bacterial), and for the bacterial genome (E. coli) around 30% of the reads that did not map to my genome mapped to this bacterial genome. Now that I know my source of contamination is probably bacterial, I was wondering: is there any database I can use to map and remove the contaminant reads?

6 weeks ago
GenoMax 99k

around 30% of the reads that did not map to my genome mapped to this bacterial genome.

There is no set database for contaminants. You don't really know that those reads came from E. coli to begin with; they simply have enough similarity to that genome to map to it. You may find that reads coming from basic metabolism genes in bacteria will map to multiple bacterial genomes equally well, especially if you are allowing for errors in alignment.
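One rough way to see how ambiguous such hits are is to count how many aligned records are reported at more than one location. The sketch below assumes the SAM stream carries the optional `NH:i` tag (HISAT2 writes it) and that the reads were aligned against an index built from several bacterial genomes; the function name `count_multimappers` and any index name used with it are hypothetical.

```shell
# Sketch: tally uniquely vs. multiply mapped alignment records from a SAM stream.
# Assumes each aligned record carries an NH:i tag (number of reported locations).
count_multimappers() {
    awk -F'\t' '
        /^@/ { next }                       # skip SAM header lines
        int($2 / 4) % 2 == 1 { next }       # skip unmapped records (flag 0x4)
        int($2 / 256) % 2 == 1 { next }     # skip secondary alignments (flag 0x100)
        {
            nh = 0
            for (i = 12; i <= NF; i++)      # optional tags start at column 12
                if ($i ~ /^NH:i:/) nh = substr($i, 6) + 0
            if (nh == 1) unique++
            else if (nh > 1) multi++
        }
        END { printf "unique=%d multi=%d\n", unique, multi }
    '
}
```

It could be fed directly from the aligner, for example `hisat2 -x db/bacteria_index -1 R1.fastq.gz -2 R2.fastq.gz | count_multimappers`, where `db/bacteria_index` and the FASTQ names are placeholders. A high `multi` count would be consistent with reads that cannot be assigned to any single bacterial genome.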

At some point you should set aside these suspected contaminant reads and go on with the transcriptome you have already put together. You probably have more interesting biology to discover there.

So it's enough to just use one bacterial genome and relax the parameters with my aligner? I've been using Hisat2 with default parameters like so:

hisat2 -p 4 -x db/ecoli_index -1 06_data_not_aligned/illumina/sample_R1.not_aligned.fastq.gz -2 06_data_not_aligned/illumina/sample_R2.not_aligned.fastq.gz
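If the goal is to keep only the pairs that do not hit the bacterial index, the same command could be extended with HISAT2's `--un-conc-gz` option, which writes pairs that fail to align concordantly to gzipped FASTQ files (a `%` in the path is replaced by 1 or 2 for the two mates). This is a minimal sketch, not a validated decontamination protocol: the function name `decontam_cmd`, the output directory `07_decontaminated`, and the `.clean` suffix are assumptions made for illustration. The function only prints the command so it can be checked before running.

```shell
# Sketch: build a HISAT2 command that saves pairs NOT aligning concordantly
# to the E. coli index. Output naming below is hypothetical.
decontam_cmd() {
    # $1 = sample prefix, $2 = output directory
    local sample="$1" outdir="$2"
    printf '%s ' hisat2 -p 4 -x db/ecoli_index \
        -1 "06_data_not_aligned/illumina/${sample}_R1.not_aligned.fastq.gz" \
        -2 "06_data_not_aligned/illumina/${sample}_R2.not_aligned.fastq.gz" \
        --un-conc-gz "${outdir}/${sample}_R%.clean.fastq.gz" \
        -S /dev/null                       # discard the alignments themselves
    printf '\n'
}

# Print the command for inspection; pipe the output to `bash` to execute it.
decontam_cmd sample 07_decontaminated
```

Note that pairs in which only one mate aligns also count as failing to align concordantly, so this filter is lenient, and a read mapping to E. coli is not proof that it came from E. coli.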


Since a lot of my reads mapped to the genome, I'm sure a lot of interesting results will come up. However, we want to build a more complete transcriptome to be used in future studies.

However, we want to build a more complete transcriptome to be used in future studies.

It is easy for me to say this, so apologies in advance, but you will be best served by making additional libraries (perhaps from different life cycle stages, organs, etc.) rather than going after this small fraction of reads that did not map to your genome in the first place.

6 weeks ago

Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank. DOI: https://doi.org/10.1186/s13059-020-02023-1 (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02023-1)