Detect contaminating organism
6 months ago

Hello everyone!

I am new to bioinformatics and have never faced this problem before. I am working with dataset GSE172189, which appears to be contaminated by bacteria: only ~60% of reads are aligned by Salmon, and the SRA taxonomy analysis reports up to 25% of reads coming from bacteria. I want to identify the contaminating species so that I can remove their reads. Here I could take the most abundant organisms from the SRA Taxonomy Analysis, download their genomes, and run BBSplit against them. But I am not sure this would be as effective as I want, and in the general case I would not know the contaminating species at all. How do you proceed in this case? Is there a tool that can BLAST reads against multiple organisms and report frequency statistics?

Thank you in advance!

contamination taxonomy BLAST • 874 views

"I wanted to detect contaminating species to finally get rid of them."

You could use the reads that salmon was able to assign and ignore the rest.

NCBI uses a tool called STAT (LINK) for the taxonomy results they show in SRA.


Thank you! According to the dataset description, the reads also seem to contain UMIs, so I am not sure that all unaligned reads come from contamination. My intention was to first remove the contaminating reads and then deal with the UMIs.


Which exact sample out of GSE172189 are you referring to? I checked a couple and they all seem to say 99% Euk. I also don't see any mention of UMI.


For example, GSM5243620 has ~25% Bact., and the GC distribution is weird [figure: GC distribution by FastQC]. They claim to have linked UMIs in the article: "Next, the 3’ ends of first-strand cDNA fragments were ligated with a linker containing Illumina-compatible P5 sequences and Unique Molecular Identifiers."
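
To make the GC observation concrete, here is a minimal Python sketch of a per-read GC histogram. It is not FastQC's implementation, only the same idea in a few lines; the function names and the 10% bin width are my own choices for illustration. A second peak away from the expected one is consistent with reads from a second organism, although library-prep artifacts can produce similar shapes.

```python
from collections import Counter
from typing import Iterable


def gc_percent(seq: str) -> float:
    """GC percentage of one read, ignoring N and other non-ACGT characters."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(seqs: Iterable[str], bin_width: int = 10) -> Counter:
    """Count reads per GC bin; keys are the lower edge of each bin in percent."""
    hist: Counter = Counter()
    for seq in seqs:
        lower_edge = int(gc_percent(seq) // bin_width) * bin_width
        # Put reads with exactly 100% GC into the last bin.
        hist[min(lower_edge, 100 - bin_width)] += 1
    return hist


if __name__ == "__main__":
    # Toy sequences for illustration only, not taken from GSM5243620.
    toy_reads = ["ATATATAT", "ACGTACGT", "GGCCGGCC", "GCGCGCGC"]
    for lower_edge, count in sorted(gc_histogram(toy_reads).items()):
        print(f"{lower_edge:3d}-{lower_edge + 10:3d}% GC: {count}")
```

Feeding it the sequence lines of a FASTQ file (every fourth line, starting from the second) gives a rough text version of the FastQC plot.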


If you must work with this data, you could align it (do not use salmon) with an aligner like STAR or BBMap. Then recover the mapped reads from the original dataset (depending on where the UMIs are, they will likely be soft-clipped by the aligner) and do what you need to afterwards.

Use filterbyname.sh for extracting the mapped reads from original data files.
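
For anyone who wants to see what "recover the mapped reads from the original files" amounts to, here is a small Python sketch of the idea. It is not how filterbyname.sh is implemented, just a stand-in under these assumptions: the FASTQ has strict 4-line records, and you already have the names of the mapped reads (for example, the first column of the mapped SAM records). The function names are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple

# (header, sequence, separator line, qualities)
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from strict 4-line FASTQ input (assumption: no wrapped lines)."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups the lines into records.
    yield from zip(stripped, stripped, stripped, stripped)


def read_name(header: str) -> str:
    """Read ID: header without '@', cut at the first whitespace, without /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def keep_mapped(lines: Iterable[str], mapped_names: Set[str]) -> Iterator[FastqRecord]:
    """Yield only the original, untrimmed records whose read ID was mapped."""
    for record in read_fastq(lines):
        if read_name(record[0]) in mapped_names:
            yield record


if __name__ == "__main__":
    # Toy input for illustration only.
    toy_fastq = [
        "@r1 extra\n", "ACGT\n", "+\n", "IIII\n",
        "@r2/1\n", "GGCC\n", "+\n", "IIII\n",
        "@r3\n", "TTTT\n", "+\n", "IIII\n",
    ]
    for record in keep_mapped(toy_fastq, {"r1", "r2"}):
        print("\n".join(record))
```

Because the records come from the original file, any UMI bases that the aligner soft-clipped are still present in the output.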


Thank you for such a great explanation!


You could include a tool like Kraken2 with bacterial and human databases to identify the likely source organism of your data. I like to use these in QC when I use publicly available raw datasets.

But I also agree with the other commenter: ignoring reads that don't map to your reference is a simple way of removing likely contamination (along with, probably, a small proportion of unmapped reads from the target species).
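
If you go the Kraken2 route, the frequency statistics asked about in the question can be read from its report file. Below is a minimal Python sketch that ranks genera by read count. It assumes the standard six-column, tab-separated report (percent, clade reads, direct reads, rank code, taxid, indented name) without the optional minimizer columns; the function names are mine, and the toy report uses made-up numbers purely for illustration.

```python
from typing import List, NamedTuple


class Taxon(NamedTuple):
    percent: float
    clade_reads: int
    rank: str
    name: str


def parse_report(text: str) -> List[Taxon]:
    """Parse a standard six-column Kraken2 report; other lines are skipped."""
    taxa = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue
        percent, clade_reads, _direct, rank, _taxid, name = fields
        taxa.append(Taxon(float(percent), int(clade_reads), rank, name.strip()))
    return taxa


def top_taxa(taxa: List[Taxon], rank: str = "G", n: int = 5) -> List[Taxon]:
    """Return the n taxa with the most clade-level reads at one rank code."""
    hits = [t for t in taxa if t.rank == rank]
    return sorted(hits, key=lambda t: t.clade_reads, reverse=True)[:n]


# Made-up numbers, for illustration only (not results for GSE172189).
TOY_REPORT = "\n".join([
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t5\tR\t1\troot",
    " 60.00\t600\t10\tG\t9605\t          Homo",
    " 25.00\t250\t20\tG\t561\t          Escherichia",
    "  5.00\t50\t50\tG\t286\t          Pseudomonas",
])

if __name__ == "__main__":
    for taxon in top_taxa(parse_report(TOY_REPORT)):
        print(f"{taxon.name}\t{taxon.clade_reads}\t{taxon.percent:.2f}%")
```

Changing `rank` to `"S"` would rank species instead of genera, which is the level you would want when picking reference genomes for a follow-up filtering step.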


Thank you for the suggestion!

6 months ago

If you run (from BBTools):

sendsketch.sh in=reads.fq depth level=genus reads=10m

...it will indicate the approximate k-mer depth per organism. It's way faster than BLAST. Post the results here if you have trouble interpreting them. Once you know the primary contaminants, you can use BBSplit or Seal with their references. Normally, though, if you have bacteria contaminating a eukaryote for which you have a reference, you can ignore the bacterial reads since they won't map; contamination is more of an issue when

  1. You are assembling a novel species,
  2. The contaminant is in the same clade, e.g., mouse contaminating human, or
  3. You're dealing with metagenomes or are interested in commensals

However, I do like to decontaminate early when possible.
