2.4 years ago by
Walnut Creek, USA
It's best to remove contamination as early as possible, to avoid chimeric assemblies and to reduce assembly time, memory requirements, and fragmentation. The disadvantage is that contamination is harder to identify at the read level because reads are short, so it sometimes makes more sense to work iteratively: assemble, identify contamination, map the reads to the contaminant genomes, remove the contaminant reads, and reassemble what's left over.
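The "remove contaminant reads" step of that loop can be sketched in a few lines of Python. This is only an illustration: it assumes you already have the names of the reads that mapped to contaminant genomes (from whatever mapper you used), that the FASTQ is unwrapped (4 lines per record), and the function names are made up for this example.

```python
from typing import Iterable, Iterator, List, Set, Tuple

FastqRecord = Tuple[str, str, str, str]  # header, sequence, plus line, qualities


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records (assumes unwrapped, well-formed input)."""
    # Skip blank lines, then consume the stream four lines at a time.
    it = (line.rstrip("\n") for line in lines if line.strip())
    for header, seq, plus, qual in zip(it, it, it, it):
        yield header, seq, plus, qual


def read_id(header: str) -> str:
    """Read name without the leading '@', the description, or a /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def drop_contaminant_reads(
    records: Iterable[FastqRecord], contaminant_ids: Set[str]
) -> List[FastqRecord]:
    """Keep only reads whose name is not in the contaminant set."""
    return [rec for rec in records if read_id(rec[0]) not in contaminant_ids]
```

Because the /1 and /2 suffixes are stripped before the lookup, a pair is dropped as a unit when its shared name is in the set, which keeps paired files in sync.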
It seems extremely unlikely to me that you actually have contamination from so many different unrelated sources, unless they were multiplexed together. Is your lab researching all of these organisms? I've never seen a situation in which anything is heavily contaminated with multiple different animals. Usually there is either heavy contamination by one organism or trace contamination from several (particularly microbes). It's more likely that these BLAST hits are wrong (perhaps the contaminant organism is not represented in the databases, even at a high taxonomic level). Rather than BLASTing in protein space, I suggest you do so in DNA space (against nt) to see whether you get long exact matches to a specific organism, rather than hits to genes that might simply be heavily conserved; that would give greater confidence.
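To separate long, near-exact nucleotide matches from short or merely conserved ones, you could filter BLAST's standard 12-column tabular output (`-outfmt 6`), where column 3 is percent identity and column 4 is alignment length. A minimal sketch follows; the 98% and 200 bp cutoffs and the function names are arbitrary choices for illustration, not recommended values.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Hit = Tuple[str, str, float, int]  # query, subject, percent identity, alignment length


def long_near_exact_hits(
    lines: Iterable[str], min_identity: float = 98.0, min_length: int = 200
) -> List[Hit]:
    """Keep hits that are both long and nearly identical.

    Expects default -outfmt 6 columns: qseqid sseqid pident length ...
    The thresholds are illustrative assumptions.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        if identity >= min_identity and length >= min_length:
            hits.append((query, subject, identity, length))
    return hits


def subjects_by_support(hits: Iterable[Hit]) -> List[Tuple[str, int]]:
    """Count how many distinct queries support each subject, most supported first."""
    pairs = {(query, subject) for query, subject, _, _ in hits}
    return Counter(subject for _, subject in pairs).most_common()
```

The subject IDs in nt are accessions, so you still have to look up which organism each belongs to. If most of the surviving hits point at one organism, that is a much stronger signal than scattered protein-level hits to many unrelated taxa.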
I wrote several tools for decontamination, particularly BBSplit and Seal, both of which map reads to multiple genomes simultaneously and bin them into one file per organism according to the best match, so that you can get a clean assembly of each one. They require assemblies of the organisms involved, though. Seal may work better with transcriptomes than BBSplit. If this is cross-contamination (due to multiple projects being handled together or multiplexed together), CrossBlock is an even better solution.
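To make the "bin by best-matching organism" idea concrete, here is a toy Python illustration. It is not how BBSplit or Seal are implemented; it just assigns each read to the reference it shares the most k-mers with, and leaves ties and reads with no shared k-mers unassigned. It ignores reverse complements and sequencing errors, and all names and the k value are made up for the example.

```python
from typing import Dict, Optional, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All overlapping substrings of length k (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each reference name to its k-mer set."""
    return {name: kmers(seq, k) for name, seq in references.items()}


def best_bin(read: str, index: Dict[str, Set[str]], k: int) -> Optional[str]:
    """Name of the single reference sharing the most k-mers with the read.

    Returns None when no reference shares any k-mer, or when the top
    score is tied between references (an ambiguous read).
    """
    read_kmers = kmers(read, k)
    counts = {name: len(read_kmers & ref) for name, ref in index.items()}
    best = max(counts.values(), default=0)
    winners = [name for name, count in counts.items() if count == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]
```

Writing each read to a file named after its bin (and unassigned reads to a separate file) then gives one read set per organism to assemble independently.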
Another thing to note is that you cannot quantify contamination by mapping the assemblies to nr/nt, particularly in transcriptomes. You have to use the raw reads, because even 1% contamination at the read level can assemble into 50% or more of your assembly. So, have you quantified the contamination at the read level? Again, I use Seal for that once the contaminant organisms have been identified (since BLAST is far too slow to process all reads), but you can just BLAST a few thousand of the reads instead.
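If you go the "BLAST a few thousand reads" route, the subsample should be random rather than the first reads in the file. One way to do that in a single pass is reservoir sampling; the sketch below assumes reads are already parsed into (name, sequence) pairs, and the function names and fixed seed are illustrative choices.

```python
import random
from typing import Iterable, List, Set, Tuple

Read = Tuple[str, str]  # read name, sequence


def reservoir_sample(records: Iterable[Read], n: int, seed: int = 0) -> List[Read]:
    """Uniform random sample of up to n records in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample: List[Read] = []
    for i, record in enumerate(records):
        if i < n:
            sample.append(record)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                sample[j] = record
    return sample


def to_fasta(records: Iterable[Read]) -> str:
    """Format reads as FASTA text suitable as a BLAST query file."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


def contaminated_fraction(sampled_names: Iterable[str], hit_names: Set[str]) -> float:
    """Fraction of sampled reads that had a contaminant hit."""
    names = list(sampled_names)
    if not names:
        return 0.0
    return sum(name in hit_names for name in names) / len(names)
```

The fraction you get from the sample is an estimate of read-level contamination, which is the number to compare against the proportion of contaminant contigs in the assembly.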