Suggestions for which genomes to use when removing microbial contamination with BBSplit
2.8 years ago
Dave Carlson ▴ 660

Hi Biostars, I'm currently working on a few pipelines to process and perform various analyses on human RNA-seq data. One of the steps in all of the pipelines is removal of contaminating microbial reads from the input fastq files. Based on recommendations here and elsewhere, I'm using the BBSplit program from BBMap.

My question is in regard to which potential sources of contamination I should be mapping to. Currently, I've downloaded essentially all the microbial RefSeq assemblies (bacterial, archaeal, protozoan, viral, fungal) and concatenated them together into a single "contaminants" fasta file.
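
For reference, the concatenation step described above can be sketched roughly as follows. This is a minimal illustration, not the exact commands used: the function name `concatenate_fastas`, the group labels, and the `label|` header prefix are all assumptions made for the example. Prefixing each header with its group makes it easier to see later which category a contaminating read mapped to.

```python
def concatenate_fastas(inputs, output_path):
    """Merge several FASTA files into one, tagging each header with a label.

    inputs: dict mapping a group label (hypothetical, e.g. "bacteria")
            to the path of an uncompressed FASTA file.
    Returns the number of sequence records written.
    """
    n_records = 0
    with open(output_path, "w") as out:
        for label, path in inputs.items():
            with open(path) as handle:
                for raw in handle:
                    line = raw.rstrip("\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        # Prefix the header so the source group stays visible.
                        out.write(f">{label}|{line[1:]}\n")
                        n_records += 1
                    else:
                        # Always end with a newline so the next file's
                        # header never fuses with this sequence line.
                        out.write(line + "\n")
    return n_records
```

Usage would look like `concatenate_fastas({"bacteria": "bacteria.fa", "viral": "viral.fa"}, "contaminants.fa")`, where the file names are placeholders.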

However, using all of these genomes makes the analysis take a couple of hours for each sample and, more importantly, uses a prohibitively large amount of memory (> 500 GB). I'd like to pare down the number of microbial assemblies I use in this analysis, but I'm not sure where to start.
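
One rough way to see where trimming would help most is to tally how many bases each group contributes to the combined reference, on the assumption that index memory grows with total reference length. The sketch below is only an illustration: the function names `count_bases` and `rank_by_size` and the group labels are hypothetical, and it expects uncompressed FASTA files.

```python
def count_bases(fasta_path):
    """Count sequence characters in a FASTA file, ignoring header lines."""
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def rank_by_size(fasta_paths):
    """Return (label, total_bases) pairs sorted from largest to smallest.

    fasta_paths: dict mapping a group label (hypothetical) to a FASTA path.
    """
    sizes = {label: count_bases(path) for label, path in fasta_paths.items()}
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```

Groups at the top of the ranking are the ones whose removal or subsampling would shrink the reference the most.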

Are there any "standard" sets of genomes that people typically use when decontaminating fastq data? Alternatively, if anybody has performed this sort of analysis before and has suggestions for which species I should (or shouldn't!) include, I'd love to get your advice.



genome BBmap • 692 views

Is contamination a real concern? If your analysis pipeline includes mapping to the human genome, most or all contaminants would be filtered out at this stage.


That's a good question. Contamination is not of special concern (this isn't ancient DNA!). I mostly just wanted to be thorough. But yes, all pipelines will involve either mapping to the human reference genome or transcriptome.

