Question: Filter Bacterial sequences from Metagenomics
3.2 years ago, tucanj wrote:

I have a large metagenomic RNA-seq dataset that I am trying to assemble to find viral sequences, but it is too large for my hardware (52 GB of RAM). When I BLAST reads, I can see a lot of bacterial contamination from many different species. I want to filter out all bacterial reads so that I can assemble the rest. Ideas?

  1. Download all bacterial genomes from RefSeq and align the reads to them with Bowtie (this will take a long time). Also, since when have the compressed RefSeq bacterial .fna files added up to 72 GB?! The last all.bacteria.gz file in the RefSeq archive, from 2015, is only 2.7 GB...

  2. Somehow condense all bacterial genomes into a non-redundant set, then align?

  3. Other ideas?
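
In its simplest form, idea 2 means dropping exact duplicate sequences before building an index. A minimal sketch, assuming the genomes are already loaded as (name, sequence) pairs. Treating a sequence and its reverse complement as the same record is an assumption made here, the function names are hypothetical, and real redundancy removal would have to cluster near-identical sequences rather than only exact copies.

```python
import hashlib

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Pick one representative of a sequence and its reverse complement."""
    upper = seq.upper()
    return min(upper, reverse_complement(upper))


def dedupe_exact(records):
    """Keep the first occurrence of each sequence (either strand).

    records: iterable of (name, sequence) pairs.
    Only digests are stored, so memory grows with the number of
    distinct sequences rather than with their total length.
    """
    seen = set()
    kept = []
    for name, seq in records:
        digest = hashlib.sha1(canonical(seq).encode("ascii")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append((name, seq))
    return kept
```

This only removes byte-identical records, so it would shrink a reference set much less than a clustering approach would.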


Have you tried Kraken or Kaiju for the binning? You can even use MG-RAST to do the taxonomic classification and then download only the viral reads.
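
Kraken assigns reads by exact k-mer matches against a database, while Kaiju matches translated reads at the protein level. The following is a toy illustration of the k-mer idea only, not Kraken's algorithm, database format, or command line. The labels, the `min_fraction` threshold, and all function names are assumptions made for this sketch, and reverse-complement matching is left out for brevity.

```python
def kmers(seq, k):
    """Yield every overlapping k-mer of seq (upper-cased)."""
    seq = seq.upper()
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each k-mer to the set of labels whose reference contains it.

    references: iterable of (label, sequence) pairs.
    """
    index = {}
    for label, seq in references:
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(read, index, k, min_fraction=0.5):
    """Return the label supported by most k-mers, or "unclassified".

    Only k-mers that belong to exactly one label get a vote, and the
    winning label must cover at least min_fraction of the read's k-mers.
    """
    votes = {}
    total = 0
    for kmer in kmers(read, k):
        total += 1
        labels = index.get(kmer, ())
        if len(labels) == 1:
            (label,) = labels
            votes[label] = votes.get(label, 0) + 1
    best = max(votes, key=votes.get, default=None)
    if best is None or votes[best] / total < min_fraction:
        return "unclassified"
    return best
```

Reads that come back "unclassified" are the ones a novel-virus search would want to keep, together with anything labelled viral.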

written 3.2 years ago by sentausa

Kaiju will work with only 50GB RAM. You can also use the web server and upload your reads there for taxonomic classification.

written 3.2 years ago by Peter

If you are not looking for novel viral sequences, then doing the binning in reverse may be better. Get the RefSeq viral sequences from here and then use BBSplit to bin the reads into viral and everything else.

written 3.2 years ago by genomax

Thanks for your clarification. I am looking for novel viral sequences, so I want to filter by close alignment to known bacterial species. Assembling part of the data and then aligning the original reads to the identified contaminants could work, but there are too many different bacterial contaminants to make this practical.

written 3.2 years ago by tucanj (modified 3.2 years ago)

If novel sequences are the requirement, then slogging through multiple rounds of alignment/assembly may be the order of the day. As @Brian noted below, this process is going to be fraught with hurdles, and you are likely to hit many false positives along the way. I don't see an easy solution.

written 3.2 years ago by genomax
3.2 years ago, Brian Bushnell (Walnut Creek, USA) wrote:

You cannot determine whether there is bacterial contamination simply from BLASTing reads. Similarly, it is impossible to trivially filter out bacterial contamination by mapping to all known bacteria, because viruses tend to share sequence with their hosts, and there's no guarantee that your bacteria are in the reference dataset.

You need a completely different approach. Perhaps you should assemble the data, annotate the assemblies, and then pull out contigs with genes known to occur only in viruses...
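
The last step of that approach can be sketched as follows, assuming the annotation has already been reduced to (contig, gene product) pairs. The hallmark list is an illustrative placeholder rather than a vetted set of virus-only genes, and the function name is hypothetical. Note that prophages inside bacterial genomes can carry the same genes, so hits are candidates only.

```python
# Placeholder examples of phage hallmark gene products; a real list would
# have to be curated for the viruses of interest.
VIRAL_HALLMARKS = frozenset({
    "terminase large subunit",
    "major capsid protein",
    "portal protein",
})


def viral_candidate_contigs(annotations, hallmarks=VIRAL_HALLMARKS):
    """Return {contig_id: set of hallmark products found on that contig}.

    annotations: iterable of (contig_id, gene_product) pairs.
    Matching is exact after trimming whitespace and lower-casing.
    """
    hits = {}
    for contig, product in annotations:
        key = product.strip().lower()
        if key in hallmarks:
            hits.setdefault(contig, set()).add(key)
    return hits
```

Contigs with several distinct hallmark products would be stronger candidates than those with a single hit.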


I also assembled parts of the reads (and used digital normalization so that I could assemble at all) and found many bacterial contigs in the assembly. You are right that there is no guarantee that my bacteria are in the reference set, but at least I will be able to reduce the size of the original dataset. I could be wrong, but if I require that both mates of a pair (101 bp each) align concordantly to the bacterial database, the false positive rate should be low.
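
A minimal sketch of that concordant-pair filter, assuming the aligner writes SAM and sets the proper-pair FLAG bit (0x2) for concordant alignments. The function names and the in-memory (name, mate 1, mate 2) representation of read pairs are assumptions made for this example.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: pair aligned properly (concordantly)
UNMAPPED = 0x4     # SAM FLAG bit: this read itself is unmapped


def concordant_read_names(sam_lines):
    """Collect names of read pairs flagged as properly paired.

    sam_lines: iterable of SAM text lines; header lines start with "@".
    """
    names = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & PROPER_PAIR and not flag & UNMAPPED:
            names.add(name)
    return names


def drop_pairs(pairs, names_to_drop):
    """Keep only the pairs whose name is not in names_to_drop.

    pairs: iterable of (name, mate1_sequence, mate2_sequence).
    """
    return [pair for pair in pairs if pair[0] not in names_to_drop]
```

Pairs where only one mate aligns, or where the mates align discordantly, are kept. Bowtie 2 also has an `--un-conc` option that writes the pairs that did not align concordantly, which may make a separate filtering step unnecessary.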

The reason for the original filtering is that I cannot assemble the full dataset because of its size. I could assemble with digital normalization and then go from there with your idea.
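
For reference, digital normalization discards a read once the median abundance of its k-mers, counted over the reads kept so far, has reached a cutoff. A simplified sketch of that idea, assuming reads are plain strings. It uses an exact dictionary for counts, whereas tools such as khmer use a memory-bounded probabilistic counter, and the default `k` and `cutoff` values here are placeholders.

```python
from statistics import median


def normalize_by_median(reads, k=20, cutoff=20):
    """Return the reads kept by a simple median k-mer abundance filter.

    A read is kept if the median count of its k-mers, among reads kept
    so far, is below cutoff. Kept reads then add to the counts.
    Reads shorter than k are skipped.
    """
    counts = {}
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue
        if median(counts.get(kmer, 0) for kmer in read_kmers) < cutoff:
            kept.append(read)
            for kmer in read_kmers:
                counts[kmer] = counts.get(kmer, 0) + 1
    return kept
```

High-coverage sequence is thinned down to roughly the cutoff depth while low-coverage sequence passes through untouched, which is why this reduces memory needs without targeting any particular taxon.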

Thanks for your help!

written 3.2 years ago by tucanj