Question

Filter Bacterial sequences from Metagenomics

2

7.7 years ago

tucanj ▴ 100

I have a large metagenomic RNA-seq dataset that I am trying to assemble to find viral sequences, but it is too large for my hardware (52 GB RAM). When I BLAST reads, I see a lot of bacterial contamination from many different species. I want to filter out all bacterial reads so that I can assemble. Ideas?

Download all bacterial genomes from RefSeq and align the reads to them with bowtie (this will take a long time). Also, since when have the compressed RefSeq bacterial fna files reached 72 GB when combined?! The last all.bacteria.gz file in the RefSeq archive, from 2015, is 2.7 GB...
Somehow condense all bacterial genomes into a non-redundant set, then align?
Other ideas?

RNA-Seq genome alignment sequencing • 3.9k views

ADD COMMENT • link updated 7.7 years ago by Brian Bushnell 20k • written 7.7 years ago by tucanj ▴ 100

3

Have you checked Kraken or Kaiju to do the binning? You can even use MG-RAST to do the taxonomic classification and then download only the viral reads.
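
A rough sketch of how classifier output could be turned into a list of reads to drop. It assumes tab-separated per-read output whose first three columns are the classified flag (C or U), the read ID, and a numeric taxon ID, which is the layout Kraken and Kaiju both produce by default. The `parents` map is a hypothetical child-to-parent taxonomy lookup that you would build yourself (for example from the NCBI taxonomy dump). Unclassified reads are kept on purpose, since novel viruses are likely to end up there.

```python
from typing import Dict, Iterable, Set

BACTERIA_TAXID = 2  # NCBI taxonomy ID for Bacteria


def is_descendant(taxid: int, ancestor: int, parents: Dict[int, int]) -> bool:
    """Walk up a child -> parent map until the ancestor or the root is reached."""
    seen = set()
    while taxid not in seen:  # the root usually points at itself, so stop on repeats
        if taxid == ancestor:
            return True
        seen.add(taxid)
        parent = parents.get(taxid)
        if parent is None:
            return False
        taxid = parent
    return False


def bacterial_read_ids(lines: Iterable[str], parents: Dict[int, int]) -> Set[str]:
    """Collect IDs of reads classified at or below Bacteria."""
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "C":
            continue  # unclassified reads are kept for assembly
        if is_descendant(int(fields[2]), BACTERIA_TAXID, parents):
            ids.add(fields[1])
    return ids
```

The returned IDs can then be used to drop both mates of each pair from the FASTQ files before assembly.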

ADD REPLY • link 7.7 years ago by sentausa ▴ 650

0

Kaiju will work with only 50GB RAM. You can also use the web server and upload your reads there for taxonomic classification.

ADD REPLY • link 7.7 years ago by Peter ▴ 90

1

If you are not looking for novel viral sequences, then doing the binning in reverse may be better. Get the RefSeq viral sequences from here and then use BBSplit to bin the reads into viruses and the rest.

ADD REPLY • link 7.7 years ago by GenoMax 141k

1

Thanks for the clarification. I am looking for novel viral sequences, so I want to filter using close alignment to known bacterial species. Assembling part of the data and then aligning the original reads to the identified contaminants could work, but there are too many different bacterial contaminants to make this practical.

ADD REPLY • link 7.7 years ago by tucanj ▴ 100

0

If novel sequences are the requirement, then slogging through multiple rounds of alignment/assembly may be the order of the day. As @Brian noted below, this process is going to be fraught with hurdles, and you are likely to hit many false positives along the way. I don't see an easy solution.

ADD REPLY • link 7.7 years ago by GenoMax 141k

score 6 · Accepted Answer · 2016-07-26

6

7.7 years ago

Brian Bushnell 20k

You cannot determine whether there is bacterial contamination simply from BLASTing reads. Similarly, it is impossible to trivially filter out bacterial contamination by mapping to all known bacteria, because viruses tend to share sequence with their hosts, and there's no guarantee that your bacteria are in the reference dataset.

You need a completely different approach. Perhaps you should assemble the data, annotate the assemblies, and then pull out contigs with genes known to occur only in viruses...
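
As a minimal sketch of that last step, assuming the annotation tool's output has already been reduced to (contig, gene name) pairs: the function below keeps contigs that carry at least one gene from a hallmark set. The hallmark set is something you would have to curate yourself; the names used in the usage note are placeholders, not a vetted list.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def contigs_with_hallmarks(
    annotations: Iterable[Tuple[str, str]], hallmarks: Set[str]
) -> Dict[str, Set[str]]:
    """Map each contig to the hallmark genes annotated on it."""
    hits: Dict[str, Set[str]] = defaultdict(set)
    for contig, gene in annotations:
        if gene in hallmarks:
            hits[contig].add(gene)
    return dict(hits)
```

For example, calling it with `hallmarks={"hallmark_gene_A", "hallmark_gene_B"}` (hypothetical names) returns only the contigs annotated with one of those genes, along with which ones were found.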

ADD COMMENT • link 7.7 years ago by Brian Bushnell 20k

0

I also assembled subsets of the reads, and used digital normalization so that I could assemble, and found many bacterial contigs in the assembly. You are right that there is no guarantee that my bacteria are in the reference set, but at least I will be able to reduce the size of the original dataset. I could be wrong, but if I require that both paired ends (101 bp) align concordantly to the bacterial db, the false positive rate should be low.
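
A sketch of what I mean by the concordant filter, assuming the aligner writes SAM and sets the "proper pair" flag bit (0x2) for concordantly aligned pairs. It collects the names of pairs to remove; everything else would go on to assembly. The function name is made up for illustration.

```python
from typing import Iterable, Set

PROPER_PAIR = 0x2      # both mates aligned as a proper (concordant) pair
SECONDARY = 0x100      # secondary alignment record
SUPPLEMENTARY = 0x800  # supplementary alignment record


def concordant_pair_names(sam_lines: Iterable[str]) -> Set[str]:
    """Return read names whose primary alignment is flagged as a proper pair."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # only judge a pair by its primary records
        if flag & PROPER_PAIR:
            names.add(fields[0])
    return names
```

Pairs where only one mate aligns, or where the mates align discordantly, are not flagged and so stay in the dataset.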

The reason for the original filtering is that I cannot assemble the full dataset because of its size. I could assemble with digital normalization and then go from there with your idea.
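
For anyone unfamiliar with digital normalization, here is a toy sketch of the idea: a read is kept only if the median count of its k-mers, among reads kept so far, is below a coverage cutoff. This version uses an exact in-memory counter and ignores reverse complements, so it is only meant to show the logic; real tools use memory-efficient probabilistic counting, and the parameter values here are arbitrary.

```python
from collections import Counter
from statistics import median
from typing import Iterable, Iterator


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer in a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize_by_median(
    reads: Iterable[str], k: int = 20, cutoff: int = 20
) -> Iterator[str]:
    """Yield reads whose median k-mer count so far is below the cutoff."""
    counts: Counter = Counter()
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            yield read
```

High-coverage regions (such as abundant bacterial transcripts) get thinned out, while low-coverage reads pass through, which is what shrinks the dataset enough to assemble.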

Thanks for your help!

ADD REPLY • link 7.7 years ago by tucanj ▴ 100