Question: How to assemble viral genomes when my data contains host DNA as well
12 months ago by
GBC_Zonatos10 wrote:

I'm currently trying to assemble a viral genome, but I'm unsure how to proceed, as my samples contain both viral DNA and bacterial DNA (from its host).

I'm using a pipeline that we normally use for bacterial assemblies without problems: A5 and SPAdes assemble the contigs, and then both assemblies go into GMCloser to try to close any gaps. We get very good results for bacteria, and we seem to have achieved good results for the viral DNA as well: we obtained 42 scaffolds, two of them with coverage over 2000. In a BLAST search against NCBI, one of these two scaffolds matched our host bacterium, while the other matched a viral genome similar to what we expected. This viral scaffold had the highest average coverage (cov > 2000) and a length of 40 kbp, and it aligned to a known virus that infects the host we found in our samples. It seems that we recovered most of the genome, since the complete genome of the virus it aligned to is also around 40 kbp long.

I'm unsure how to check that scaffold for contamination, though. It appears to be the right length. After running BLAST against NCBI I found a few similar viruses, retrieved their complete genomes, and compared them with my scaffold by ANI (using MUMmer alignments), which showed that 35350 bp (87.79% of my genome) aligned to a reference viral genome. Using Genome Detective, I found that it aligned with 94% coverage/concordance to a specific viral genome, which seems to confirm a good alignment.
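
One way to make an aligned-fraction figure like the one above more actionable is to list the scaffold regions that did not align to the viral reference, since those are the first places to inspect for host sequence. Below is a minimal Python sketch, assuming the alignment blocks have already been exported as 1-based inclusive (start, end) coordinates on the scaffold (for example from a MUMmer coordinate table). The coordinates, the 500 bp threshold, and the function names are hypothetical illustrations, not real data from this assembly.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous block: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def aligned_fraction(intervals, scaffold_length):
    """Fraction of the scaffold covered by at least one alignment block."""
    covered = sum(end - start + 1 for start, end in merge_intervals(intervals))
    return covered / scaffold_length


def unaligned_regions(intervals, scaffold_length, min_length=500):
    """Scaffold stretches of at least min_length bp with no alignment."""
    gaps = []
    position = 1  # next scaffold position not yet accounted for
    for start, end in merge_intervals(intervals):
        if start - position >= min_length:
            gaps.append((position, start - 1))
        position = max(position, end + 1)
    if scaffold_length - position + 1 >= min_length:
        gaps.append((position, scaffold_length))
    return gaps


# Hypothetical alignment blocks on a hypothetical 40,000 bp scaffold.
blocks = [(1, 12000), (11500, 20000), (23001, 38000)]
scaffold_length = 40000

fraction = aligned_fraction(blocks, scaffold_length)
gaps = unaligned_regions(blocks, scaffold_length)
print(f"aligned fraction: {fraction:.4f}")
for start, end in gaps:
    print(f"unaligned: {start}-{end} ({end - start + 1} bp)")
```

Each unaligned stretch could then be extracted and searched separately against the host genome. A stretch that matches the host well would be a candidate chimeric junction, whereas one that matches nothing may simply be a region where this virus differs from the reference.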

Are there any other steps I can use to search this scaffold for host DNA, in case some DNA was badly assembled? I've run all scaffolds through the Genome Detective tool mentioned above, and it found viral DNA on only one other scaffold, with just 3% alignment. That leads me to think this scaffold is actually from the host, and that the 3% alignment comes from sequences shared between a virus and the host itself. I'm wondering whether my 'viral scaffold' might also contain such shared sequences and, if so, whether the assembly could have generated chimeras that mix host DNA into it.

Looking for some input from anyone more experienced with viral assemblies.

12 months ago by
United States
colin.kern940 wrote:

If the host bacterium has a known genome assembly, you can use any short-read aligner, e.g. BWA or Bowtie, to align your raw reads to the bacterial genome. Then take the unaligned reads and run your assembly pipeline on just those.
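
As a rough illustration of the "take the unaligned reads" step, here is a minimal Python sketch that reads aligner output in SAM text format and collects the names of read pairs in which neither mate aligned to the host. It assumes paired-end reads and relies only on the standard SAM FLAG bits. The function name and the example records are hypothetical, and in practice a dedicated tool such as samtools would normally do this filtering.

```python
UNMAPPED = 0x4          # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8     # SAM FLAG bit: the mate of this read is unmapped
SECONDARY = 0x100       # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM FLAG bit: supplementary alignment


def host_free_read_names(sam_lines):
    """Return names of read pairs where neither mate aligned to the host.

    Assumes paired-end data; for single-end reads only the UNMAPPED bit
    would be meaningful.
    """
    both_unmapped = UNMAPPED | MATE_UNMAPPED
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # only consider primary records
        if flag & both_unmapped == both_unmapped:
            names.add(name)
    return names


# Hypothetical SAM records (made up for illustration).
example_sam = [
    "@SQ\tSN:host_chr\tLN:5000000",
    "readA\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",                   # both mates unmapped
    "readA\t141\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
    "readB\t99\thost_chr\t100\t60\t4M\t=\t300\t204\tGGCA\tIIII",    # proper pair on host
    "readB\t147\thost_chr\t300\t60\t4M\t=\t100\t-204\tCCTA\tIIII",
    "readC\t73\thost_chr\t900\t60\t4M\t=\t900\t0\tAGGT\tIIII",      # only one mate mapped
    "readC\t133\thost_chr\t900\t0\t*\t=\t900\t0\tTTTT\tIIII",
]

print(sorted(host_free_read_names(example_sam)))
```

Requiring both mates to be unmapped is the stricter choice: a pair with one mate on the host is dropped entirely. That removes more host signal, but it also discards pairs that span a genuine virus/host junction, so whether it is appropriate depends on the biology of the virus.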

12 months ago by
onestop_data250 wrote:

I agree with @colin.kern. If the host is not known, you can try a tool such as MetaBAT, which uses unsupervised methods to sort contigs from a mixed community into one bin per organism; in your case, the virus and the host.

12 months ago by
Mensur Dlakic8.1k wrote:

From what you describe, it seems like you have a clean co-assembly of a virus and its host. You already have a suggestion to remove host-mapping reads, which I think is worth trying.

A couple of additional suggestions: 1) Check the completeness of your viral and host contig bins using CheckM. It will estimate the host genome completeness, which is probably good to know, and if everything is correct it should assign your viral contig to the root category with 0% completeness. That would tell you indirectly that the viral contig is not cellular. 2) Do a tetranucleotide (or pentanucleotide) frequency embedding of all your contigs using PCA or t-SNE. You have lots of choices here: I like MetaBAT and CONCOCT, and VizBin is pretty user-friendly. Any of them should work, as viral contigs are normally clearly separable from bacterial contigs.
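
To show what the input to such an embedding looks like, here is a minimal Python sketch that computes a strand-independent tetranucleotide frequency profile per contig and compares two profiles by Euclidean distance. It is not how MetaBAT, CONCOCT, or VizBin are implemented. The sequences are toy examples, the function names are hypothetical, and the PCA or t-SNE step itself is left to a numerical library.

```python
from itertools import product
from math import sqrt

K = 4
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a k-mer and its reverse complement by the same key."""
    return min(kmer, reverse_complement(kmer))


# 136 strand-independent tetranucleotides (256 k-mers collapsed by strand).
CANONICAL_KMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=K)})


def tetranucleotide_profile(seq):
    """Relative frequency of each canonical 4-mer in seq, in a fixed order."""
    counts = dict.fromkeys(CANONICAL_KMERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other codes
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CANONICAL_KMERS)
    return [counts[kmer] / total for kmer in CANONICAL_KMERS]


def euclidean_distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy sequences standing in for contigs (not real data).
at_rich_contig = "AATTATATAAATTTATAATA" * 10
gc_rich_contig = "GCCGGCGCGGCCCGCGGCGC" * 10

profile_at = tetranucleotide_profile(at_rich_contig)
profile_gc = tetranucleotide_profile(gc_rich_contig)
print(len(profile_at), euclidean_distance(profile_at, profile_gc))
```

Stacking one such profile per contig gives a contigs-by-136 matrix, which is the kind of table that PCA or t-SNE would then reduce to two dimensions for plotting. Very short contigs give noisy profiles, so a minimum contig length is usually applied first.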

12 months ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.5k wrote:

Another possibility:

If you know the bacterial genome, you can remove most of its sequences by filtering the reads with BBSplit.

Then you need to assemble again with the filtered reads. Neither mapping with Bowtie nor filtering with BBSplit can guarantee that you remove all bacterial sequences, since some of the host-derived reads will not be present in the bacterial reference genome you use.
