Question

How to obtain a novel virus' complete genome from de novo assembly

0

4.3 years ago

brianna.flynn ▴ 20

Hello all,

I have a viral genomics question for you! I am analyzing an RNA-Seq library composed of pooled RNA samples from bumble bees across the United States, in order to quantify the diversity of viruses that infect these populations. I assembled the RNA-Seq reads into contigs using the de novo assembly option in CLC Workbench and, after searching for viral contigs using BLAST, found one novel virus candidate. From the BLAST search, I know that the candidate is closely related to a family of mosquito viruses. Based on an alignment and a search of the NCBI Conserved Domain Database, I have a rough idea of the expected genome size and of which protein families the virus is likely to encode. However, because it is a new virus, I have no reference against which to check whether I've obtained the full genome. When aligned with its close relatives, the novel virus contig is roughly a third of the length of the related viral genomes, which suggests that this contig does not represent the complete genome.

To solve this issue, I figure that I need to redo the assembly with a pipeline that is better at recovering viral genomes than CLC Workbench. However, I'm not sure of the best way to proceed. Is there a de novo assembler that is particularly good at recovering complete viral genomes?

Is there a way to obtain the complete genome of the novel virus from the RNA-Seq reads, given that I know its approximate size, its close relatives, and (assuming my phylogenetic analysis is correct) which conserved proteins it should encode? Could I map the reads onto a close relative, even though I don't have an exact reference?

Any suggestions on how to proceed would be greatly appreciated! Thank you in advance, Brianna

RNA-Seq Assembly Virus • 1.1k views

ADD COMMENT • link updated 4.3 years ago by Mensur Dlakic ★ 27k • written 4.3 years ago by brianna.flynn ▴ 20

0

What all is expected to be in the sample you sequenced? Bee RNA + RNA viruses + ?

ADD REPLY • link 4.3 years ago by GenoMax 141k

0

Yes, we expect to find bee host RNA, plant RNA and RNA viruses (typically from pollen, though some plant viruses do infect bee hosts), and insect-specific RNA viruses.

ADD REPLY • link 4.3 years ago by brianna.flynn ▴ 20

score 1 · Accepted Answer · 2020-01-04

1

4.3 years ago

Mensur Dlakic ★ 27k

How sure are you that there are no other viral contigs in the existing assembly? If you embed the contigs by their 4-mer or 5-mer frequencies (t-SNE, UMAP), viral contigs are usually easy to spot on the periphery of the plot, even without BLASTing.
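For anyone who wants to try the k-mer idea, here is a minimal sketch of the first step only: turning each contig into a normalized 4-mer frequency vector. The contig names and sequences are made-up toy data, and `kmer_profile` and `euclidean` are hypothetical helper names, not part of any library. In practice you would read the contigs from your assembly FASTA and hand the resulting vectors to a t-SNE or UMAP implementation (for example scikit-learn or umap-learn) instead of comparing raw distances.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies for seq, in a fixed k-mer order."""
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing N or other codes are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy data (assumption): two AT-rich contigs built from the same repeat unit,
# standing in for pieces of one genome, and one GC-rich contig standing in
# for an unrelated genome.
unit_at = "ATATTTAAATATTAATTTAAATATATTTAAATTA"
unit_gc = "GCCGGCGCGGCCGCGCCGGGCGCCGCGGCGCGCC"
contigs = {
    "contig_a": unit_at * 3,
    "contig_b": (unit_at[7:] + unit_at[:7]) * 3,
    "contig_gc": unit_gc * 3,
}

profiles = {name: kmer_profile(seq) for name, seq in contigs.items()}

print("a vs b :", euclidean(profiles["contig_a"], profiles["contig_b"]))
print("a vs gc:", euclidean(profiles["contig_a"], profiles["contig_gc"]))
```

Contigs with similar composition end up close together in this 256-dimensional space, which is what the embedding then makes visible in two dimensions. Very short contigs give noisy profiles, so it is usually worth filtering by length first.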

Some of your questions are answered in this thread. If you expect to have certain proteins, PLASS may help your assembly.
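To give a feel for what working at the protein level means, here is a small sketch of six-frame translation of a single read using the standard genetic code. This is only an illustration of the general idea, not PLASS's actual implementation, and the function names (`six_frame_translations` and friends) and the example read are made up for this post.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered by T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string (non-ACGT characters are left as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate seq from its first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Return the translations of a read in all six reading frames."""
    read = read.upper().replace("U", "T")
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical 12-nt read: ATG GCC AAA TAA
for frame, peptide in six_frame_translations("ATGGCCAAATAA").items():
    print(frame, peptide)
```

Because protein sequences are more conserved than nucleotide sequences, comparing and overlapping translated reads can recover coding regions of a divergent virus that a nucleotide assembler breaks apart. The trade-off is that you get protein sequences rather than a nucleotide genome.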

ADD COMMENT • link 4.3 years ago by Mensur Dlakic ★ 27k

0

Hey Mensur, I didn't clarify this in the parent post, but we did find other viral contigs (known bee, insect and plant viruses). The viral contig I focus on in the post is the one that we think is a new species. Thank you for the suggestions! I'll look into using PLASS.

ADD REPLY • link 4.3 years ago by brianna.flynn ▴ 20