Forum: Confirmation of metagenomic data
Whetting (Bethesda, MD) wrote, 2.2 years ago:

Hi,

As a community we are unsure about how to deal with new viruses discovered through metagenomic tools. Currently, the rules state that in order to be considered a novel (papillomavirus) isolate, the entire genome has to be cloned and sequenced. However, more and more people are choosing not to follow this rule, especially with metagenomic data. We would like to develop a new set of rules that would allow "metagenomic genomes" to be considered "real" [edit based on Josh's answer: by "real" I mean: is the identified genome the naturally occurring complete sequence of this virus, rather than an artifact such as an assembly-induced hybrid?]. One of the concerns is the reproducibility of the assembly method. We are floating the idea of having two independent labs perform the assembly de novo as a confirmation step.
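
As a minimal sketch of what such a two-lab confirmation step might compare (an illustration only, not part of any existing rule set): papillomavirus genomes are circular, so two labs can assemble the same genome yet report it starting at a different position or on the opposite strand. The hypothetical helper `same_circular_genome` below treats two assemblies as matching only if they are identical up to rotation and reverse complement. A real comparison would also need to tolerate small sequence differences, for example by aligning the two assemblies instead of requiring an exact match.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def same_circular_genome(a: str, b: str) -> bool:
    """True if b equals a up to rotation and/or reverse complement.

    Hypothetical helper: exact matching only, no tolerance for SNPs or indels.
    """
    a, b = a.upper(), b.upper()
    if len(a) != len(b):
        return False
    # Every rotation of a circular sequence is a substring of the sequence doubled.
    doubled = a + a
    return b in doubled or reverse_complement(b) in doubled


# Toy sequences standing in for the two labs' assemblies (made up for illustration).
lab1 = "AACGTCTG"
lab2 = "GTCTGAAC"  # same circle, different start position
print(same_circular_genome(lab1, lab2))
```

Exact matching is the strictest possible criterion; whether the community would want that, or a percent-identity threshold, is an open question in this thread.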

I have two questions for you guys and gals:

1) Does this sound reasonable, too strict, not strict enough?

2) Obviously, assembly is time-consuming and isn't trivial. Would you guys like to share some thoughts on preferred assemblers, pipelines, etc.?

 

as always,

Thanks!

Josh Herr (University of Nebraska) wrote, 2.2 years ago:

1) Does this sound reasonable, too strict, not strict enough?

I'm a little thrown off by your question; of course genomes from mixed samples are real genomes. You're trying to establish standards with which to describe new species of viruses from metagenomic data. There are already a lot of groups doing this, so I think standards are worked out to a certain extent. People have been defining organisms on the basis of their DNA for four (or more) decades now, so why should more data change things? More data is really the only difference with metagenomics, as we have been sequencing mixed samples through cloning, etc., for the last 30 years.

Here's a commentary I co-authored earlier this year on the systematics and taxonomy of environmentally derived sequence data (it focuses on plants as a host, but human-host-associated papillomaviruses are no different in an ecological sense). I hope it helps.

2) Obviously, assembly is time-consuming and isn't trivial. Would you guys like to share some thoughts on preferred assemblers, pipelines, etc.?

This is a quickly evolving and constantly changing field right now and I don't think the community has come up with a preferred pipeline or system.  I feel like I could write a book on what to do and what not to do here, but I think you have to dive into the literature.  

The main assembly program I use now for metagenomics (MEGAHIT) didn't exist a year ago, so it's hard to gauge standards at the moment. There are tons of great new tools out there, and twice as many poor ones.

For what you are working on with papillomaviruses, I would highly recommend the pipeline (though it's less than 2 years old, it is probably already dated) from this really good paper by Itai Sharon in Jill Banfield's lab: Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization.

The bottom line (Istvan Albert mentions this in his answer) is the fear of assembly chimeras or misassembled metagenomic genomes. How do you know that what you have assembled is actually the correct genome in your sample? Long reads will change this field, but for now you have to be extra careful when claiming you have a new organism on the basis of a metagenome assembly. I would look at any metagenome assembly or short-read annotation with skepticism.


Thanks Josh, I agree that I need to clarify my question (see edits). I assume that the virus in the samples is real; the concern is whether the assembled genome reflects this reality. As Istvan pointed out, we are concerned about chimeras, etc.
 

written 2.2 years ago by Whetting
Istvan Albert (University Park, USA) wrote, 2.2 years ago:

I would suggest validating via platforms such as the MinION. Currently this produces error-prone but very long reads. Assembling from those is a bit tedious (to say the least), but verifying assemblies with them is very straightforward. Nothing proves that an assembly is correct better than an alignment (even a messy one) over its entire length. The problems with assemblies are usually not about sequence identity but about joining unrelated fragments.

In silico assemblies, especially from metagenomic data, run the risk of producing chimeric sequences, and they have a really hard time with genomes that share similarities or that come from diverse populations.
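
The long-read check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not an established validation protocol: it assumes the long reads have already been aligned to the assembled contig by some aligner, and that each alignment has been reduced to a `(start, end)` interval on the contig (0-based, end exclusive). The function names, the `slack`, `window`, and `flank` parameters, and the toy intervals are all hypothetical. The contig is treated as linear, so for a circular genome the junction at the chosen start position would need separate handling.

```python
from typing import List, Tuple

# One long-read alignment on the contig: (start, end), 0-based, end exclusive.
Alignment = Tuple[int, int]


def full_length_reads(alignments: List[Alignment], contig_len: int,
                      slack: int = 50) -> int:
    """Count alignments that cover the contig end to end,
    allowing up to `slack` unaligned bases at each edge."""
    return sum(1 for s, e in alignments
               if s <= slack and e >= contig_len - slack)


def unsupported_windows(alignments: List[Alignment], contig_len: int,
                        window: int = 500, flank: int = 100) -> List[int]:
    """Return start positions of windows that no single read spans
    with `flank` extra bases on both sides (clipped at contig ends).
    Such windows are candidate chimeric joins."""
    bad = []
    for w in range(0, contig_len, window):
        lo = max(0, w - flank)
        hi = min(contig_len, min(w + window, contig_len) + flank)
        if not any(s <= lo and e >= hi for s, e in alignments):
            bad.append(w)
    return bad


# Toy example: an 8 kb contig where reads stop and start at position 4000.
reads = [(0, 4000), (4000, 8000)]
print(full_length_reads(reads, 8000))    # no read covers the whole contig
print(unsupported_windows(reads, 8000))  # windows around the 4000 junction
```

A handful of reads counted by `full_length_reads` is the strongest evidence; the window scan is a fallback for when reads are long but not quite full length.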


That's an interesting point. It may be cost-prohibitive for the "confirming" lab to use something like the MinION. Likewise, requiring a certain technology may not be the best avenue either?

written 2.2 years ago by Whetting

A MinION device costs $1000; the flow cell can be run multiple times, and you can use it to validate different findings.

I could see this being offered commercially as well. We are just not trained to think in terms of: All I need is 10 good reads that are covering my entire virus. 

written 2.2 years ago by Istvan Albert