Question: Tools For Combined Genome Variation Analysis Of Short Sequencing Reads For Several Genome Strains
6.2 years ago, 14134125465346445 (3.4k, United Kingdom) wrote:

What are the recommended tools for combined genome variation analysis of short-read sequencing data from several strains of a given organism? Traditionally, people have invested most of their effort in deep sequencing and assembling one specific strain (or, if possible, a single individual) to obtain a reference genome assembly, and then used whatever money was left for further sequencing to assess the variability in the other important strains.

The only case from a few years ago that did not follow this pattern was the Sanger sequencing of several strains of Drosophila simulans, all at low coverage, which were pooled and used to define the simulans reference genome.

If one takes the approach of doing the same amount of sequencing for a group of strains without an existing reference genome, what would be the best tools to assess the genomic variability in the group of strains?

EDIT: for example, in this paper on a cattle pathogen, the authors resequenced 10 strains of a species that already had a reference genome. They did a very sound variation analysis by comparing the 10 resequenced strains after mapping them to the reference. My question is: what tools would someone use if the 10 sequenced strains were from a species without a reference genome?

genome assembly tools variation • 2.5k views
modified 6.2 years ago by zam.iqbal.genome (1.7k) • written 6.2 years ago by 14134125465346445 (3.4k)

I have the feeling that you would need to compile a reference first.

written 6.2 years ago by fo3c (420)

Does what you are asking resemble metagenomics? Except that, instead of collecting samples directly from the environment, you are talking about sequencing them from culture media in the lab, without a reference genome, and then comparing them?

written 6.2 years ago by Medhat (8.2k)

6.2 years ago, zam.iqbal.genome (1.7k, United Kingdom) wrote:

The Cortex variation assembler is designed for precisely this! The idea is to simultaneously de novo assemble a joint graph of all your samples and look for differences, without using a reference. You can then use population/segregation statistics to distinguish variants from repeats and errors. Finally, if you have a reference, you can use it to provide coordinates. It was first published here:

De novo assembly and genotyping of variants using colored de Bruijn graphs. Z Iqbal, M Caccamo, I Turner, P Flicek, G McVean, Nature Genetics (2012)

More recently, we published on how to use it for microbial genomics, with a new pipeline wrapper that makes it a lot more user friendly: you give it an index file listing the sample IDs and the FASTQ files belonging to each, and it does all the assembly, error removal, variant discovery and genotyping, and produces a VCF.

High-throughput microbial population genomics using the Cortex variation assembler. Z Iqbal, I Turner, G McVean, Bioinformatics 2013

Here's an example of it being used in a longitudinal study of S. aureus:

Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease. B Young, T Golubchik et al, Proc. Nat. Acad. Sci. (2012)
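As a rough illustration of the joint-graph idea described above (not Cortex's actual implementation or interface), the toy Python sketch below tags every k-mer with the set of samples ("colors") it was seen in and reports the k-mers that are missing from at least one sample. The function names, the sample reads and the choice of k=4 are all assumptions made for the example. It ignores reverse complements, sequencing errors and repeats, and it only flags color-specific k-mers rather than walking the graph to call variants.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_colored_kmers(samples, k):
    """Map each k-mer to the set of sample names (colors) whose reads contain it."""
    colors = defaultdict(set)
    for name, reads in samples.items():
        for read in reads:
            for km in kmers(read, k):
                colors[km].add(name)
    return colors


def variable_kmers(colors, n_samples):
    """Return k-mers that are absent from at least one sample."""
    return {km: sorted(c) for km, c in colors.items() if len(c) < n_samples}


# Hypothetical toy data: two strains that differ by a single base.
samples = {
    "strainA": ["ACGTACGGA"],
    "strainB": ["ACGTTCGGA"],
}

colors = build_colored_kmers(samples, k=4)
diff = variable_kmers(colors, n_samples=len(samples))

for km, present_in in sorted(diff.items()):
    print(km, present_in)
```

In this toy case the single-base difference produces k sample-specific k-mers per allele, flanked by k-mers shared by both strains. That is the kind of "bubble" structure a variation assembler looks for, although a real tool also has to separate such signals from errors and repeats.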

Sorry for the self-publicity, but it is an answer to your question! You do need to think carefully about how experimental design (number of samples, coverage per sample, read length) affects your power to discover variants. Assembly is typically less sensitive than mapping, although more specific.
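To make the dependence on coverage and read length concrete, here is a minimal sketch using the standard relation between base coverage C, read length L and k-mer coverage, C_k = C * (L - k + 1) / L, together with a simple Poisson model for how often a given k-mer is observed. The read length of 100, k=31 and the minimum count of 2 are illustrative assumptions, not recommendations, and the model assumes error-free, uniformly sampled reads.

```python
import math


def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for a given base coverage and read length."""
    return base_coverage * (read_length - k + 1) / read_length


def prob_kmer_seen(kmer_cov, min_count=1):
    """Poisson probability that a k-mer is observed at least min_count times."""
    p_fewer = sum(
        math.exp(-kmer_cov) * kmer_cov ** i / math.factorial(i)
        for i in range(min_count)
    )
    return 1.0 - p_fewer


# Illustrative per-sample coverages; read length and k are assumed values.
for cov in (5, 10, 30):
    ck = kmer_coverage(cov, read_length=100, k=31)
    p = prob_kmer_seen(ck, min_count=2)
    print(f"{cov}x base coverage: k-mer coverage {ck:.1f}, P(seen >= 2 times) = {p:.4f}")
```

Shorter reads or a larger k reduce the effective k-mer coverage for the same amount of sequencing, which is one reason per-sample coverage and read length both matter for discovery power.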

Best, Zam

modified 6.2 years ago by Istvan Albert ♦♦ (80k) • written 6.2 years ago by zam.iqbal.genome (1.7k)

Hi Zam, brilliant, thanks! Can you say a bit more in your answer about coverage and read length per sample? Assume, for example, two situations: 5 to 10 samples of a farm animal with a vertebrate-sized genome (2-3 Gbp), and 10 to 100 samples of a pathogen species with a 50-300 Mbp genome (all Illumina data, ideally from "easy-to-produce" libraries). How would you spend the money?
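For scale, the total sequencing yield implied by each scenario is simply samples × genome size × per-sample coverage. The sketch below works this out for the upper end of both ranges; the 10x per-sample coverage is an arbitrary placeholder chosen for the arithmetic, not a recommendation, and the function name is hypothetical.

```python
def total_bases(n_samples, genome_size_bp, coverage):
    """Total bases to sequence for a given number of samples and per-sample coverage."""
    return n_samples * genome_size_bp * coverage


# Upper ends of the two ranges in the question; coverage is an assumed placeholder.
scenarios = {
    "farm animal": {"n_samples": 10, "genome_size_bp": 3_000_000_000},
    "pathogen": {"n_samples": 100, "genome_size_bp": 300_000_000},
}

for name, s in scenarios.items():
    bases = total_bases(s["n_samples"], s["genome_size_bp"], coverage=10)
    print(f"{name}: {bases / 1e9:.0f} Gbp in total")
```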

modified 6.2 years ago • written 6.2 years ago by 14134125465346445 (3.4k)

Well, it depends on what you want to achieve. Are you essentially SNP-finding? Or do you want to find SVs (I guess not)? Are you going to do population genetics? Is your aim to be as sensitive as possible and then filter, or would you just like a very conservative set? Do you want to build a SNP array? There's a section towards the end of the Supplementary Material of the Nature Genetics paper that covers some options. That's not to say I won't reply here too! Give me an idea of what you want, and I'll try to give you a decent answer. Cheers, Zam

written 6.2 years ago by zam.iqbal.genome (1.7k)