Modernizing Reference Genome Assemblies By Including Common Structural Variation In Them
9.8 years ago
William ★ 5.1k

I just came across a paper from last year which argues that reference genomes need to include common structural variation so that sequencing data can be mapped to them more reliably, and so that (smaller?) less common variation from resequencing data can be understood in the right context (in personal genomics, but also in model organisms).

For this, the reference object would need to change from a set of linear objects to a set of graphs, one for each chromosome, and mappers would need to be able to work with these new references. Mapping data would result in a path through this graph, which would identify the broader population group(s) (ancestry) of your sample. Annotation of the reference genome would also somehow need to work with multiple paths through the graph instead of one path (scale) on a linear object.
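To make the graph idea concrete, here is a minimal sketch, assuming a toy graph with one common insertion represented as an alternative branch. The node names, sequences, and the `paths_containing` helper are all hypothetical and only illustrate how a read can be assigned to a path; a real mapper would need indexing and inexact matching rather than brute-force path enumeration.

```python
from typing import Dict, List

# Hypothetical toy graph for one chromosome region: node id -> sequence.
NODES: Dict[str, str] = {
    "left": "ACGTAC",
    "ref": "GG",        # allele present in the linear reference
    "ins": "GGTTTTGG",  # common insertion allele (made up for illustration)
    "right": "CATGCA",
}

# Directed edges: the two alleles form a "bubble" between left and right.
EDGES: Dict[str, List[str]] = {
    "left": ["ref", "ins"],
    "ref": ["right"],
    "ins": ["right"],
    "right": [],
}


def enumerate_paths(node: str, edges: Dict[str, List[str]]) -> List[List[str]]:
    """Return every path from `node` to a node without outgoing edges."""
    if not edges[node]:
        return [[node]]
    return [[node] + rest
            for nxt in edges[node]
            for rest in enumerate_paths(nxt, edges)]


def spell(path: List[str], nodes: Dict[str, str]) -> str:
    """Concatenate node sequences along a path."""
    return "".join(nodes[n] for n in path)


def paths_containing(read: str, start: str,
                     nodes: Dict[str, str],
                     edges: Dict[str, List[str]]) -> List[List[str]]:
    """Return the paths whose spelled sequence contains `read` exactly."""
    return [p for p in enumerate_paths(start, edges)
            if read in spell(p, nodes)]


if __name__ == "__main__":
    # A read spanning the bubble is compatible with only one allele path.
    print(paths_containing("ACGGCA", "left", NODES, EDGES))
    print(paths_containing("TTTT", "left", NODES, EDGES))
```

A read that falls entirely inside a shared node matches both paths, which is one reason annotation on such a reference cannot rely on a single coordinate scale.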

This got me interested in the current status of the inclusion of common structural variation in references, and in the software and data ecosystem around it. Are there any such references or mappers yet? Will all common SNP / indel / CNV / SV data move into the reference?

The paper is from the Genome Reference Consortium (GRC) (EBI, Sanger, NCBI etc.): http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001091

reference genome variation annotation • 2.1k views
ADD COMMENT
9.8 years ago
jts ▴ 240

I don't know of any mapper that can fully use population data. For one possible algorithmic approach, you might be interested to read this paper by Veli Mäkinen's group. They develop a way to index and search multiple related genomes. It's quite technical but very interesting.

ADD COMMENT

Thanks for the very interesting reference.

ADD REPLY
9.8 years ago
SES 8.5k

The Cortex assembler can be used for assembly and variant detection in population samples, and I'm sure it can be applied to the topics you mention. There is an impressive list of publications that have been produced using this software, so that might be a good place to start exploring how it can be applied to your questions.

ADD COMMENT

I'm afraid Cortex doesn't do what you want, although I and others are working on various things related to what William wants. A couple of comments on William's actual question, then some brief Cortex stuff:

  1. Replacing chunks of the reference with major alleles or longer alleles will not really work (except in special lucky cases). It does sound reasonable, but you run into problems relating to how these SVs fit into the population haplotypes, and to paired-end information.
  2. Many people are indeed thinking along the lines you mention, but I don't think there is a solution yet.

Re Cortex:

  1. Although you can't do standard alignments to the graph, it does allow you to "align" fasta/fastq to a population/multicolour graph and get back coverages in each sample/colour. For example, you might align a set of gene fasta to a population graph to see whether there is gene copy number change happening.
  2. Our latest paper on microbial genomics shows how you can use a graph as a repository of sequence and variation in a population, against which you can compare new samples.
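As a rough illustration of what "coverage per colour" means, here is a minimal sketch, assuming a toy coloured k-mer table built in memory. This is not Cortex's implementation or interface; the function names are hypothetical, reverse complements are ignored, and real reads would need error handling.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping k-mers of `seq` in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_coloured_graph(samples: Dict[str, List[str]],
                         k: int) -> Tuple[List[str], Dict[str, List[int]]]:
    """Count k-mers per sample; each sample is one 'colour'.

    Returns the sorted sample names and a table k-mer -> counts per colour.
    """
    names = sorted(samples)
    graph: Dict[str, List[int]] = defaultdict(lambda: [0] * len(names))
    for colour, name in enumerate(names):
        for read in samples[name]:
            for km in kmers(read, k):
                graph[km][colour] += 1
    return names, dict(graph)


def coverage_per_colour(query: str, names: List[str],
                        graph: Dict[str, List[int]],
                        k: int) -> Dict[str, float]:
    """Mean k-mer coverage of `query` in each colour."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        raise ValueError("query is shorter than k")
    totals = [0] * len(names)
    for km in query_kmers:
        for colour, count in enumerate(graph.get(km, [0] * len(names))):
            totals[colour] += count
    return {name: totals[c] / len(query_kmers)
            for c, name in enumerate(names)}


if __name__ == "__main__":
    # Made-up data: sample s2 carries the "gene" at twice the depth of s1.
    samples = {"s1": ["ACGTACGT"], "s2": ["ACGTACGT", "ACGTACGT"]}
    names, graph = build_coloured_graph(samples, k=4)
    print(coverage_per_colour("ACGTACGT", names, graph, k=4))
```

A colour whose mean coverage over a gene is roughly double that of the others would be a candidate for a copy number gain, subject to the usual caveats about sequencing depth and repeats.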

However, the bottom line is still the bottom line, I'm afraid: the current version of Cortex is just a step along the way.

best regards

Zam

ADD REPLY

Thanks for the comments and clarification. I will definitely try to follow your progress in this area with Cortex or other tools.

ADD REPLY
