Question

Approaches in a post-WGS era for encoding genome information


9.6 years ago

Sean Davis 26k

In a post-WGS era where we have many whole genomes, are there approaches or thoughts about how to encode personal genomes beyond diffs from a reference (as graphs, for example)? I have found the topic difficult to search for and wondered if anyone could share some thoughts, references, or links of interest.

genome • 1.5k views

Updated 2.3 years ago by Ram 43k • written 9.6 years ago by Sean Davis 26k

Ram · Answer 1 · 2014-09-08

Have a look at pan-genomes: the full set of genes (or nucleotides) found across all individuals or strains of a species. The core genome is the subset of that set shared by all of them.

This recent publication represents microbial pan-genomes in a compressed de Bruijn graph, which allows for SNP calling and structural-variant calling.
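To give a feel for the idea, here is a minimal Python sketch of a "colored" de Bruijn graph over two toy sequences. It is not the compressed representation used in the publication: it skips the compaction of unbranched paths, and the names `build_colored_dbg`, `branch_nodes`, `strain_a` and `strain_b` are made up for illustration. Each k-mer becomes an edge between its (k-1)-mer prefix and suffix, labelled with the genomes that contain it, and a node with more than one outgoing edge marks a place where the genomes diverge.

```python
from collections import defaultdict


def build_colored_dbg(genomes, k):
    """Map each k-mer edge (prefix -> suffix) to the set of genomes containing it."""
    edges = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Nodes are (k-1)-mers; the k-mer itself is the edge between them.
            edges[(kmer[:-1], kmer[1:])].add(name)
    return edges


def branch_nodes(edges):
    """Return nodes with more than one outgoing edge, i.e. candidate variant sites."""
    successors = defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)
    return {node: sorted(dsts) for node, dsts in successors.items() if len(dsts) > 1}


# Two toy "strains" that differ by a single SNP (A -> T at index 7).
genomes = {
    "strain_a": "ATGGCGTACCTGA",
    "strain_b": "ATGGCGTTCCTGA",
}

edges = build_colored_dbg(genomes, k=4)
branches = branch_nodes(edges)
print(branches)  # {'CGT': ['GTA', 'GTT']}
```

In real genomes, repeats also create branches, so a branching node on its own is only a hint that a variant may be present.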

As a side-note: This paper from the same people compares rice cultivars, but with a different method. For more complex genomes such as those of humans or plants, I believe annotation and assembly methods are not yet mature enough. Because these genomes are so complex, any differences between strains or cultivars may well just be assembly artefacts that you have to verify some other way.

Some interesting work has been done in E. coli, where the genomes are now (relatively) finished and well-annotated; Figure 4 here is nice.

If you google for 'genome compression', you'll find quite a few papers looking at novel ways to store genomes, either lossy (some data is discarded but little information is lost, as with MP3) or lossless (as with FLAC). A fun side-effect is that you can infer novel insights on population structure etc. from compression artefacts, such as in this paper here.
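As a rough illustration of the lossless case, the sketch below compresses a sample sequence against a reference by handing the reference to zlib as a preset dictionary. This is a general-purpose trick, not the method of any of the papers mentioned; the helper names and the random toy data are assumptions, and zlib's 32 KB window means it only works as written on very short sequences.

```python
import random
import zlib


def compress_against(reference: bytes, sample: bytes) -> bytes:
    """Losslessly compress `sample`, using `reference` as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(sample) + comp.flush()


def decompress_against(reference: bytes, blob: bytes) -> bytes:
    """Invert compress_against; the same reference must be supplied."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(blob) + decomp.flush()


# Toy data: a random 2 kb "reference" and a sample differing at three positions.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000)).encode()
mutable = bytearray(reference)
for pos in (100, 700, 1500):
    mutable[pos] = ord("C") if mutable[pos] == ord("A") else ord("A")
sample = bytes(mutable)

plain = zlib.compress(sample, 9)                 # no knowledge of the reference
with_ref = compress_against(reference, sample)   # reference-aware

print(len(sample), len(plain), len(with_ref))
```

The reference-aware output should be much smaller, because most of the sample can be expressed as long copies from the dictionary. The cost is that the file is useless without the exact reference it was compressed against.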

There's probably a ton more I can't think of right now but these are interesting times!

Ram · Answer 2 · 2014-09-09


9.6 years ago

Sean Davis 26k

Deanna Church was nice enough to add this input via https://twitter.com/deannachurch.

Improved genome inference in the MHC using a population reference graph

Alexander Dilthey, Charles J Cox, Zamin Iqbal, Matthew R Nelson, Gil McVean

http://biorxiv.org/content/early/2014/07/08/006973


Ram · Answer 3 · 2016-02-21

This popped up again to the top, and I think it's a pretty interesting question :)

It is difficult to say what format our genomes should/could be stored in. The full nucleotide sequence can't really be stored in a format more efficient than 2bit for every chromosome. A graph or VCF-style "diff" file (where our genomes vary from a reference) would be a lot smaller, but it might be more trouble than it's worth, since any 'personal analysis' will have to be with respect to a full reference genome anyway.
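To make the size argument concrete, here is a minimal Python sketch of the two options: packing every base into 2 bits, versus storing only the positions that differ from a reference. The base-to-code mapping and the function names are illustrative assumptions. This is not the actual .2bit file layout, which also has headers and handles N and masked regions, and the diff only covers single-base substitutions.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Recover the first `length` bases from pack_2bit output."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def snv_diff(reference: str, sample: str):
    """List (position, ref_base, sample_base) for equal-length sequences."""
    return [(i, r, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


reference = "ACGTTGCAAC"
sample = "ACGTTGGAAC"

packed = pack_2bit(sample)
print(len(sample), "bases ->", len(packed), "bytes")
print(snv_diff(reference, sample))  # [(6, 'C', 'G')]
```

The packed form is self-contained but always costs a quarter of a byte per base, while the diff is tiny but meaningless without the reference.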

More importantly, although our genomes are all very similar (similar enough that a diff makes sense), our epigenomes may not be.

You and I might both have identical copies of a 10kb gene, so nothing there in the VCF, but if our methylation levels across that gene are different, that is important info that will need to be encoded somehow. What is the reference human methylome? And when 99% of the genome now has extra information to encode, a diff file doesn't look like such a good strategy.

In all honesty, I think we will all have custom genome FASTAs, with custom annotations, and that will be one of our smallest raw datasets. Most of the data will come from multiple epigenomic marks & chromatin conformation, from multiple cell types throughout your body, for multiple time points throughout your life. As great as graph genomes are, they are here to solve a mapping problem; but with better sequencing tech around the corner, mapping will become trivial as reads get longer and sequencing more accurate. Graphs are also good right now for comparing genomes, as two genomes can be represented in the same graph, but theoretically speaking this is really no different from bundling one (or more) VCFs with a FASTA in the same file. It isn't a small and compact way to represent a personal genome.
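The "FASTA plus VCF" equivalence can be sketched in a few lines of Python: given a reference string and a list of VCF-like records, you can rebuild the personal sequence. The record format here is a simplification I am assuming for illustration (0-based positions, non-overlapping variants, one haplotype); real VCF is 1-based and carries much more.

```python
def apply_variants(reference: str, variants):
    """Rebuild a personal sequence from a reference plus (pos, ref, alt) records.

    Positions are 0-based and variants must not overlap (simplifying assumptions).
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged stretch before the variant
        pieces.append(alt)                    # the personal allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])         # unchanged tail
    return "".join(pieces)


reference = "ACGTTGCAAC"
variants = [
    (2, "G", "T"),    # substitution
    (5, "GC", "G"),   # 1 bp deletion
    (8, "A", "ATT"),  # 2 bp insertion
]

personal = apply_variants(reference, variants)
print(personal)  # ACTTTGAATTC
```

Note how the insertion and deletion shift every downstream coordinate, which is exactly the shared co-ordinate problem that graph representations try to handle.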

So I suspect analysis will be done with machine learning algorithms trained on other people's personal genomes, and applied to our own, with no reference genomes being used by anyone ever. In fact the very idea of comparing different people/animals on a shared co-ordinate system will seem old-fashioned and archaic to bioinformaticians in 2030.

But that's just a guess. Maybe the VCF format gets further extended to encode all human biometric data, and at 02:14 on August 29th, the first VCF file becomes self-aware....