Question

Summarizing Personal Genomics Data From A Large Number Of Individuals

3

13.7 years ago

Allpowerde ★ 1.3k

Hi, visualizing genomic data for an individual is pretty straightforward (discussion here). A few individuals can still be managed, but once you have more individuals than fit neatly as tracks on the screen, it gets tricky.

In this case, one can summarize the data by saying "1000 individuals had this SNP, while 600 had this one" or visualize it as SNP-hotspot tracks. But isn't there a better way to summarize/visualize it, especially given that one feature or data set is never enough? You end up comparing X HapMap individuals with the individuals of the 1000 Genomes Project on a multitude of features like SNPs, CNVs, SVs ...

Does anyone have a good set of tools/concepts for this problem?

next-gen sequencing hapmap genome • 2.8k views

updated 13.2 years ago by Casbon ★ 3.3k • written 13.7 years ago by Allpowerde ★ 1.3k

score 3 · Answer 1 · 2010-08-12

There's no way to visualize everything. It would probably help to step back and ask yourself, "what exactly is the point of this study?" and "what questions are we trying to answer?". These should drive the type of visualization and analysis that you perform.

If these are all individuals with a specific disorder, you may be looking for unusual and recurrently altered genes or pathways. So find a way to highlight these genes. What about a simple plot, where you put the genomic coordinates on the x-axis and the frequency of mutations in each gene on the y-axis? That should let you easily identify highly mutated genes.
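To make the plot idea concrete, here is a minimal sketch of the numbers behind it. The gene names, coordinates, and mutation calls are made up for illustration, and "frequency" is assumed to mean the fraction of individuals with at least one mutation in the gene. The bars are printed as text, but the same values could be fed to any plotting tool.

```python
from collections import defaultdict

# Hypothetical input: one (individual, gene) pair per observed mutation.
mutations = [
    ("ind1", "GENE_A"), ("ind2", "GENE_A"), ("ind3", "GENE_A"),
    ("ind1", "GENE_B"), ("ind1", "GENE_B"),  # same person twice: counted once
    ("ind2", "GENE_C"),
]
# Made-up (chromosome, start) coordinates, used only to order the x-axis.
gene_start = {"GENE_A": (2, 9000), "GENE_B": (2, 500), "GENE_C": (1, 1000)}


def gene_mutation_frequency(mutations, n_individuals):
    """Fraction of individuals carrying at least one mutation in each gene."""
    carriers = defaultdict(set)
    for individual, gene in mutations:
        carriers[gene].add(individual)
    return {gene: len(ids) / n_individuals for gene, ids in carriers.items()}


freqs = gene_mutation_frequency(mutations, n_individuals=4)
ordered = sorted(freqs, key=gene_start.get)

# Text stand-in for the plot: genes in genomic order, bar length = frequency.
for gene in ordered:
    chrom, start = gene_start[gene]
    bar = "#" * round(freqs[gene] * 20)
    print(f"chr{chrom}:{start}\t{gene}\t{freqs[gene]:.2f}\t{bar}")
```

Counting each individual once per gene keeps a single heavily mutated sample from dominating the y-axis.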

What about doing some pathway analysis: which KEGG pathways or GO terms are overrepresented? Figures like this one can help summarize those relationships. Using something like Cytoscape, you can create heat maps, showing a whole pathway and coloring specific members according to how frequently they're altered.
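One common way to ask whether a pathway is overrepresented is a one-sided hypergeometric test. The sketch below assumes that is the test you want (dedicated KEGG/GO tools may use other statistics and will also correct for multiple testing), and the function name and the gene counts in the example are purely illustrative.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_pathway, n_altered, n_overlap):
    """P(X >= n_overlap) when n_altered genes are drawn without replacement
    from n_universe genes, n_in_pathway of which belong to the pathway."""
    total = comb(n_universe, n_altered)
    upper = min(n_in_pathway, n_altered)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_universe - n_in_pathway, n_altered - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers: 20 genes in total, 5 in the pathway,
# 4 altered genes, 2 of which fall in the pathway.
p_value = hypergeom_enrichment_p(20, 5, 4, 2)
print(f"enrichment p-value: {p_value:.4f}")
```

In practice you would run this once per pathway and adjust the resulting p-values before coloring anything in a network view.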

Are you looking at structural information, identifying breakpoints of rearrangements? Circos is a nice tool for visualizing this, especially if there are interchromosomal translocations.

Bottom line: pretty pictures are nice, but what's important is that they give you some insight into the system you're studying, so start there.

score 2 · Answer 2 · 2010-08-12

13.7 years ago

Lars Juhl Jensen 11k

I have done barely any work on such data, but the first thing that comes to my mind is "dimensionality reduction". There is no way that you can visualize the full data on 1000 individuals. Conversely, the type of summary statistics that you mention may throw away too much detail. I could imagine using methods such as principal component analysis, independent component analysis, or multi-dimensional scaling to capture as much of the data as possible in as few dimensions as possible.

Sorry that I cannot suggest anything more concrete than that.
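As a rough illustration of the principal component idea, the sketch below finds the first principal component of a small genotype matrix by power iteration, using only the standard library. The matrix (rows are individuals, columns are SNPs, entries are alternate-allele counts) is invented for the example, and the function name is hypothetical. For real data you would use a numerical library and look at more than one component.

```python
import math
import random


def first_principal_component(genotypes, n_iter=200, seed=0):
    """Return (loadings, scores) for the first PC of a genotype matrix."""
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]

    # Random start so the vector is not accidentally orthogonal to the PC.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]

    # Power iteration on X^T X without forming it: v <- X^T (X v), normalised.
    for _ in range(n_iter):
        proj = [sum(x * w for x, w in zip(row, v)) for row in centered]
        new_v = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(w * w for w in new_v))
        if norm == 0.0:  # no variance in the data at all
            break
        v = [w / norm for w in new_v]

    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores


# Invented data: individuals 0-2 and 3-5 come from two different groups.
genotypes = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 1],
    [2, 2, 0, 1, 1],
    [0, 0, 2, 2, 1],
    [0, 1, 2, 2, 1],
    [0, 0, 2, 1, 1],
]
loadings, scores = first_principal_component(genotypes)
for i, score in enumerate(scores):
    print(f"individual {i}: PC1 score {score:+.2f}")
```

The sign of a principal component is arbitrary, so only the relative positions of the individuals along PC1 are meaningful.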

0

Hi Lars, I probably did not make this clear in my question: I'm not talking about data analysis (e.g. association to find a candidate SNP for a disease). I'm just talking about taking stock of the data I have in the context of other data sets.

13.7 years ago by Allpowerde ★ 1.3k

score 2 · Answer 3 · 2010-08-12

13.7 years ago

Pierre Lindenbaum 161k

For this problem, I'm using a key/value datastore (BerkeleyDB).

The key is a position on the genome
The value is an array of genotypes for 'N' individuals.

With this layout, you can quickly query and compare such large data sets.
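A minimal sketch of that key/value layout, assuming a plain Python dict as a stand-in for the on-disk store (this is not the BerkeleyDB API). The key format, the one-byte-per-individual genotype encoding (alternate-allele count 0/1/2), the helper names, and the two cohorts are all assumptions made for the example.

```python
# In-memory dict standing in for an on-disk key/value store.
store = {}


def make_key(chrom, pos):
    """Zero-padded position so keys sort in genomic order within a chromosome."""
    return f"{chrom}:{pos:09d}".encode()


def put(store, chrom, pos, genotypes):
    """Value: one byte per individual holding the alternate-allele count."""
    store[make_key(chrom, pos)] = bytes(genotypes)


def carriers(store, chrom, pos, individuals):
    """Number of the given individuals carrying at least one alternate allele."""
    value = store.get(make_key(chrom, pos))
    if value is None:
        return 0
    return sum(1 for i in individuals if value[i] > 0)


# Hypothetical cohorts, identified by column index in the genotype array.
cohort_a = [0, 1, 2]
cohort_b = [3, 4, 5]

put(store, "chr1", 12345, [0, 1, 2, 0, 0, 1])
put(store, "chr1", 67890, [0, 0, 0, 2, 1, 1])

# Per-position carrier counts for both cohorts, in key order.
summary = {
    key.decode(): (
        sum(1 for i in cohort_a if value[i] > 0),
        sum(1 for i in cohort_b if value[i] > 0),
    )
    for key, value in sorted(store.items())
}
for key, (n_a, n_b) in summary.items():
    print(f"{key}\tcohort A: {n_a}\tcohort B: {n_b}")
```

Because every value has the same column order, comparing two data sets at a position is just a matter of reading one record and counting over two index lists.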

HDF5 is also an option AFAIK.

score 1 · Answer 4 · 2010-08-13

13.7 years ago

Casbon ★ 3.3k

http://browser.1000genomes.org/index.html

The 1kg browser is extending Ensembl to handle this level of variation. However, there doesn't seem to be much in the way of releases.
