Question

Modelling The Conservation Of Genome Positions

2

13.4 years ago

Andrea_Bio ★ 2.8k

Hi

I don't know if this is an appropriate question for this forum. If not, I apologise in advance and won't ask questions of this nature again. If it is appropriate I look forward to an interesting answer.

Does anyone have any interesting ideas for how you might model the conservation of genome positions in a database? The obvious first thought is a table called Locus with one row per base in the genome and fields called Position, Chromosome and conservationScore.

That's a large database table for the human genome, but it can be indexed and should be a breeze for any enterprise database such as MySQL. However, I was interested to learn of any other perspectives and approaches. Sometimes when the solution is so obvious you don't think laterally enough.

conservation • 2.4k views

ADD COMMENT • link updated 13.3 years ago by biobot 0.0.77.a.1099 6.2k • written 13.4 years ago by Andrea_Bio ★ 2.8k

0

Using a database implies that you want to do relational queries - is that your aim? Can you give an example of the biological question you want to answer? Are you expecting data for every base? What is the thing each base is compared with to get conservationScore?

ADD REPLY • link 13.4 years ago by biobot 0.0.77.a.1099 6.2k

0

At present the conservation scores will just be inserted into the database and then used by the biologist to assess the importance of a SNP at this position. My other data is in a relational database, but I'm not averse to obtaining a locus for a SNP from a database and then using that as a hook to get conservation scores from another format, if that format is smaller and more efficient.

ADD REPLY • link 13.4 years ago by Andrea_Bio ★ 2.8k

0

I haven't looked into the availability of the data thoroughly, but I think I will use the human genome to start with. I was thinking of phyloP, phastCons and GERP scores, but I don't know enough about how these are calculated to know if I should expect correlation. It appears not, from this post: Correlation Between Genome Conservation Scores (Phastcons Vs. Phylop)?

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.4 years ago by Andrea_Bio ★ 2.8k

score 6 · Answer 1 · 2010-11-25

13.4 years ago

Mitch Skinner ▴ 660

For large amounts of data, there's also UCSC's bigwig file format:

http://bioinformatics.oxfordjournals.org/content/26/17/2204.full

It's not in a database, but it is indexed, and it can also include higher-level (more zoomed-out) summarized versions of the data.

ADD COMMENT • link 13.4 years ago by Mitch Skinner ▴ 660

score 4 · Answer 2 · 2010-11-25

I would look at how the UCSC genome browser does it (start at the Table Browser, choose Comparative Genomics as the Group, and hit the describe table schema button). There's an extensive description of how they put together the conservation tracks there, and you can pull down the schema (with data if you like) to see how they implemented things.

score 3 · Answer 3 · 2010-11-28

13.4 years ago

biobot 0.0.77.a.1099 6.2k

I've voted for both the UCSC schema and bigwig answers here. My main concerns with the table you've suggested are that if your data are sparse you'll waste most of the rows and probably need nullable columns (e.g. score), and that your table doesn't have a column to indicate what the similarity is to, or what method generated the similarity measurement.

The first thing that I'd do is look at existing systems, such as UCSC and Ensembl. See the Ensembl Variation database described in this paper.

A technique that I've used is to create a table with GIS (geographic information system) indexing to express chromosomal positions or ranges. GIS and spatial indices are meant for geographic location queries, but work equally well for genomic coordinates.

When I have genuine continuous data with a single value per base across a whole genome, I often simply store it in a single memory-mapped vector per chromosome.
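A minimal sketch of what a memory-mapped vector per chromosome could look like, using only Python's standard library. The answer does not say which language or layout was used, so everything here is an assumption made for illustration: the file layout (one little-endian float32 per base, with index 0 holding position 1), the use of NaN for a missing score, and the hypothetical names `write_scores`, `score_at` and `chr_toy.f32`.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: one little-endian float32 score per base.
RECORD = struct.Struct("<f")


def write_scores(path, scores):
    """Write one float32 per base; index 0 holds the score for position 1."""
    with open(path, "wb") as handle:
        for score in scores:
            handle.write(RECORD.pack(score))


def score_at(path, position):
    """Return the score at a 1-based position without reading the whole file."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            offset = (position - 1) * RECORD.size
            if position < 1 or offset + RECORD.size > len(view):
                raise IndexError(f"position {position} is outside the vector")
            return RECORD.unpack_from(view, offset)[0]


# Toy data standing in for one chromosome; NaN marks "no score available".
toy_path = os.path.join(tempfile.mkdtemp(), "chr_toy.f32")
write_scores(toy_path, [0.5, -1.25, float("nan"), 2.0])
print(score_at(toy_path, 2))
```

The lookup is a single offset calculation, so no index is needed. For repeated queries it would probably be better to keep the map open rather than reopen the file on every call as this sketch does.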

ADD COMMENT • link 13.4 years ago by biobot 0.0.77.a.1099 6.2k

0

I didn't list all of the columns, for brevity. There could be additional columns depending on the nature of the data. I used to work in GIS systems. I'm not that familiar with the conservation scores. It sounds like you are suggesting that you could have a score that applies to a range of genome coordinates rather than individual bases. I appreciate the point about lots of null rows, but couldn't I just not insert a row at that base position if data wasn't available? I could use left outer joins to make sure any joins worked.

ADD REPLY • link 13.4 years ago by Andrea_Bio ★ 2.8k

0

A SNP score will apply to a single base, but small indel scores apply to ranges. Whether you need to deal with ranges will depend on your type(s) of similarity data. I would omit rows where there are no data; then the problem of nulls goes away and your table collapses to a very manageable size.
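The sparse design discussed in these replies could be sketched roughly as follows, using Python's built-in `sqlite3` module as a stand-in for whatever relational database is actually in use. The table and column names (`conservation`, `snp`, `start_pos`, `end_pos`, `method`) and the toy rows are assumptions for illustration, not a schema from UCSC, Ensembl or the original poster. Rows exist only where a score is available, each row covers an inclusive range (a single base has `start_pos = end_pos`), and a `method` column records what produced the score.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE conservation (
        chromosome TEXT    NOT NULL,
        start_pos  INTEGER NOT NULL,  -- 1-based, inclusive
        end_pos    INTEGER NOT NULL,  -- 1-based, inclusive
        method     TEXT    NOT NULL,  -- which method produced the score
        score      REAL    NOT NULL,  -- no NULLs: missing data means no row
        PRIMARY KEY (chromosome, start_pos, end_pos, method)
    );
    CREATE TABLE snp (
        snp_id     TEXT PRIMARY KEY,
        chromosome TEXT    NOT NULL,
        position   INTEGER NOT NULL
    );
    """
)

# Made-up rows: one single-base score and one score covering a range.
conn.executemany(
    "INSERT INTO conservation VALUES (?, ?, ?, ?, ?)",
    [
        ("chr1", 100, 100, "phyloP", 2.5),
        ("chr1", 200, 205, "phastCons", 0.75),
    ],
)
conn.executemany(
    "INSERT INTO snp VALUES (?, ?, ?)",
    [("snp_a", "chr1", 100), ("snp_b", "chr1", 203), ("snp_c", "chr1", 300)],
)

# The LEFT OUTER JOIN keeps SNPs that have no score; their score columns are NULL.
rows = conn.execute(
    """
    SELECT s.snp_id, c.method, c.score
    FROM snp AS s
    LEFT OUTER JOIN conservation AS c
      ON c.chromosome = s.chromosome
     AND s.position BETWEEN c.start_pos AND c.end_pos
    ORDER BY s.snp_id
    """
).fetchall()

for row in rows:
    print(row)
```

On a large table a range join like this may end up scanning many rows per SNP, which is the kind of query the spatial indexing mentioned earlier in the thread is meant to help with.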

ADD REPLY • link 13.4 years ago by biobot 0.0.77.a.1099 6.2k