Is There an Existing Data Format Suitable for Holding DNA Sequence Variations in a Population?
3
6
12.3 years ago
Nickengland ▴ 130

If I have multiple genomic DNA sequences for a small protein, and want to represent this variation in a file, are there any existing formats to do this?

I could save all the sequences one per line and re-calculate the variation information each time, but this is a waste of computational resources.

Before I create YASF (yet-another-sequence-format), I was wondering whether anyone knows of an existing one.

It should ideally be able to represent A|C|G|T with an optional 3 or 4 floats for the relative abundance of each. I wouldn't want to store the floats if there was no variation at any particular point in the sequence.

If it could handle gaps/insertions that would be useful too!
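A minimal sketch of the kind of representation described above, assuming the sequences are already aligned to equal length with `-` marking gaps. The function name `column_profile` and the in-memory layout are hypothetical and not taken from any existing format: constant columns are stored as a single character, and only variable columns carry relative abundances.

```python
from collections import Counter


def column_profile(aligned_seqs):
    """Summarise an alignment column by column.

    A constant column is stored as a single character; a variable column
    is stored as a dict mapping each symbol (A, C, G, T or '-') to its
    relative abundance. Hypothetical layout, for illustration only.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        if len(counts) == 1:
            # No variation here, so no floats are stored.
            profile.append(column[0])
        else:
            total = len(column)
            profile.append({sym: n / total for sym, n in sorted(counts.items())})
    return profile


# Toy alignment (made-up sequences, not real antibody data).
seqs = ["ACGT-A", "ACGTTA", "ACCT-A", "ACGT-A"]
profile = column_profile(seqs)
```

Gaps are treated as just another symbol here, so an insertion present in only some sequences shows up as a variable column containing `-`.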

[Edit] I should have mentioned that the proteins in question are going to be antibodies, and so the data will consist of large numbers of different sequences based on similar VDJ recombination building-blocks with somatic hypermutation providing a vast number of similar, yet different, DNA sequences.

format • 2.3k views
6
12.3 years ago

Perhaps MAF (http://www.bioperl.org/wiki/MAF_multiple_alignment_format) is applicable? VCF (Variant Call Format) is also useful for this type of information. The output formats of some multiple alignment programs might be useful, and position-specific weight matrices might also be applicable. It all depends on what you want to do with the data and where the data come from.
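To make the VCF suggestion concrete, here is a rough sketch that turns aligned sequences into VCF-style records. It assumes a gap-free reference of the same length as every aligned sequence, reports substitutions only (gaps are skipped), and uses a made-up contig name `antibody_region`. The function `snv_records` is hypothetical, and a real VCF file would also need the `##fileformat` meta-information lines, which are left out here.

```python
from collections import Counter


def snv_records(reference, aligned_seqs, contig="antibody_region"):
    """Build tab-separated VCF-style lines for substitutions.

    Assumes `reference` has no gaps and every sequence has the same
    length as `reference`. Indels are ignored in this sketch.
    """
    lines = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
    n = len(aligned_seqs)
    for i, ref_base in enumerate(reference):
        counts = Counter(s[i] for s in aligned_seqs)
        # Alternate alleles: any observed base that differs from the reference.
        alts = sorted(b for b in counts if b != ref_base and b in "ACGT")
        if not alts:
            continue  # no variation at this position, nothing to store
        af = ",".join(f"{counts[b] / n:.3f}" for b in alts)
        # POS is 1-based in VCF; '.' marks missing ID, QUAL and FILTER.
        lines.append(
            f"{contig}\t{i + 1}\t.\t{ref_base}\t{','.join(alts)}\t.\t.\tAF={af}"
        )
    return lines


# Toy data (made up for illustration).
records = snv_records("ACGT", ["ACGT", "ACCT", "ATCT", "ACGT"])
print("\n".join(records))
```

Only variable positions produce a record, which matches the wish not to store abundances where there is no variation. Whether fixed reference coordinates suit heavily rearranged sequences is a separate question.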

1

I think VCF is going to be what you want to use here.

0

I think MAF is going to be more appropriate than VCF, as the data I will be using will have ambiguous/unknown positional information in the chromosome, because random nucleotides are inserted or deleted during B-cell development.

2
12.3 years ago

If you need to create a new format, use an RDF file with an already defined ontology (e.g. http://variationontology.org/).

You can then query your data with SPARQL, transform the data using XSLT, share...

2
12.3 years ago

How about the Genome Variation Format: http://www.sequenceontology.org/resources/gvf.html
