I'm starting a project where I need the sequence for a number of individuals. I have been given several large VCF files, which I understand contain each individual's differences from the reference genome. What I want to do is recreate the sequence for each individual for particular genes. Is there already software available that will do this?
In any case, I believe I understand the initial columns, which describe what's present in the reference genome (e.g. A), its position, and the potential alternatives (e.g. G,T). My confusion comes in when I look at the data describing the individuals.
If we take the example on line 3 here http://www.internationalgenome.org/wiki/Analysis/vcf4.0/ (chrom 20, position 1110696, reference A, alternatives G,T), it describes the per-individual information in the format GT:GQ:DP:HQ, so for individual NA00001 this is 1|2:21:6:23,27. When I looked up the 1|2 portion, my understanding is that it represents the two alleles at this position, but which should I use in the sequence when reconstructing? For example, if the reference genome looks like this
"...ATGT A CTGA..." where the A here is at position 1110696, how would the genome for an individual look?
Would it look like both "...ATGT G CTGA..." and "...ATGT T CTGA..."? And is there a way to determine which is the "correct" sequence to use?
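To make my current understanding concrete, here is a minimal Python sketch of how I am reading that record. It assumes that in the GT field the index 0 means the reference allele, 1 and 2 index into the comma-separated alternatives, and that "|" means the two alleles are phased while "/" means they are not. The function names decode_genotype and apply_allele are my own, not from any library, and the 9-base fragment is just the toy context from above.

```python
import re

def decode_genotype(ref, alt, sample_field, fmt="GT:GQ:DP:HQ"):
    """Return (alleles, phased) for one sample column of one VCF record."""
    alleles = [ref] + alt.split(",")   # index 0 = REF, 1.. = ALT entries
    values = dict(zip(fmt.split(":"), sample_field.split(":")))
    gt = values["GT"]
    phased = "|" in gt                 # "|" = phased, "/" = unphased
    calls = re.split(r"[|/]", gt)
    # "." marks a missing call; report it as None
    return [None if c == "." else alleles[int(c)] for c in calls], phased

def apply_allele(context, offset, ref, allele):
    """Replace ref at the given 0-based offset of context with allele."""
    assert context[offset:offset + len(ref)] == ref, "REF does not match context"
    return context[:offset] + allele + context[offset + len(ref):]

# Record from the linked example: REF=A, ALT=G,T, NA00001 = 1|2:21:6:23,27
alleles, phased = decode_genotype("A", "G,T", "1|2:21:6:23,27")

context = "ATGTACTGA"                  # toy fragment; the A at index 4 is the site
haplotypes = [apply_allele(context, 4, "A", a) for a in alleles]

print(alleles, phased)                 # ['G', 'T'] True
print(haplotypes)                      # ['ATGTGCTGA', 'ATGTTCTGA']
```

If I have this right, NA00001 would have two sequences at this site, one per chromosome copy, rather than a single "correct" one, and with "/" instead of "|" I would not know which copy carries which allele.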
I really appreciate any help people can offer me on this. Thanks.