I am going to write a somewhat indirect answer to your question - I am going to reframe it a bit for you. I hope that this will be helpful in the long run, even if I don't tackle your specific question.
The first thing I'd like to ask you to do is think about nucleotide variants that have differing allele frequency across human populations. For instance, let's say rs12345 has allele frequency 0.31 C 0.69T in Western European populations, but AF: C 0.82 and T 0.18 in Yorubans. Although you used the word reference, the first thing I want to do in this case is point out the meaninglessness of the terms "major" and "minor" allele.
Reference and alternate, as you have stated above, I think is a better nomenclature ... but here's the kicker... the more you think about individual instantiations of human genomes and all the diversity of our species, the less the concept of a "reference" makes means ... There are literally thousands of HLA class I alleles, and the purpose of this diversity is to ensure the survival of the species from myriad pathogenic threats. Why would we call one of these a reference?
Shortcomings of the idea of a reference genome:
The reference is missing hundreds of millions of bp of sequence. Our reference genome is based on only 2 Caucasian people. Because these people are of European descent and thus went through a population bottleneck, our reference genome necessarily does not reflect the totality of what is in the larger (human) population. If we still believe the oft quoted "statistic" that "we are 99.9% the same" it might make sense to use one, but recent results don't support that idea. Though there are some reasonable criticisms of this manuscript in Nature Genetics]1, I think nevertheless the authors make some fair points, including about how much information may be missing from our "reference".
The use of a reference genome that is strongly skewed towards one ancestry creates problems in variant calling algorithms. This is technical, but the gist is that the callers lend more credence to variants that are known to exist. This means if European genomes are the reference and the most data have been generated on individuals of European and East Asian descent, then we get better quality results when sequencing individuals more similar to those populations.
The need to have a reference genome persists due to bioinformatic necessity, not biologic utility What I mean is, as long as our sequencing technology is dominated by techniques that massively amplify short reads and then align them, then it will seem necessary to align them to something, and that might as well be called a reference. I am not trying to claim that there are no biologists who have benefited from using the "reference genome model", so to speak, it was an OK starting place. But it is a grossly limited concept that I think lends itself to inadequate appreciation of how variable genomes really are not to mention what the purpose of that diversity is... I'll try to explain in 4.
There are better ideas for how to represent genetic information out there already. There's a growing body of literature that uses Alignment-free
read mapping. There are also emerging ways of representing the genome on the horizon; put for example "Genome Graphs.
" into Google or Pubmed. Social Justice and such aside (viz. point 2) I think if you look at images of genome graphs and reflect on the points above, you'll conclude as I have that reference genomes quite simply contain less information about biology.
Back to your question: Anyway, you didn't want to read any of this. You just wanted to know how to represent something in a darn v4.4 VCF file. I can respect that. But I took time out to write this because, in retrospect, it would have saved me time to explode the idea of a reference from the outset. It's a quaint, antiquated historical artifact that must not limit our view of what genetics is, and must not be allowed to negatively impact social justice in genomic research and clinical practice.
Good luck!!!! VL