Question: Which Column Of A VCF File Indicates The Reference Allele?
3
thecuriousbiologist (480), United States, wrote 7.5 years ago:

Hi,

I have something like a VCF file, shown below:

chr     position        A1      A2
16      85955663        G       A
16      85955671        A       G
16      85955948        A       G

The first column is the chromosome number, the second column is the position, the third column is A1 and the fourth column is A2.

I am unable to figure out if A1 is the reference allele or if A2 is the reference allele.

Is there a way I can find this out? These are human SNPs.

vcf allele reference • 4.4k views
modified 7.4 years ago by swbarnes2 (7.5k) • written 7.5 years ago by thecuriousbiologist (480)

Did these come from a microarray or from sequencing? (They could be in TOP/BOT nomenclature.)

written 7.5 years ago by Jeremy Leipzig (19k)

These came from sequencing.

written 7.4 years ago by thecuriousbiologist (480)
6
Matt Shirley (9.3k), Cambridge, MA, wrote 7.5 years ago:

If you know the genome build, you could download it from UCSC or NCBI and compare a few alleles from A1 and A2 to the reference to answer your question. If the file you have follows the VCF specification, then A1 is the REF allele:

#CHROM POS     ID        REF    ALT
20     14370   rs6054257 G      A
20     17330   .         T      A
20     1110696 rs6040355 A      G,T
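
For a file that does follow the VCF layout, the REF and ALT columns can be picked out by header name rather than by counting columns. The following is a minimal sketch, not a full VCF parser: the function name `ref_alt_from_vcf` is made up for illustration, and it splits on any whitespace so that the excerpt above parses as pasted (a real VCF is tab-delimited, so splitting on tabs would be the safer choice there).

```python
def ref_alt_from_vcf(lines):
    """Yield (chrom, pos, ref, alts) per record, locating columns by header name."""
    columns = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("##"):
            continue  # skip blank lines and "##" meta-information lines
        if line.startswith("#"):
            # Header line, e.g. "#CHROM POS ID REF ALT ..."
            columns = line.lstrip("#").split()
            continue
        if columns is None:
            raise ValueError("no #CHROM header line before the first record")
        # Assumption: whitespace-separated fields (real VCF uses tabs)
        record = dict(zip(columns, line.split()))
        # ALT may list several alleles separated by commas
        yield record["CHROM"], int(record["POS"]), record["REF"], record["ALT"].split(",")


example = """#CHROM POS     ID        REF    ALT
20     14370   rs6054257 G      A
20     17330   .         T      A
20     1110696 rs6040355 A      G,T
""".splitlines()

records = list(ref_alt_from_vcf(example))
for chrom, pos, ref, alts in records:
    print(chrom, pos, "REF=" + ref, "ALT=" + ",".join(alts))
```

A table headed `chr position A1 A2`, like the one in the question, has no REF or ALT column, so this lookup would fail with a `KeyError`; that alone suggests the file is not a standard VCF.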
modified 7.5 years ago • written 7.5 years ago by Matt Shirley (9.3k)

I tried doing this; however, for some of the positions, the reference nucleotide is not in either of the two columns. Does this mean that neither of the two columns is the reference?

written 7.4 years ago by thecuriousbiologist (480)

This could mean that you have the wrong reference genome. How many of the other sites match the reference genome that you have?

written 7.4 years ago by Matt Shirley (9.3k)

Only about 15-20% of the sites match. I re-checked, and I am sure I am using the correct reference genome. This sequencing experiment was done using the hg19 build of the human genome, and I used the NCBI hg19 reference genome for the comparison. Could there be some heterozygous mutations in the reference used for sequencing?

written 7.4 years ago by thecuriousbiologist (480)
1

Based on what you describe, are you sure that A1 and A2 are not the genotypes of individual samples? The VCF spec allows multiple individuals in one file. If you have more than one individual, there will be instances where the genotype of one of the samples will match the reference when the other sample is variant. There will also be positions where both samples share the same variant allele. Would this explain your A1 and A2 columns?

written 7.4 years ago by Matt Shirley (9.3k)

Out of curiosity, which genotyper produced this format?

written 7.4 years ago by Vivek (2.4k)
6
Chris Miller (21k), Washington University in St. Louis, MO, wrote 7.5 years ago:

It's very possible that your file does not describe the reference allele at that position but instead gives the two alleles identified at that location. If you're looking at somatic mutations, most sites will be heterozygous, and your alleles will be the reference allele and the somatic mutation (say, G/A). In other cases, you might see a homozygous mutation (A/A) or, rarely, two mutations at the same site (A/T).

You can use a reference fasta along with samtools faidx to quickly grab the reference allele at any given position, which may help you determine whether your first column is always the reference allele or not.
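
On the command line, that lookup is typically something like `samtools faidx hg19.fa chr16:85955663-85955663`, where the file name `hg19.fa` is a placeholder and the chromosome name (`chr16` or `16`) has to match the FASTA headers. To check a whole table at once, here is a rough Python sketch of the same comparison. It assumes 1-based positions, and the toy sequence and rows are made up purely so the example runs on its own; the helper names `read_fasta` and `classify` are hypothetical. Loading a whole genome into memory this way is not practical, so for real data an indexed lookup such as `samtools faidx` is the better tool.

```python
from collections import Counter


def read_fasta(lines):
    """Parse FASTA text into {sequence_name: sequence}; suitable for small files only."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # The sequence name is the first word after ">"
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line.upper())
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def classify(ref_base, a1, a2):
    """Report which allele column, if any, equals the reference base."""
    matches = [label for label, allele in (("A1", a1), ("A2", a2)) if allele == ref_base]
    return "+".join(matches) or "neither"


# Made-up data for illustration only (not real hg19 sequence)
toy_fasta = [">toy made-up sequence", "ACGTAC"]
rows = [("toy", 1, "A", "G"), ("toy", 2, "T", "C"), ("toy", 3, "A", "T")]

genome = read_fasta(toy_fasta)
# Positions are assumed to be 1-based, hence the "- 1" when indexing
labels = [classify(genome[chrom][pos - 1], a1, a2) for chrom, pos, a1, a2 in rows]
tally = Counter(labels)
print(tally)
```

If nearly every row comes back as `A1`, the first allele column is the reference; a large `neither` count points to a different build, a coordinate offset, or alleles reported on the other strand.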

written 7.5 years ago by Chris Miller (21k)

Thanks for the reply. I am new to SNPs, so I had this very basic question. For position 85955663, my reference from NCBI suggests that it should be "T". However, neither of the two columns is "T" for that position, as can be seen from my question above. They are "G" and "A". Does this mean that neither of my columns is the reference?

written 7.4 years ago by thecuriousbiologist (480)
2
Jeremy Leipzig (19k), Philadelphia, PA, wrote 7.4 years ago:

It is possible, even if this is sequencing related, that the variants are in the TOP/BOT nomenclature typically reserved for Illumina GoldenGate genotyping chips.

http://www.illumina.com/documents/products/technotes/technote_topbot.pdf

TOP/BOT, although I still cannot wrap my head around what it is supposed to do, will someday be easily understood by aliens or future generations of humanoid-like organisms.
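
One quick test of the strand idea, sketched below, is to see whether the reference base shows up once the two alleles are complemented. This does not implement Illumina's TOP/BOT rules; it only checks for alleles reported on the opposite strand, and the function name `strand_check` is made up for illustration. The example uses the values reported in this thread for position 85955663 (reference "T", alleles "G" and "A").

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def strand_check(ref_base, a1, a2):
    """Guess the strand on which a pair of SNP alleles is reported."""
    alleles = {a1, a2}
    if ref_base in alleles:
        return "forward"      # reference base is one of the alleles as written
    if ref_base in {COMPLEMENT[a] for a in alleles}:
        return "reverse"      # reference base appears only after complementing
    return "unexplained"      # neither strand accounts for the reference base


# Reference "T" with alleles "G" and "A", as reported for position 85955663
result = strand_check("T", "G", "A")
print(result)  # reverse
```

For A/T and C/G SNPs the complemented pair is identical to the original pair, so this check cannot tell the strands apart at those sites.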

written 7.4 years ago by Jeremy Leipzig (19k)
0
swbarnes2 (7.5k), United States, wrote 7.4 years ago:

In a real VCF, the columns are labeled "REF" and "ALT", so it's no mystery. I'd check with whoever prepared that table, because if NCBI says that the ref is neither of those letters, you might not be looking at the right reference, or the SNP calling was done against a slightly different reference.

While it's possible that the letters in your table are two alternate alleles, I don't think there are many points in the genome that are triallelic like that. You should not have a whole long list of such points.

written 7.4 years ago by swbarnes2 (7.5k)