3.1 years ago
BAGeno ▴ 180

Does anyone know on what basis PharmGKB decides the reference allele of a haplotype? I checked several SNPs in dbSNP, and the reference alleles in the PharmGKB haplotype tables do not always match those in dbSNP. For example, for rs3818247 the reference allele in dbSNP is G, but in the PharmGKB haplotype table for HNF4A the reference allele is T.

3.1 years ago

I would try to avoid thinking along the lines of reference / alternate (REF / ALT) bases when using PharmGKB data. The data in PharmGKB is curated from published literature and just lists the known bases that define each haplotype. These bases may or may not be the reference base in a given reference genome build. The only thing that you need to know, however, is the exact bases that form the haplotypes.

Note that an rs ID from dbSNP should not be thought of in terms of REF / ALT either; it is better regarded as a label for a position in the genome whose genotype varies among individuals. For all intents and purposes, REF / ALT is meaningless, and even the 'reference' base may be pathogenic. Please read my answer here for more on this: A: Alternate nucleotide is more frequent than reference nucleotide. OMG I'm dizzy.

Here is the haplotype table for HNF4A from PharmGKB:

All you need to worry about are those bases listed for each haplotype, irrespective of whether they are the REF or ALT in a given reference genome. Thus, when processing your own data and matching to these haplotype tables, you simply need to know the base at each relevant position in your sample. Technically, you don't even need to use a variant caller.

@Kevin I was wondering about positions at which no variant is found. When processing my data, I found all of the variants from the above-mentioned haplotype table except rs3212200. Does this mean that the position of this variant carries the reference allele, which is T? Without considering the uncalled variants, I cannot do the haplotype analysis.

Well, that is possibly the mistake that you are making. You should not be looking for variants in your samples, i.e., simply running a variant caller to look for variant alleles is not sufficient / relevant for this type of work. You should just be looking for the base at each relevant position, irrespective of whether it's a REF or ALT in relation to the genome build that you're using. This can be done outside of a variant caller, or, by using a variant caller, you just need to configure it to output the genotype at each relevant site, even if it's a reference base. Once you then have your genotypes over each position, you can then easily compare to the PharmGKB haplotype tables.
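To make the comparison step concrete, here is a minimal Python sketch of matching the bases observed on each phased chromosome copy against a haplotype table. The haplotype names, site keys, and bases are hypothetical placeholders, not real PharmGKB entries, and the function names are made up for illustration.

```python
# Hypothetical haplotype definitions: haplotype name -> {site: defining base}.
# Names, sites, and bases are placeholders, not real PharmGKB data.
HAPLOTYPES = {
    "hap1": {"site1": "T", "site2": "C", "site3": "G"},
    "hap2": {"site1": "G", "site2": "C", "site3": "G"},
    "hap3": {"site1": "G", "site2": "T", "site3": "A"},
}


def matching_haplotypes(observed, haplotypes):
    """Return names of haplotypes whose defining bases all match `observed`.

    `observed` maps site -> base for ONE chromosome copy. REF / ALT status
    plays no role; only the base itself is compared.
    """
    matches = []
    for name, definition in haplotypes.items():
        if all(observed.get(site) == base for site, base in definition.items()):
            matches.append(name)
    return matches


def call_diplotype(phased_sample, haplotypes):
    """`phased_sample` maps site -> (base on copy 1, base on copy 2)."""
    copy1 = {site: alleles[0] for site, alleles in phased_sample.items()}
    copy2 = {site: alleles[1] for site, alleles in phased_sample.items()}
    return (matching_haplotypes(copy1, haplotypes),
            matching_haplotypes(copy2, haplotypes))


# One sample with a known base at every relevant site on both copies.
sample = {"site1": ("T", "G"), "site2": ("C", "T"), "site3": ("G", "A")}
print(call_diplotype(sample, HAPLOTYPES))  # (['hap1'], ['hap3'])
```

A site that is missing from `observed` never matches, so every relevant position needs a known base before the comparison is meaningful.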

Do not just go by rs ID because rs IDs are in no way a unique representation of a base at a given position. Many rs IDs are even duplicated in the same reference build.

I only have a phased VCF, as I did not call the variants myself, so I do not have a genotype for every position. Is there no way to do this kind of work with a VCF?

You don't have the BAM file or even FASTQ / FASTA? Do you know the reference genome build (likely recorded in the VCF header)?
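If the header is intact, a quick way to check is to scan its meta-information lines. The sketch below assumes the file carries the optional `##reference=` and/or `##contig=<...,assembly=...>` lines described in the VCF specification; not every VCF has them. The function name and the example header (including the reference path) are hypothetical.

```python
import re


def reference_hints(vcf_header_lines):
    """Collect header fields that usually identify the reference build."""
    hints = {"reference": None, "assemblies": set()}
    for line in vcf_header_lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM column header
        if line.startswith("##reference="):
            hints["reference"] = line.strip().split("=", 1)[1]
        elif line.startswith("##contig="):
            match = re.search(r"assembly=([^,>]+)", line)
            if match:
                # Some writers quote the value; strip the quotes if present.
                hints["assemblies"].add(match.group(1).strip('"'))
    return hints


# Hypothetical header for illustration only.
header = [
    "##fileformat=VCFv4.2",
    "##reference=file:///path/to/hypothetical_reference.fa",
    "##contig=<ID=20,length=64444167,assembly=GRCh38>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
print(reference_hints(header))
```

If neither field is present, the build has to be established some other way before any REF base can be trusted.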

If you know the reference genome that was used and there is no variant listed in your VCF for a given position of interest, then you can infer beyond reasonable doubt that the genotype at the position is the REF base. You can pull the REF base from the reference genome itself (may require downloading), or simply look it up in, for example, the UCSC Genome Browser (if it's not too much work).
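As a rough illustration of that inference, the Python sketch below reads the data lines of a single-sample phased VCF and, for each position of interest, returns the called alleles if a record exists and otherwise falls back to the REF base on both copies. It is only a sketch under stated assumptions: SNVs only, one sample column, and a `ref_bases` dictionary that you have filled in yourself from the correct reference build (the dictionary, the function name, and the example records are all hypothetical). Sites are keyed by chromosome and position rather than by rs ID. Bear in mind that a variants-only VCF cannot distinguish a homozygous REF site from one that was never covered, so the fallback is an assumption, not a measurement.

```python
def alleles_at_sites(vcf_lines, sites, ref_bases):
    """Return {(chrom, pos): (allele1, allele2)} for each site of interest.

    vcf_lines : lines of a single-sample, phased VCF (SNVs only in this sketch)
    sites     : iterable of (chrom, pos) tuples to report
    ref_bases : {(chrom, pos): base} looked up in the reference genome
                (hypothetical stand-in for a FASTA or genome browser lookup)
    """
    called = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        alleles = [ref] + alt.split(",")      # index 0 is REF, 1.. are ALTs
        gt = fields[9].split(":")[0]          # GT is the first FORMAT key when present
        if "." in gt:
            called[(chrom, pos)] = None       # explicit no-call: do not guess
            continue
        indices = [int(i) for i in gt.replace("/", "|").split("|")]
        called[(chrom, pos)] = tuple(alleles[i] for i in indices)

    result = {}
    for site in sites:
        if site in called:
            result[site] = called[site]
        else:
            # Assumption: no record at this site means homozygous REF.
            result[site] = (ref_bases[site], ref_bases[site])
    return result


# Hypothetical records and positions, for illustration only.
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "20\t100\t.\tG\tT\t.\tPASS\t.\tGT\t0|1",
]
sites = [("20", 100), ("20", 200)]
ref_bases = {("20", 200): "C"}
print(alleles_at_sites(vcf, sites, ref_bases))
```

The resulting per-site alleles can then be compared directly with the bases listed in a haplotype table.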