How to add reference alleles to VCF?
20 months ago
dec986 ▴ 330

I’m converting gVCFs to VCF, but the reference alleles are missing. An example is below:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  180525_FD02929177
1   97547947    .   T   .   .   .   DP=31   GT:DP:RGQ   0/0:31:81
1   97915614    .   C   .   .   .   DP=40   GT:DP:RGQ   0/0:40:99
1   97981343    .   A   .   .   .   DP=43   GT:DP:RGQ   0/0:43:99
2   234668570   .   C   T   539.64  .   AC=1;AF=0.500;AN=2;ClippingRankSum=0.340;DP=32;ExcessHet=3.0103;FS=5.748;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=16.86;RAW_MQ=115200.00;SOR=0.150    GT:AD:DP:GQ:PL  0/1:17,15:32:99:547,0,586
2   234669144   .   G   .   .   .   DP=36   GT:DP:RGQ   0/0:36:99


break_blocks --region-file /illumina/runs/con/concordance/fluidigm/fluidigm_positions.tab.bed --ref human_g1k_v37.fasta --exclude-off-target


I’m using GATK thus:

gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R /illumina/runs/con/g1k_v37/human_g1k_v37.fasta \
-V fluidigm.gvcf.202009/HG00099.fluidigm.202009.g.vcf \
-O fluidigm.vcf.202009/HG00099.fluidigm.202009.vcf \
--allow-old-rms-mapping-quality-annotation-data \
--include-non-variant-sites


But none of the GATK options seems to add reference alleles to the REF column; everything is just ".". When I try to do this manually with a Perl script, there are missing data, so programming it myself doesn't work.

Do you know how I can add the reference alleles to VCF/gVCF?

Hey! I'm having the exact same problem! Did you solve it? I would really appreciate any help!

20 months ago
Ram 36k

I don't see any entry with a missing REF. Could it be that you're visually matching the ID column in the header to the REF column in the data?

See below:

#CHROM  POS        ID  REF  ALT  QUAL    FILTER  INFO               FORMAT          180525_FD02929177
1       97547947   .   T    .    .       .       DP=31              GT:DP:RGQ       0/0:31:81
1       97915614   .   C    .    .       .       DP=40              GT:DP:RGQ       0/0:40:99
1       97981343   .   A    .    .       .       DP=43              GT:DP:RGQ       0/0:43:99
2       234668570  .   C    T    539.64  .       AC=1;AF=0.500;...  GT:AD:DP:GQ:PL  0/1:17,15:32:99:547,0,586
2       234669144  .   G    .    .       .       DP=36              GT:DP:RGQ       0/0:36:99
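If you'd rather not trust your eyes on column alignment, here is a minimal sketch in Python that maps each field to its header name. It is only an illustration: `VCF_TEXT` and `parse_records` are names I made up, the rows are copied from the table above (including the abbreviated INFO field), and it splits on any whitespace because the pasted text uses spaces, whereas a real VCF is tab-separated.

```python
# Rows copied from the aligned table above. A real VCF is tab-separated;
# the pasted text uses runs of spaces, so we split on any whitespace here.
VCF_TEXT = """\
#CHROM  POS        ID  REF  ALT  QUAL    FILTER  INFO               FORMAT          180525_FD02929177
1       97547947   .   T    .    .       .       DP=31              GT:DP:RGQ       0/0:31:81
1       97915614   .   C    .    .       .       DP=40              GT:DP:RGQ       0/0:40:99
1       97981343   .   A    .    .       .       DP=43              GT:DP:RGQ       0/0:43:99
2       234668570  .   C    T    539.64  .       AC=1;AF=0.500;...  GT:AD:DP:GQ:PL  0/1:17,15:32:99:547,0,586
2       234669144  .   G    .    .       .       DP=36              GT:DP:RGQ       0/0:36:99
"""


def parse_records(text):
    """Return one dict per data line, keyed by the header's column names."""
    header = None
    records = []
    for line in text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information lines and blank lines
        if line.startswith("#"):
            header = line.lstrip("#").split()  # "#CHROM" becomes "CHROM"
            continue
        if header is None:
            raise ValueError("data line found before the #CHROM header")
        records.append(dict(zip(header, line.split())))
    return records


records = parse_records(VCF_TEXT)
for rec in records:
    # Print REF and ALT side by side so it is clear which one holds "."
    print(rec["CHROM"], rec["POS"], "REF=" + rec["REF"], "ALT=" + rec["ALT"])
```

On these rows, every record has a base in REF; the "." values sit in ID, ALT, QUAL and FILTER.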

How is it that the QUAL column is empty even though there is coverage at those sites? How could you add that information, Ram? I would really appreciate some help!

That's a different question - please search the forum and if you don't find a satisfactory answer, open a new question.

Given that this data is from a different person, odds are only they'll know why QUAL doesn't have data - it's probably an accepted norm in gVCF. (I did a quick Google search and found something pertinent at this link: https://support.illumina.com/help/BS_App_TSA_help/Content/Vault/Informatics/Sequencing_Analysis/BS/swSEQ_mBS_gVCF.htm)
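For what it's worth, in the rows posted in this thread the empty QUAL lines up exactly with the records that have no ALT allele (the 0/0 sites). A small sketch to check that pattern, assuming nothing beyond the five rows shown above; `ROWS` and `split_by_qual` are hypothetical names for illustration, and this only describes the sample, not gVCF behaviour in general.

```python
# (CHROM, POS, REF, ALT, QUAL) taken from the rows posted in this thread.
ROWS = [
    ("1", 97547947, "T", ".", "."),
    ("1", 97915614, "C", ".", "."),
    ("1", 97981343, "A", ".", "."),
    ("2", 234668570, "C", "T", "539.64"),
    ("2", 234669144, "G", ".", "."),
]


def split_by_qual(rows):
    """Split rows into (QUAL missing, QUAL present) lists."""
    missing = [row for row in rows if row[4] == "."]
    present = [row for row in rows if row[4] != "."]
    return missing, present


missing, present = split_by_qual(ROWS)

print("QUAL missing at:", [(chrom, pos) for chrom, pos, *_ in missing])
print("QUAL present at:", [(chrom, pos) for chrom, pos, *_ in present])

# True when every record lacking QUAL also lacks an ALT allele.
no_alt_when_no_qual = all(alt == "." for _, _, _, alt, _ in missing)
print("Missing QUAL only at sites without ALT:", no_alt_when_no_qual)
```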