Merging Vcf Files And Filling In Missing Genotypes
2
7
Entering edit mode
13.4 years ago
Allpowerde ★ 1.3k

I have vcf files with genomic variants (e.g. SNPs and Indels) from several patients. I now want to merge them into one vcf file e.g. using vcftools' merge-vcf. Doing that I get a file that holds for the union of all variants observed amongst the merged files the data for each individual in the columns:

#CHROM  POS  ID  REF   ALT     QUAL    FILTER  INFO    FORMAT  Ind1 Ind2 Ind3
chr1    583  .   G     A       8.44    .       <..>    <..>    1/1  1/0  ./.

However, if there was no variant observed for an individual (because it was in agreement with the reference), it won't have information for this particular position and a "./." is written instead. Now, instead of "./." I want to have the actual genotype printed for the specific variant:

#CHROM  POS  ID  REF   ALT     QUAL    FILTER  INFO    FORMAT  Ind1 Ind2 Ind3
chr1    583  .   G     A       8.44    .       <..>    <..>    1/1  1/0  0/1

So all I need to do is go back to the bam file and lookup the genotype of this SNP or Indel.

Is anyone of you aware of a tool that fills in the missing genotypes by looking up the alignment file (e.g. GATK does it for SNPs but not Indels).

vcf indel snp merge • 12k views
ADD COMMENT
6
Entering edit mode
13.4 years ago

Correct me if I am wrong, but the VCF format definition states that an allele matching the reference is represented by 0, so if no variant was observed for an individual it means that both alleles would match the reference, hence they should be presented as 0/0 (the / stands for unphased genotype).

I guess that the merge-vcf tool does not want to make you think that these positions were typed, and for that reason it uses the ./. code when merging instead of the appropriate 0/0. it should be then straightforward to replace all ./. with 0/0 by simple plain text pattern replacement.

ADD COMMENT
4
Entering edit mode
13.4 years ago
Jashapiro ▴ 230

The problem is that a simple replacement with the reference calls may not be accurate. The SNP caller may have decided that a given call was too low quality/uncertain to report (a het in the example above), but you might want to keep that info if you know there is a SNP at the position from other individuals. mpileup in samtools can call specific bases with the -l or -r options, and it can be used to recall the entire set of individuals, incorporating the information from all of them to decide which variants are most likely to be real.

ADD COMMENT

Login before adding your answer.

Traffic: 2858 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6