Calling heterozygous and homozygous SNPs with SAMtools or VCF ?
1
3
Entering edit mode
7.9 years ago

When calling SNPs with samtools, I need to know which one is Hetero or Homozygote.

I tried the following code, but I have to look at the 1/1 and it doesn't give me the allele count or Percent:

$samtools mpileup -ugf fasta.fa aln.bam | bcftools view -bvcg - > var.raw.bcf$ bcftools view var.raw.bcf | vcfutils.pl varFilter -D1000000 > var.flt.vcf


I'd like to output a file that has 8 columns such as the picture below:

Is there any way to output a file with 8 columns like that one?

vcf heterozygous bam samtools SNP • 7.7k views
2
Entering edit mode
7.9 years ago

The VCF file contains all the information you are looking for. Bcftool makes your life simpler by predicting the genotype for you rather than you looking at the counts and predicting if a SNP is homozygous or heterozygous.

Go through these page to know more about VCF format: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

VCF format specific to variants produced by samtools: http://samtools.sourceforge.net/samtools.shtml#6

The "INFO" column in the vcf file contains most of the information you are looking for. Below is the detailed description that explains each TAG. For allele count you need to retrieve information from DP4 tag. The sum of first 2 values will represent the reference allele counts and the last two values will represent the alternate allele counts (diploid organism).

##samtoolsVersion=0.1.18 (r982:295)
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description=&quot;Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
##INFO=<ID=PASS,Number=0,Type=Flag,Description="Variants that passed filtering (step1) at Samtool level">
##INFO=<ID=COMMON_PASS_FAIL,Number=0,Type=Flag,Description="Variants that passed filtering at Samtool level (step1) but failed at GATK level (step2)">
##INFO=<ID=COMMON_PASS_PASS,Number=0,Type=Flag,Description="Variants that passed filtering at Samtool level (step1)and GATK level (step2)">
##INFO=<ID=DUAL_VARIANT_PASS_FAIL,Number=0,Type=Flag,Description="Position called as both SNP and Indel and post-filtering eliminated one effect based on some criteria">
##INFO=<ID=DUAL_VARIANT_PASS_PASS,Number=0,Type=Flag,Description="Position called as both SNP and Indel and post-filtering approved both of them. Must be used with caution.">