Question: How To Find The Most Frequent Allel Snp In Vcf File
1
gravatar for Chen
5.4 years ago by
Chen890
Chen890 wrote:

When I use samtools mpileup, there are usually several allel SNP in ALT column according to REF column. just as follows:

##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
##reference=file:///gpfs/home/cxs1031/backup/ref/chr22.fa
##contig=<ID=chr22,length=51304566>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=IS,Number=2,Type=Float,Description="Maximum number of reads supporting an indel and fraction of indel reads">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=QBD,Number=1,Type=Float,Description="Quality by Depth: QUAL/#reads">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Read Position Bias">
##INFO=<ID=MDV,Number=1,Type=Integer,Description="Maximum number of high-quality nonRef reads in samples">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. Note: this version may be broken.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality non-reference bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    chr22_10x_sort.bam
chr22    16159158    .    A    G,C    123    .    DP=24;VDB=1.728111e-01;AF1=1;AC1=2;DP4=0,0,12,12;MQ=49;FQ=-96    GT:PL:DP:GQ    1/1:156,69,0,154,54,151:24:99

In the last row, A has 2 alts: G and C, how can I fingure out which alt is most frequent one?

vcf snp • 3.6k views
ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Chen890

samtools mpileup generates phased genotypes ('1|2') ? how did you get this ? Which version of samtools are you using ?

ADD REPLYlink written 5.4 years ago by Pierre Lindenbaum121k

Last time, I use the sample data in vcf user guider: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 , this time, I paste my own result.

ADD REPLYlink written 5.4 years ago by Chen890

And my running command is : samtools mpileup -uDf ref.fa xxx.bam> xxx.bcf; bcftools view -gv xxx.bcf >xxx.vcf

ADD REPLYlink written 5.4 years ago by Chen890
1
gravatar for Chris Miller
5.4 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

The allele frequency is in your INFO field. For the variant in line 3 (20 1110696 rs6040355 A G,T), there are frequencies of 0.333 and 0.6667 for the two alleles. I'm not sure why samtools doesn't provide that for the last line as well - perhaps it doesn't calculate this number for insertions and deletions?

ADD COMMENTlink written 5.4 years ago by Chris Miller21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1086 users visited in the last hour