Question: Help a SNP newbie out
sarah.goldstein wrote (5 months ago):

Hi All,

I am new to SNP analysis. I have 12 genomes (putatively the same strain) that I have compared against a reference using Snippy (which calls SNPs using Freebayes). I only ran pairwise comparisons with Snippy, so I have 12 individual VCF files. I want to calculate Tajima's D across all of these genomes. Do I need to merge these VCFs using vcf-merge, or is there some other method?

So far I have indexed the 12 VCF files with tabix and merged them into one VCF, which I then sorted (did I need to do this?). All of that appears to run smoothly. However, when I run vcftools --TajimaD I get the following warnings:

Parameters as interpreted:

--gzvcf JKH266_merged_sorter.vcf.gz --out TajimaD --TajimaD 100000

Using zlib version: 1.2.11
Warning: Expected at least 2 parts in INFO entry: ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in FORMAT entry: ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
Warning: Expected at least 2 parts in INFO entry: ID=SF,Number=.,Type=String,Description="Source File (index to sourceFiles, f when filtered)">
After filtering, kept 12 out of 12 Individuals
Outputting Tajima's D Statistic...
    TajimaD: Only using fully diploid sites.
    TajimaD: Only using bialleleic sites.
After filtering, kept 1084 out of a possible 1084 Sites
Run Time = 0.00 seconds

It looks like the header of the VCF file is corrupted. Is there anything I can do to fix this? Should I be going about this experiment in a completely different way?

Thanks!

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (5 months ago):

I'm pretty sure it's because your Description fields contain commas, and vcftools splits the INFO string on commas: https://github.com/vcftools/vcftools/blob/d657d60e37f5d705f9dbb578b516db6e420fb424/src/cpp/header.cpp#L112
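
To make the suspected mechanism concrete, here is a minimal Python sketch. It is not vcftools' actual code; the helper names (`entry_body`, `naive_split`, `quote_aware_split`) are made up for illustration. It splits the TYPE header line from the log once on every comma and once only on commas outside double quotes.

```python
# One of the header lines named in the warnings above.
header = ('##INFO=<ID=TYPE,Number=A,Type=String,'
          'Description="The type of allele, either snp, mnp, ins, del, or complex.">')

def entry_body(line):
    """Return the text between the first '<' and the last '>'."""
    return line[line.index('<') + 1:line.rindex('>')]

def naive_split(line):
    """Split on every comma, including those inside the quoted Description."""
    return entry_body(line).split(',')

def quote_aware_split(line):
    """Split only on commas that are outside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in entry_body(line):
        if ch == '"':
            in_quotes = not in_quotes
        if ch == ',' and not in_quotes:
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current))
    return parts

naive = naive_split(header)
aware = quote_aware_split(header)

# Fragments with no '=' cannot be read as key=value pairs.
fragments_without_key = [p for p in naive if '=' not in p]

print(len(naive), len(aware), len(fragments_without_key))
```

The naive split leaves five fragments with no key=value form for the TYPE line, which is consistent with the five TYPE warnings in the log. The AB, GL and SF descriptions each contain one comma and each appear once.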

It's more of a bug in vcftools...
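
The log above shows the run still completing and keeping all 1084 sites. If you want to silence the warnings anyway, one possible workaround is to replace the commas inside the quoted Description strings before passing the file to vcftools. This is a rough sketch rather than a tested pipeline: the function names are hypothetical, it assumes an uncompressed VCF read line by line, and it assumes the descriptions contain no escaped quotes.

```python
import re

# Matches one double-quoted string. Assumption: the quoted text holds no
# escaped quotes (\"), which is true for the header lines in the log above.
QUOTED = re.compile(r'"[^"]*"')

def strip_quoted_commas(line, replacement=";"):
    """Replace commas inside quoted strings on VCF meta lines ('##...')."""
    if not line.startswith("##"):
        return line  # column header and data lines are left untouched
    return QUOTED.sub(lambda m: m.group(0).replace(",", replacement), line)

def sanitize_vcf(lines):
    """Yield the lines of an uncompressed VCF with header descriptions cleaned."""
    for line in lines:
        yield strip_quoted_commas(line)

# Small in-memory example: one meta line, the column header, one data line.
example = [
    '##INFO=<ID=SF,Number=.,Type=String,Description="Source File (index to sourceFiles, f when filtered)">\n',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n',
    'chr1\t100\t.\tA\tG,T\t50\tPASS\tTYPE=snp,snp\n',
]
cleaned = list(sanitize_vcf(example))
```

Only lines starting with `##` are modified, so commas in the ALT and INFO columns of the data lines are preserved. For a real file you would decompress the merged VCF, pass the open file object to `sanitize_vcf`, write the result out, and then compress and index it again.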
