Help a SNP newbie out
24 months ago

Hi All,

I am new to SNP analysis. I have 12 genomes (putatively the same strain) that I have compared against a reference using Snippy (which calls SNPs using FreeBayes). I only ran pairwise comparisons with Snippy, so I have 12 individual VCF files. I want to calculate Tajima's D across all of these genomes. Do I need to merge these VCFs using vcf-merge, or is there some other method?

So far I have indexed the 12 VCF files with tabix and merged them into one VCF, then sorted that VCF (did I need to do this?). All of that appears to run smoothly. However, when I run vcftools --TajimaD I get the following error:

Parameters as interpreted:

--gzvcf JKH266_merged_sorter.vcf.gz --out TajimaD --TajimaD 100000

Using zlib version: 1.2.11
Warning: Expected at least 2 parts in INFO entry: ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in FORMAT entry: ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
Warning: Expected at least 2 parts in INFO entry: ID=SF,Number=.,Type=String,Description="Source File (index to sourceFiles, f when filtered)">
After filtering, kept 12 out of 12 Individuals
Outputting Tajima's D Statistic...
    TajimaD: Only using fully diploid sites.
    TajimaD: Only using bialleleic sites.
After filtering, kept 1084 out of a possible 1084 Sites
Run Time = 0.00 seconds

It looks like the header of the VCF file is corrupted. Is there anything I can do to fix this? Should I be going about this experiment in a completely different way?

Thanks!

SNP genome sequencing software error Assembly • 830 views
24 months ago

I'm pretty sure it's because your Description fields contain commas, and vcftools splits the INFO header string on commas: https://github.com/vcftools/vcftools/blob/d657d60e37f5d705f9dbb578b516db6e420fb424/src/cpp/header.cpp#L112
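
To illustrate the point, here is a minimal Python sketch of what goes wrong when a header entry is split on every comma and each piece is then expected to look like key=value. This is not vcftools' actual code; the function name naive_parts is made up for the example, and the header line is the TYPE entry from the warnings in the question.

```python
def naive_parts(header_line):
    """Split the <...> body of a VCF header line on every comma, ignoring quotes."""
    body = header_line[header_line.index("<") + 1 : header_line.rindex(">")]
    return body.split(",")


line = (
    '##INFO=<ID=TYPE,Number=A,Type=String,'
    'Description="The type of allele, either snp, mnp, ins, del, or complex.">'
)

parts = naive_parts(line)

# Pieces without "=" cannot be read as key=value, which matches the
# "Expected at least 2 parts" wording of the warning.
bad = [p for p in parts if "=" not in p]
print(bad)
```

The fragments that end up in bad all come from inside the quoted Description, so the header itself is probably fine and only the parsing is too simple.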

It's more of a bug in vcftools...
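
If the warnings bother you, one possible workaround is to rewrite the header so the quoted descriptions no longer contain commas. The sketch below is only an assumption about what would help, and I have not checked that vcftools stays silent afterwards. The function name strip_commas_in_quotes is hypothetical, and it should be run on the uncompressed VCF text line by line.

```python
import re

# Matches a double-quoted string that has no embedded quotes.
QUOTED = re.compile(r'"[^"]*"')


def strip_commas_in_quotes(line):
    """On ## header lines, replace commas inside quoted strings with semicolons.

    Data lines and the #CHROM line are returned unchanged, so ALT lists
    such as "C,T" are never touched.
    """
    if not line.startswith("##"):
        return line
    return QUOTED.sub(lambda m: m.group(0).replace(",", ";"), line)


header = (
    '##INFO=<ID=TYPE,Number=A,Type=String,'
    'Description="The type of allele, either snp, mnp, ins, del, or complex.">'
)
print(strip_commas_in_quotes(header))
```

Only the free-text descriptions change, so the genotype data and the meaning of the fields are left as they were.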
