Question

Help a SNP newbie out

5.0 years ago

sarah.goldstein • 0

Hi All,

I am new to SNP analysis. I have 12 genomes (putatively the same strain) that I have compared against a reference using Snippy (which calls SNPs using Freebayes). I only ran pairwise comparisons with Snippy, so I have 12 individual VCF files. I want to calculate Tajima's D across all of these genomes, so do I need to merge these VCFs using vcf-merge, or some other method?

So far I have indexed the 12 VCF files with tabix and merged them into one VCF, which appeared to run smoothly (I then sorted that VCF; did I need to do this?). However, when I run vcftools --TajimaD I get the following error:

Parameters as interpreted:

--gzvcf JKH266_merged_sorter.vcf.gz --out TajimaD --TajimaD 100000

Using zlib version: 1.2.11
Warning: Expected at least 2 parts in INFO entry: ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in FORMAT entry: ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
Warning: Expected at least 2 parts in INFO entry: ID=SF,Number=.,Type=String,Description="Source File (index to sourceFiles, f when filtered)">
After filtering, kept 12 out of 12 Individuals
Outputting Tajima's D Statistic...
    TajimaD: Only using fully diploid sites.
    TajimaD: Only using bialleleic sites.
After filtering, kept 1084 out of a possible 1084 Sites
Run Time = 0.00 seconds

It looks like the header of the VCF file is corrupted. Is there anything I can do to fix this? Should I be going about this experiment in a completely different way?

Thanks!

SNP genome sequencing software error Assembly • 1.6k views

score 1 · Answer 1 · 2019-04-19

5.0 years ago

Pierre Lindenbaum 161k

I'm pretty sure it's because your Description fields contain commas, and vcftools splits the INFO header string on commas: https://github.com/vcftools/vcftools/blob/d657d60e37f5d705f9dbb578b516db6e420fb424/src/cpp/header.cpp#L112

It's more of a bug in vcftools...
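
To illustrate the idea, here is a minimal Python sketch (not the vcftools code itself) that contrasts a plain comma split with a split that ignores commas inside double quotes. The function names are hypothetical and the header string is copied from the TYPE warning in the question; the sketch assumes quotes are never backslash-escaped inside the Description.

```python
def naive_split(meta: str) -> list[str]:
    """Split on every comma, including commas inside the quoted Description."""
    return meta.split(",")


def split_outside_quotes(meta: str) -> list[str]:
    """Split on commas that are not enclosed in double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in meta:
        if ch == '"':
            in_quotes = not in_quotes  # toggle on every double quote
            current.append(ch)
        elif ch == "," and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_meta(meta: str) -> dict[str, str]:
    """Turn 'ID=...,Number=...,...' into a dict, stripping surrounding quotes."""
    fields = {}
    for part in split_outside_quotes(meta):
        key, _, value = part.partition("=")
        fields[key] = value.strip('"')
    return fields


# Header content taken from the TYPE warning in the question.
header = (
    'ID=TYPE,Number=A,Type=String,'
    'Description="The type of allele, either snp, mnp, ins, del, or complex."'
)

naive = naive_split(header)
# Pieces without '=' cannot be read as key=value pairs.
broken = [p for p in naive if "=" not in p]
parsed = parse_meta(header)

print(len(naive), "pieces from the naive split;", len(broken), "lack '='")
print(parsed)
```

With the naive split, the Description is cut into fragments such as ` either snp` that contain no `=`, which would be consistent with the "Expected at least 2 parts" warnings. The quote-aware split keeps the Description whole and yields the four expected keys.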
