Question

Concatenating alleles at detected sites to form FASTA sequences from VCF files

6.9 years ago

Mohak ▴ 20

I have a multi-sample VCF file covering various samples of a bacterial genome, in the following format:

NC_num     20      .       A       T       66      .       DP=1850;VDB=0.015032;SGB=16.4938;RPB=0.0719749;MQB=0.782089;MQSB=0.998951;BQB=0.976407;MQ0F=0;AC=1;AN=32;DP4=1599,229,6,0;MQ=59 GT:PL   0:0,255 0:0,255 .:0,0   0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 1:126,17 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,63  0:0,255 0:0,255 0:0,255

NC_num  5232949 .   C   G,T 999 .   DP=3540;VDB=0.267565;SGB=410.948;RPB=0.979846;MQB=0.999905;MQSB=0.99953;BQB=0.999963;MQ0F=0;AC=24,5;AN=33;DP4=268,289,1378,1322;MQ=59   GT:PL   1:255,0,255 1:255,0,255 1:37,0,37   2:255,255,0 1:255,0,255 1:255,0,255 1:255,0,255 1:255,0,255 0:0,255,255 1:255,255,255   1:255,0,255 1:255,0,255 1:255,0,255 1:74,0,74   1:255,0,255 1:255,0,255 1:255,0,255 0:0,255,255 1:255,0,255 2:255,255,0 1:255,0,255 2:255,255,0 2:255,255,0 1:255,0,255 1:255,0,255 0:0,255,255 1:255,0,255 2:255,255,0 1:255,0,255 1:150,0,150 1:255,0,255 1:255,0,255 0:0,255,255

NC_num  5233099 .   C   T   212 .   DP=3744;VDB=0.000870234;SGB=21.4718;RPB=0.848604;MQB=0.000274995;MQSB=0.995943;BQB=0.953967;MQ0F=0;AC=1;AN=33;DP4=1869,1811,8,14;MQ=59  GT:PL   0:0,255 0:0,255 0:0,38  0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255  0:0,255    0:0,255 0:0,255 1:255,0 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255    0:0,255  0:0,255 0:0,255 0:0,255 0:0,255

I need to concatenate only the variant sites detected in the VCF file, per bacterial sample, into FASTA format, stitching all the common SNPs together so that I can run an MSA downstream. I am stuck at concatenating the SNPs per sample. For example, in the first record above (position 20), REF=A and ALT=T, so some samples retain the A and others may carry a T. Since the organism is haploid, GT=0 means the sample has an A (same as the reference) and GT=1 means the sample has a T. Similarly, a PL of 0,255 means the REF allele is more likely, and a PL of 126,17 means the ALT allele is more likely. Lastly, .:0,0 means no allele was detected at this position in that sample (AN=32, i.e. only 32 of the 33 samples have an allele call).

To put it simply, out of all the samples only one has a T, one has no allele detected, and the rest have an A. Am I correct? Is there a tool that does this per-sample concatenation for only these SNPs and writes the result in FASTA format (I don't want to incorporate these SNPs into a reference sequence)? The output I am after looks like this:

Sample1 
ACC

Sample2
AGT

SampleN
TTC

etc...
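To make the desired output concrete, here is a minimal Python sketch of the concatenation I have in mind. It is not an existing tool: the function name `vcf_to_snp_fasta`, the `N` placeholder for missing calls, and the three toy records and sample names are all made up for illustration. It assumes a tab-delimited VCF with a `#CHROM` header line, haploid genotypes (a single allele index or `.`), and it skips any record whose alleles are not single A/C/G/T bases.

```python
def vcf_to_snp_fasta(vcf_lines, missing="N"):
    """Return {sample: concatenated SNP alleles} from haploid VCF records.

    Assumptions (not taken from any real tool): tab-delimited input,
    haploid GT values such as "0", "1", "2" or ".", and `missing` is
    written wherever a sample has no usable call.
    """
    samples, columns = [], {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            columns = {name: [] for name in samples}
            continue
        alleles = [fields[3]] + fields[4].split(",")  # REF, then each ALT
        if any(a not in ("A", "C", "G", "T") for a in alleles):
            continue  # keep plain SNPs only (no indels, "*", or ".")
        gt_index = fields[8].split(":").index("GT")
        for name, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index]
            # A haploid GT is one allele index; anything else counts as missing.
            columns[name].append(alleles[int(gt)] if gt.isdigit() else missing)
    return {name: "".join(bases) for name, bases in columns.items()}


# Toy input: three hypothetical samples and three SNP records.
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "Sample1", "Sample2", "SampleN"]
records = [
    ["NC_num", "20", ".", "A", "T", "66", ".", ".", "GT", "0", "0", "1"],
    ["NC_num", "5232949", ".", "C", "G,T", "999", ".", ".", "GT", "0", "1", "2"],
    ["NC_num", "5233099", ".", "C", "T", "212", ".", ".", "GT", "0", "1", "0"],
]
toy_vcf = ["\t".join(header)] + ["\t".join(r) for r in records]

seqs = vcf_to_snp_fasta(toy_vcf)
for name, seq in seqs.items():
    print(f">{name}\n{seq}")
```

On a real file the same function could be fed an open file handle instead of the toy list. Whether a missing call should become `N`, `-`, or cause the whole site to be dropped is a choice I have not settled; it depends on what the downstream alignment or tree-building step expects.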

SNP MSA VCF sequence alignment • 1.4k views

updated 6.9 years ago by Devon Ryan 104k • written 6.9 years ago by Mohak ▴ 20