Question: Concatenating alleles at detected sites to form fasta sequences from vcf files
Mohak wrote (2.2 years ago):

I have a multi-sample VCF file covering various samples of a bacterial genome, in the following format:

NC_num     20      .       A       T       66      .       DP=1850;VDB=0.015032;SGB=16.4938;RPB=0.0719749;MQB=0.782089;MQSB=0.998951;BQB=0.976407;MQ0F=0;AC=1;AN=32;DP4=1599,229,6,0;MQ=59 GT:PL   0:0,255 0:0,255 .:0,0   0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 1:126,17 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,63  0:0,255 0:0,255 0:0,255

NC_num  5232949 .   C   G,T 999 .   DP=3540;VDB=0.267565;SGB=410.948;RPB=0.979846;MQB=0.999905;MQSB=0.99953;BQB=0.999963;MQ0F=0;AC=24,5;AN=33;DP4=268,289,1378,1322;MQ=59   GT:PL   1:255,0,255 1:255,0,255 1:37,0,37   2:255,255,0 1:255,0,255 1:255,0,255 1:255,0,255 1:255,0,255 0:0,255,255 1:255,255,255   1:255,0,255 1:255,0,255 1:255,0,255 1:74,0,74   1:255,0,255 1:255,0,255 1:255,0,255 0:0,255,255 1:255,0,255 2:255,255,0 1:255,0,255 2:255,255,0 2:255,255,0 1:255,0,255 1:255,0,255 0:0,255,255 1:255,0,255 2:255,255,0 1:255,0,255 1:150,0,150 1:255,0,255 1:255,0,255 0:0,255,255

NC_num  5233099 .   C   T   212 .   DP=3744;VDB=0.000870234;SGB=21.4718;RPB=0.848604;MQB=0.000274995;MQSB=0.995943;BQB=0.953967;MQ0F=0;AC=1;AN=33;DP4=1869,1811,8,14;MQ=59  GT:PL   0:0,255 0:0,255 0:0,38  0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255  0:0,255    0:0,255 0:0,255 1:255,0 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255    0:0,255  0:0,255 0:0,255 0:0,255 0:0,255

I need to concatenate only the variant sites in the VCF file into one FASTA sequence per bacterial sample, simply stitching the shared SNP positions together, so that I can run an MSA downstream. I am stuck at concatenating the SNPs per sample. For example, in the first record above (position 20), REF=A and ALT=T, so some samples retain an A and others may carry a T. Since the genome is haploid, GT = 0 means that the sample has an A (same as REF) and GT = 1 means that the sample has a T. Similarly, PL 0,255 means the REF allele is more likely, and PL 126,17 means the ALT allele is more likely. Lastly, .:0,0 means no allele was detected at this position in that sample (AN=32, i.e. only 32 of the 33 samples have an allele call).
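To make that reading of the GT field concrete, here is a minimal Python sketch of how I understand one record should be decoded. The function name `decode_haploid_calls` and the shortened five-sample record are made up for illustration (they are not from my real file), and the sketch assumes haploid GT values and splits on any whitespace because the pasted records above lost their tabs.

```python
from collections import Counter


def decode_haploid_calls(vcf_line):
    """Return the base called in each sample for one VCF data line.

    Assumes haploid genotypes (GT is a single allele index or '.').
    """
    fields = vcf_line.split()
    ref, alt = fields[3], fields[4]
    alleles = [ref] + alt.split(",")          # index 0 = REF, 1.. = ALT alleles
    gt_index = fields[8].split(":").index("GT")
    bases = []
    for sample_field in fields[9:]:
        gt = sample_field.split(":")[gt_index]
        # '.' means no call in this sample; report it as N
        bases.append(alleles[int(gt)] if gt.isdigit() else "N")
    return bases


# Hypothetical, shortened record in the same layout as the first one above
record = "NC_num 20 . A T 66 . DP=1850;AC=1;AN=4 GT:PL 0:0,255 .:0,0 1:126,17 0:0,255 0:0,255"
counts = Counter(decode_haploid_calls(record))
print(counts)
```

Under this reading, the number of non-N calls should equal AN and the number of ALT calls should equal AC, which is a quick way to check the interpretation against a real record.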

To put it simply, of all the samples only one has a T, one has no allele detected, and the rest have an A. Am I correct? Is there a tool that does this per-sample concatenation for only these SNPs and writes the result in FASTA format? (I don't want to insert these SNPs into a reference sequence.) The output I am after looks like this:

Sample1 
ACC

Sample2
AGT

SampleN
TTC

etc...
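In case it helps to show what I mean, below is a rough Python sketch of the concatenation I am after. It is only a sketch under some assumptions: the function name `vcf_to_snp_fasta`, the sample names and the three-record example input are hypothetical, the file is tab-delimited with a `#CHROM` header line and unique sample names, genotypes are haploid, and missing calls are written as N. Sites where any allele is not a single A/C/G/T base are skipped so that every sample ends up with a sequence of the same length.

```python
BASES = {"A", "C", "G", "T"}


def vcf_to_snp_fasta(vcf_lines, missing="N"):
    """Concatenate haploid SNP calls per sample and return FASTA text."""
    sample_names = []
    sequences = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                              # skip blank and meta lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample_names = fields[9:]             # sample columns start at 10
            sequences = {name: [] for name in sample_names}
            continue
        if not sample_names:
            raise ValueError("#CHROM header line must precede data lines")
        alleles = [fields[3]] + fields[4].split(",")
        if any(allele not in BASES for allele in alleles):
            continue                              # not a plain SNP site
        gt_index = fields[8].split(":").index("GT")
        for name, sample_field in zip(sample_names, fields[9:]):
            gt = sample_field.split(":")[gt_index]
            base = alleles[int(gt)] if gt.isdigit() else missing
            sequences[name].append(base)
    return "".join(f">{name}\n{''.join(sequences[name])}\n" for name in sample_names)


# Hypothetical three-sample input modelled on the records above
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\tSample2\tSample3",
    "NC_num\t20\t.\tA\tT\t66\t.\tAC=1;AN=2\tGT:PL\t0:0,255\t.:0,0\t1:126,17",
    "NC_num\t5232949\t.\tC\tG,T\t999\t.\tAC=1,1;AN=3\tGT:PL\t0:0,255,255\t1:255,0,255\t2:255,255,0",
    "NC_num\t5233099\t.\tC\tT\t212\t.\tAC=1;AN=3\tGT:PL\t0:0,255\t1:255,0\t0:0,255",
]
fasta_text = vcf_to_snp_fasta(example)
print(fasta_text, end="")
```

Whether N is the right placeholder for a missing call probably depends on the downstream MSA or tree-building tool; some expect a gap character instead.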

Tags: snp, msa, alignment, sequence, vcf