Question: How To Get Consensus Sequence Including Variant From A Vcf File For Several Individuals...?
1
Bioch'Ti (France, Avignon) wrote 8.4 years ago:

Hi members,

I have a VCF file (from GATK) containing variants for a total of 20 individuals, and I'm wondering how to get a consensus sequence for each individual that reflects its own polymorphisms. Some individuals may be polymorphic at a particular position in a contig whereas others may not. I've checked the dedicated GATK tool (FastaAlternateReferenceMaker), but it doesn't solve my problem because it generates only one consensus. What I need is as many output files (each containing a consensus sequence) as there are mapped individuals.

Has any of you faced a similar problem?

Thanks for your reply, Best, C.

vcf consensus calling variant gatk • 8.3k views
modified 3.0 years ago by boczniak767 • written 8.4 years ago by Bioch'Ti

What do you mean by "consensus sequence"? How is it different from the REF/ALT columns? Can you show us a few rows of your VCF?

written 8.4 years ago by Pierre Lindenbaum

By consensus sequence I mean that, for each individual, I would like to obtain one sequence in which all of that individual's variant sites are incorporated into the reference sequence used for mapping. Maybe the use of 'consensus' is confusing.

Here is a subset of my VCF, showing the first two variant sites for 20 individuals:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SC1 SC10 SC2 SC3 SC4 SC5 SC6 SC7 SC8 SC9 SS1 SS10 SS2 SS3 SS4 SS5 SS6 SS7 SS8 SS9

Solyc00g005840.2.1 271 . C T 6263.99 PASS "AC=4;AF=0.100;AN=40;BaseQRankSum=-1.639;DP=2348;DS;Dels=0.00;FS=3.462;HRun=0;HaplotypeScore=4.1294;InbreedingCoeff=1.0000;MQ=58.30;MQ0=0;MQRankSum=2.601;QD=38.20;ReadPosRankSum=3.569" GT:AD:DP:GQ:PL 0/0:197,0:198:99:0,587,7661 1/1:5,78:83:57.67:2803,58,0 0/0:108,0:108:99:0,322,4191 0/0:132,0:132:99:0,394,5133 1/1:1,79:81:99:3088,205,0 0/0:99,0:100:99:0,295,3865 0/0:83,0:83:99:0,244,3182 0/0:51,0:51:99:0,150,1978 0/0:42,0:43:99:0,123,1643 0/0:91,0:91:99:0,268,3471 0/0:131,0:132:99:0,367,4864 0/0:178,0:178:99:0,533,7010 0/0:73,0:73:99:0,220,2908 0/0:112,0:112:99:0,337,4436 0/0:147,0:147:99:0,436,5690 0/0:154,0:154:99:0,460,6149 0/0:153,0:153:99:0,457,6074 0/0:221,0:221:99:0,656,8599 0/0:123,0:123:99:0,364,4832 0/0:84,0:85:99:0,253,3360

Solyc00g005840.2.1 429 . C G 2912.52 PASS "AC=1;AF=0.025;AN=40;BaseQRankSum=2.400;DP=1933;Dels=0.00;FS=0.812;HRun=0;HaplotypeScore=2.3271;InbreedingCoeff=0.9264;MQ=58.33;MQ0=0;MQRankSum=1.855;QD=29.72;ReadPosRankSum=-1.158" GT:AD:DP:GQ:PL 0/0:147,0:147:99:0,436,5701 0/0:59,0:59:99:0,175,2221 0/0:91,0:91:99:0,274,3628 0/0:99,0:99:99:0,292,3745 0/0:78,0:79:99:0,229,2947 0/0:80,0:80:99:0,238,3165 0/0:71,0:71:99:0,208,2759 0/0:55,0:55:99:0,166,2191 0/0:33,0:33:96.31:0,96,1283 0/0:60,0:60:99:0,181,2384 0/1:7,91:98:0.61:2963,8,0 0/0:144,0:144:99:0,427,5632 0/0:73,0:73:99:0,220,2897 0/0:91,0:91:99:0,271,3608 0/0:121,0:121:99:0,355,4620 0/0:130,0:130:99:0,385,5158 0/0:120,0:120:99:0,361,4826 0/0:172,0:172:99:0,509,6755 0/0:108,0:108:99:0,325,4326 0/0:102,0:102:99:0,307,4136
modified 8.4 years ago by Istvan Albert • written 8.4 years ago by Bioch'Ti
2
Bioch'Ti (France, Avignon) wrote 8.4 years ago:

Hi Guys,

An answer was proposed on the GATK Forum, here it is:

Use SelectVariants to get one VCF file per individual, then run FastaAlternateReferenceMaker on each of these per-individual VCF files to generate the sequences. I checked the result with a ClustalW alignment and found my SNPs at the right positions for each individual. The drawback is that the commands have to be repeated for every individual, which becomes tedious with a large number of them. Finally, merge the individual FASTA files as required by the downstream purpose (e.g. as input for other software).
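The repetition can be scripted. Below is a minimal sketch of such a loop, not the exact commands from the GATK forum: it assumes GATK4 command-line syntax (`gatk SelectVariants ... -sn`, `--exclude-non-variants`, `-O`), whereas the GATK version current when this thread was written used `java -jar GenomeAnalysisTK.jar -T ...` with `-o`. The file names, the function names and the `DRY_RUN` switch are hypothetical. By default the script only prints the commands it would run.

```shell
# Sketch: one SelectVariants + FastaAlternateReferenceMaker run per sample.
# Assumptions (not from the thread): GATK4 is on PATH as `gatk`; REF and VCF
# are placeholder file names; DRY_RUN=1 (the default) only prints commands.
REF="${REF:-reference.fa}"
VCF="${VCF:-all_samples.vcf}"
DRY_RUN="${DRY_RUN:-1}"

# Print a command line and execute it only when DRY_RUN=0.
run() {
    printf '%s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@" < /dev/null
    fi
}

# Sample names are the columns after FORMAT (column 10 onward) on the #CHROM line.
list_samples() {
    awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$1"
}

per_sample_fasta() {
    list_samples "$VCF" | while IFS= read -r sample; do
        # Keep only this sample and drop sites where it carries no ALT allele.
        run gatk SelectVariants -R "$REF" -V "$VCF" -sn "$sample" \
            --exclude-non-variants -O "$sample.vcf"
        # Write the reference with this sample's ALT alleles substituted in.
        run gatk FastaAlternateReferenceMaker -R "$REF" -V "$sample.vcf" \
            -O "$sample.fa"
    done
}
```

Running `per_sample_fasta` prints two commands per sample; with `DRY_RUN=0` they are executed as well. Excluding non-variant sites matters here because, as far as I know, FastaAlternateReferenceMaker substitutes the ALT allele at every site of the VCF it is given, regardless of genotype; for the same reason heterozygous sites end up with the ALT base.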

http://gatkforums.broadinstitute.org/discussion/1654/fastaalternatereferencemaker-for-several-individuals

Hope this helps! C.

modified 8.4 years ago • written 8.4 years ago by Bioch'Ti
0
Rubal7 wrote 8.4 years ago:

Do you know any programming? You could read through each row of the VCF file and take the allele that appears most frequently on each line. I don't currently know of any software that does this, but let me know if you find one; it would be interesting.
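A minimal sketch of that row-by-row idea, using awk from the shell. It assumes a tab-delimited, uncompressed VCF in which GT is the first FORMAT subfield; the function name `majority_allele` is hypothetical. Note that this yields a single majority allele per site across all samples, not one sequence per individual.

```shell
# majority_allele FILE: print CHROM, POS and the most frequent allele per record.
majority_allele() {
    awk -F'\t' '
        /^#/ { next }                            # skip header lines
        {
            n = split($4 "," $5, allele, ",")    # allele[1] = REF, allele[2..n] = ALT
            split("", count)                     # reset the counts for this record
            for (i = 10; i <= NF; i++) {
                split($i, field, ":")            # GT is assumed to be the first subfield
                m = split(field[1], gt, "[/|]")  # unphased (/) or phased (|) genotypes
                for (j = 1; j <= m; j++)
                    if (gt[j] != ".") count[gt[j] + 1]++
            }
            best = 1                             # ties are resolved in favour of REF
            for (k = 2; k <= n; k++)
                if (count[k] + 0 > count[best] + 0) best = k
            print $1 "\t" $2 "\t" allele[best]
        }' "$1"
}
```

For example, `majority_allele name.vcf` would print one line per variant site. Missing genotypes (`.`) are ignored in the count.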

written 8.4 years ago by Rubal7

I also posted the question on the GATK FAQ... I will let you know the answer! Yes, mastering a programming language definitely helps!

written 8.4 years ago by Bioch'Ti

Just got a reply from the GATK team saying that it is not possible using their dedicated tools... :-/

written 8.4 years ago by Bioch'Ti
0
boczniak767 (Poland) wrote 3.0 years ago:

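A late answer, but maybe it will be helpful for someone. This generates the sequence for one individual at a time.

First compress your VCF file (the -c flag makes bgzip write to standard output, so the redirection works and name.vcf is kept)

bgzip -c name.vcf > name.vcf.gz

Then make an index (required by vcf-consensus)

tabix -p vcf name.vcf.gz

Finally, for individual SC9 (-H 1 applies only the variants of the first haplotype)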

cat reference_genome.fa | vcf-consensus -s SC9 -H 1 name.vcf.gz > SC9_ref.fa
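To produce one file per individual, the last command can be repeated in a loop over the sample names in the VCF header. The following is a sketch under assumptions: the function names and the `DRY_RUN` switch are hypothetical, the file names are the ones used in this answer, and the sample names are read from the `#CHROM` header line with standard tools. By default it only prints the command for each sample.

```shell
# Sketch: repeat the vcf-consensus step for every sample in the VCF.
# Assumes vcf-consensus (VCFtools) is installed and the VCF is bgzipped and indexed.
REF="${REF:-reference_genome.fa}"
VCF_GZ="${VCF_GZ:-name.vcf.gz}"
DRY_RUN="${DRY_RUN:-1}"      # hypothetical switch: 1 = print only, 0 = also run

# Sample names are the columns after FORMAT (column 10 onward) on the #CHROM line.
list_samples_gz() {
    gzip -dc "$1" |
        awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

consensus_all() {
    list_samples_gz "$VCF_GZ" | while IFS= read -r sample; do
        printf '%s\n' "cat $REF | vcf-consensus -s $sample -H 1 $VCF_GZ > ${sample}_ref.fa"
        if [ "$DRY_RUN" = "0" ]; then
            # The reference is fed on standard input, as in the command above.
            vcf-consensus -s "$sample" -H 1 "$VCF_GZ" < "$REF" > "${sample}_ref.fa"
        fi
    done
}
```

Calling `consensus_all` lists the commands; setting `DRY_RUN=0` first would run them. With unphased genotypes the choice of haplotype 1 is arbitrary at heterozygous sites, so it may be worth checking whether vcf-consensus's IUPAC option (`-i`) suits the purpose better.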

written 3.0 years ago by boczniak767

It works well for extracting the consensus sequence.

written 7 weeks ago by aebrahimi580