How to get a consensus sequence including variants from a VCF file for several individuals?
9.6 years ago
Bioch'Ti ★ 1.0k

Hi members,

I have a VCF file (from GATK) containing variants for a total of 20 individuals, and I'm wondering how to get a consensus sequence for each individual that reflects its own polymorphisms. Some individuals may show a polymorphism at a particular position in a contig whereas others may not. I've checked the dedicated GATK tool (FastaAlternateReferenceMaker), but it doesn't answer my question because only one consensus is generated. What I need is as many output files (each containing a consensus sequence) as there are mapped individuals.

Has any of you faced a similar question?

Thanks for your reply, Best, C.

vcf consensus gatk variant calling • 11k views

What do you mean by "consensus sequence"? How is it different from the REF/ALT columns? Can you show us a few rows of your VCF?


By consensus sequence I mean one sequence per individual: the reference sequence used for mapping, with all of that individual's variant sites included. Maybe the use of 'consensus' is confusing.

Here is a subset of my VCF, showing the first two variant sites for 20 individuals:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SC1 SC10 SC2 SC3 SC4 SC5 SC6 SC7 SC8 SC9 SS1 SS10 SS2 SS3 SS4 SS5 SS6 SS7 SS8 SS9

Solyc00g005840.2.1 271 . C T 6263.99 PASS AC=4;AF=0.100;AN=40;BaseQRankSum=-1.639;DP=2348;DS;Dels=0.00;FS=3.462;HRun=0;HaplotypeScore=4.1294;InbreedingCoeff=1.0000;MQ=58.30;MQ0=0;MQRankSum=2.601;QD=38.20;ReadPosRankSum=3.569 GT:AD:DP:GQ:PL 0/0:197,0:198:99:0,587,7661 1/1:5,78:83:57.67:2803,58,0 0/0:108,0:108:99:0,322,4191 0/0:132,0:132:99:0,394,5133 1/1:1,79:81:99:3088,205,0 0/0:99,0:100:99:0,295,3865 0/0:83,0:83:99:0,244,3182 0/0:51,0:51:99:0,150,1978 0/0:42,0:43:99:0,123,1643 0/0:91,0:91:99:0,268,3471 0/0:131,0:132:99:0,367,4864 0/0:178,0:178:99:0,533,7010 0/0:73,0:73:99:0,220,2908 0/0:112,0:112:99:0,337,4436 0/0:147,0:147:99:0,436,5690 0/0:154,0:154:99:0,460,6149 0/0:153,0:153:99:0,457,6074 0/0:221,0:221:99:0,656,8599 0/0:123,0:123:99:0,364,4832 0/0:84,0:85:99:0,253,3360

Solyc00g005840.2.1 429 . C G 2912.52 PASS AC=1;AF=0.025;AN=40;BaseQRankSum=2.400;DP=1933;Dels=0.00;FS=0.812;HRun=0;HaplotypeScore=2.3271;InbreedingCoeff=0.9264;MQ=58.33;MQ0=0;MQRankSum=1.855;QD=29.72;ReadPosRankSum=-1.158 GT:AD:DP:GQ:PL 0/0:147,0:147:99:0,436,5701 0/0:59,0:59:99:0,175,2221 0/0:91,0:91:99:0,274,3628 0/0:99,0:99:99:0,292,3745 0/0:78,0:79:99:0,229,2947 0/0:80,0:80:99:0,238,3165 0/0:71,0:71:99:0,208,2759 0/0:55,0:55:99:0,166,2191 0/0:33,0:33:96.31:0,96,1283 0/0:60,0:60:99:0,181,2384 0/1:7,91:98:0.61:2963,8,0 0/0:144,0:144:99:0,427,5632 0/0:73,0:73:99:0,220,2897 0/0:91,0:91:99:0,271,3608 0/0:121,0:121:99:0,355,4620 0/0:130,0:130:99:0,385,5158 0/0:120,0:120:99:0,361,4826 0/0:172,0:172:99:0,509,6755 0/0:108,0:108:99:0,325,4326 0/0:102,0:102:99:0,307,4136
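
For readers who want to check at which sites a given individual actually differs from the reference, here is a minimal sketch (not part of the original post) that pulls one sample's genotype out of a VCF. It assumes a tab-delimited VCF with the usual #CHROM header line and GT as the first FORMAT key; the function name sample_genotypes is made up for illustration.

```shell
# Print CHROM, POS, REF, ALT and the genotype (GT) of one sample.
# Usage: sample_genotypes SC10 < name.vcf
sample_genotypes() {
    awk -F'\t' -v want="$1" '
    /^##/ { next }                      # skip meta-information lines
    /^#CHROM/ {                         # locate the column of the wanted sample
        for (i = 10; i <= NF; i++) if ($i == want) col = i
        next
    }
    col {                               # only runs once the sample was found
        split($col, f, ":")             # assumption: GT is the first FORMAT key
        print $1 "\t" $2 "\t" $4 "\t" $5 "\t" f[1]
    }'
}
```

On the two sites shown above, SC10 would come out as 1/1 at position 271 and 0/0 at position 429, so only the first site should change in that individual's sequence.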
9.6 years ago
Bioch'Ti ★ 1.0k

Hi Guys,

An answer was proposed on the GATK Forum, here it is:

Use SelectVariants to get one VCF file per individual, then run FastaAlternateReferenceMaker on each of these individual VCF files to generate the sequences. I checked with a ClustalW alignment and found my SNPs at the right positions for each individual. The drawback is that the commands have to be repeated for every individual, which may become tedious with a large number of them. Finally, merge the individual FASTA files as needed for your purpose (e.g. as software input).

http://gatkforums.broadinstitute.org/discussion/1654/fastaalternatereferencemaker-for-several-individuals
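
The repetition can be scripted. Below is a rough sketch of such a loop, not the exact commands from the GATK forum thread. It assumes the GATK4 command-line syntax (the `gatk` wrapper with `-sn`, `--exclude-non-variants` and `-O`), which is newer than the GATK version this thread was written for; the function names, `DRY_RUN` switch and output file names are made up for illustration. As far as I know, FastaAlternateReferenceMaker does not look at genotypes by default, so sites where the selected sample is homozygous reference are dropped first with `--exclude-non-variants`.

```shell
# Print a command (DRY_RUN=1, the default here) or actually run it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Usage: per_sample_fasta reference.fa cohort.vcf SC1 SC10 SC2 ...
per_sample_fasta() {
    ref_fa=$1
    cohort_vcf=$2
    shift 2
    for sample in "$@"; do
        # Step 1: keep only this sample and only the sites where it is variant.
        run gatk SelectVariants -R "$ref_fa" -V "$cohort_vcf" \
            -sn "$sample" --exclude-non-variants -O "$sample.vcf"
        # Step 2: write the reference with this sample's ALT alleles inserted.
        run gatk FastaAlternateReferenceMaker -R "$ref_fa" \
            -V "$sample.vcf" -O "$sample.fasta"
    done
}
```

Calling `per_sample_fasta reference.fa cohort.vcf SC1 SC9` prints the four commands for inspection; setting `DRY_RUN=0` runs them. Heterozygous sites deserve a second look, since a single FASTA sequence cannot carry both alleles.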

Hope this helps! C.

4.3 years ago
boczniak767 ▴ 830

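A late answer, but maybe it will be helpful for someone. This generates the sequence for one individual at a time.

First, compress your VCF file with bgzip (the -c flag writes to standard output, so the redirect works and the original file is kept):

bgzip -c name.vcf > name.vcf.gz

Then make an index (required by vcf-consensus):

tabix -p vcf name.vcf.gz

Finally, for the SC9 individual: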

cat reference_genome.fa | vcf-consensus -s SC9 -H 1 name.vcf.gz > SC9_ref.fa
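
To get one file per individual, the command above can be wrapped in a loop over the sample names in the VCF header. This is only a sketch under a few assumptions: the VCF is tab-delimited with a #CHROM header line, `gzip -dc` can read the bgzip-compressed file, and the function names and the `DRY_RUN` switch are invented for this example.

```shell
# Print the sample names found on the #CHROM header line of a VCF on stdin.
list_samples() {
    awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Usage: consensus_per_sample name.vcf.gz reference_genome.fa
# With DRY_RUN=1 (the default here) the commands are printed, not run.
consensus_per_sample() {
    vcf_gz=$1   # bgzip-compressed, tabix-indexed VCF
    ref_fa=$2   # reference FASTA used for mapping
    gzip -dc "$vcf_gz" | list_samples | while read -r sample; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "vcf-consensus -s $sample -H 1 $vcf_gz < $ref_fa > ${sample}_ref.fa"
        else
            # Same call as for SC9 above, with the reference on stdin.
            vcf-consensus -s "$sample" -H 1 "$vcf_gz" < "$ref_fa" > "${sample}_ref.fa"
        fi
    done
}
```

Note that `-H 1` applies only the first haplotype, so at heterozygous sites the second allele is not represented. I believe `bcftools consensus -f reference_genome.fa -s SC9 -H 1 name.vcf.gz` is the equivalent call in bcftools, but check the manual for your version.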


It works well for extracting the consensus sequence.


Thank you very much!

9.6 years ago
Rubal7 ▴ 820

Do you know any programming? You could read through each row of the VCF file and take the allele that appears most frequently on that row. I don't currently know of any software that does this, but let me know if you find one; it would be interesting.
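
A rough sketch of that idea with awk, for anyone curious. It assumes a tab-delimited VCF with GT as the first FORMAT key, counts alleles over all called genotypes on a row, and keeps REF on ties; the function name majority_alleles is invented here. Keep in mind that this gives one population-level allele per site, not one sequence per individual.

```shell
# Read a VCF on stdin; for each variant row print CHROM, POS and the
# allele carried by the largest number of called chromosomes.
majority_alleles() {
    awk -F'\t' '
    /^#/ { next }                            # skip all header lines
    {
        n = split($4 "," $5, allele, ",")    # allele[1] = REF, allele[2..n] = ALT
        for (i = 1; i <= n; i++) count[i] = 0
        for (s = 10; s <= NF; s++) {         # sample columns start at 10
            split($s, field, ":")            # assumption: GT comes first
            m = split(field[1], gt, "[/|]")  # unphased (/) or phased (|)
            for (k = 1; k <= m; k++)
                if (gt[k] != ".") count[gt[k] + 1]++
        }
        best = 1                             # REF wins ties
        for (i = 2; i <= n; i++) if (count[i] > count[best]) best = i
        print $1 "\t" $2 "\t" allele[best]
    }'
}
```

Usage would be `majority_alleles < name.vcf`; the resulting table could then be used to edit the reference sequence.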


I also posted the question on the GATK FAQ... will let you know the answer! Yes, mastering a programming language definitely helps!


Just got a reply from the GATK team saying that it is not possible using their dedicated tools... :-/
