Question: How To Identify Substitutions Between Populations In A Vcf
3
gravatar for Rubal
7.1 years ago by
Rubal210
Germany
Rubal210 wrote:

Hi all,

I'm a PhD student working with whole genome data from multiple individuals from 3 populations. All the sequencing is low-coverage and combined in a vcf file. I have also created population specific vcf files for the 3 popns using GATK select variants.

I would like to compare 2 of the populations to create a file that contains only substitutions (fixed differences) between the populations. So where they are homozygous for different alleles. I have looked for a way to do this in GATK and vcf tools but found none. Unfotuntately nobody in my group has experience doing this sort of thing. If anyone can recommend a program that can do this kind of thing or can write a basic script that would be really helpful, I have only basic scripting experience in python.

It seems to me that custom scripting is the best way to go to do population genetic types of analyses on whole genome data, I haven't found any good packages, but I'll explore biopython in more detail.

Some specific details that might help:

--all samples from each population share same name in sample id, ie. Ta1, Ta2, Ta3 in one pop and Ag1, Ag2, Ag3 in another pop.

--I have low coverage so would be good to have a filter that only calls a variant as a substitution if there is data from atleast 10 samples from each population support this (I have 20 samples in each population). (so not calling a Substitution if only 9 individuals have reads at this site and also not calling it if 15 samples support it but one individual has a different variant-i'd rather exclude these false negatives for now to keep false positive rate low too)

Please let me know if this question is badly phrased so I can improve it.

Any help at all is greatly appreciated!

Many thanks,

Rubal

edit:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Agg50 Agg51 Agg52 Agg53 Agg54 Agg55 Agg56 Agg57 Agg59 Agg61 Agg62 Agg63 Agg64 Agg65 Agg66 Agg70 Agg73 Agg77 Agg78 Agg79 PhiX Tar11 Tar12 Tar13 Tar14 Tar15 Tar16 Tar17 Tar18 Tar19 Tar20 Tar21 Tar22 Tar23 Tar24 Tar25 Tar26 Tar27 Tar28 Tar39 Tar40 Will41 Will42 Will43 Will44 Will45 Will46 Will47 Will48 Will49 Will50 conflict unknown wrong
chr10 48 . G T 432.37 PASS AC=30;AF=1.000;AN=30;DP=16;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.1893;MQ=38.84;MQ0=0;QD=27.02 GT:AD:DP:GQ:PL ./. ./. 1/1:0,2:2:5.99:53,6,0 1/1:0,1:1:3:28,3,0
vcf genome population allele • 2.3k views
ADD COMMENTlink modified 7.1 years ago by Pierre Lindenbaum118k • written 7.1 years ago by Rubal210

can you please provide an example of input please ? some lines of your VCF matching your criteria ?

ADD REPLYlink written 7.1 years ago by Pierre Lindenbaum118k

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Agg50 Agg51 Agg52 Agg53 Agg54 Agg55 Agg56 Agg57 Agg59 Agg61 Agg62 Agg63 Agg64 Agg65 Agg66 Agg70 Agg73 Agg77 Agg78 Agg79 PhiX Tar11 Tar12 Tar13 Tar14 Tar15 Tar16 Tar17 Tar18 Tar19 Tar20 Tar21 Tar22 Tar23 Tar24 Tar25 Tar26 Tar27 Tar28 Tar39 Tar40 Will41 Will42 Will43 Will44 Will45 Will46 Will47 Will48 Will49 Will50 conflict unknown wrong

chr10 48 . G T 432.37 PASS AC=30;AF=1.000;AN=30;DP=16;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.1893;MQ=38.84;MQ0=0;QD=27.02 GT:AD:DP:GQ:PL ./. ./. 1/1:0,2:2:5.99:53,6,0 1/1:0,1:1:3:28,3,0 ./

ADD REPLYlink written 7.1 years ago by Rubal210

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Agg50 Agg51 Agg52 Agg53 Agg54 Agg55 Agg56 Agg57 Agg59 Agg61 Agg62 Agg63 Agg64 Agg65 Agg66 Agg70 Agg73 Agg77 Agg78 Agg79 PhiX Tar11 Tar12 Tar13 Tar14 Tar15 Tar16 Tar17 Tar18 Tar19 Tar20 Tar21 Tar22 Tar23 Tar24 Tar25 Tar26 Tar27 Tar28 Tar39 Tar40 Will41 Will42 Will43 Will44 Will45 Will46 Will47 Will48 Will49 Will50 conflict unknown wrong

chr10 48 . G T 432.37 PASS AC=30;AF=1.000;AN=30;DP=16;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.1893;MQ=38.84;MQ0=0;QD=27.02 GT:AD:DP:GQ:PL ./. ./. 1/1:0,2:2:5.99:53,6,0 1/1:0,1:1:3:28,3,0

ADD REPLYlink written 7.1 years ago by Rubal210

That's a few lines from the vcf lsiting sample names (3 popns, Agg, Tar and Will). But I don;t have example lines where there are substitutions, that's what I want to find. Let me know if you wanted a different part of the input. Thanks for the interest.

ADD REPLYlink written 7.1 years ago by Rubal210

thanks Pierre Lindenbaum!

ADD REPLYlink written 7.1 years ago by Rubal210

Thanks for editing input lines Pierre Lindenbaum!

ADD REPLYlink written 7.1 years ago by Rubal210

Hi all, I'm trying Snpsift, related to the snpEFF package, but not sure if it's exactly what I need and any suggestions still very welcome

ADD REPLYlink written 7.1 years ago by Rubal210
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2520 users visited in the last hour