How To Identify Substitutions Between Populations In A Vcf
0
3
Entering edit mode
12.2 years ago
Rubal ▴ 350

Hi all,

I'm a PhD student working with whole genome data from multiple individuals from 3 populations. All the sequencing is low-coverage and combined in a vcf file. I have also created population specific vcf files for the 3 popns using GATK select variants.

I would like to compare 2 of the populations to create a file that contains only substitutions (fixed differences) between the populations. So where they are homozygous for different alleles. I have looked for a way to do this in GATK and vcf tools but found none. Unfotuntately nobody in my group has experience doing this sort of thing. If anyone can recommend a program that can do this kind of thing or can write a basic script that would be really helpful, I have only basic scripting experience in python.

It seems to me that custom scripting is the best way to go to do population genetic types of analyses on whole genome data, I haven't found any good packages, but I'll explore biopython in more detail.

Some specific details that might help:

--all samples from each population share same name in sample id, ie. Ta1, Ta2, Ta3 in one pop and Ag1, Ag2, Ag3 in another pop.

--I have low coverage so would be good to have a filter that only calls a variant as a substitution if there is data from atleast 10 samples from each population support this (I have 20 samples in each population). (so not calling a Substitution if only 9 individuals have reads at this site and also not calling it if 15 samples support it but one individual has a different variant-i'd rather exclude these false negatives for now to keep false positive rate low too)

Please let me know if this question is badly phrased so I can improve it.

Any help at all is greatly appreciated!

Many thanks,

Rubal

edit:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Agg50 Agg51 Agg52 Agg53 Agg54 Agg55 Agg56 Agg57 Agg59 Agg61 Agg62 Agg63 Agg64 Agg65 Agg66 Agg70 Agg73 Agg77 Agg78 Agg79 PhiX Tar11 Tar12 Tar13 Tar14 Tar15 Tar16 Tar17 Tar18 Tar19 Tar20 Tar21 Tar22 Tar23 Tar24 Tar25 Tar26 Tar27 Tar28 Tar39 Tar40 Will41 Will42 Will43 Will44 Will45 Will46 Will47 Will48 Will49 Will50 conflict unknown wrong
chr10 48 . G T 432.37 PASS AC=30;AF=1.000;AN=30;DP=16;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.1893;MQ=38.84;MQ0=0;QD=27.02 GT:AD:DP:GQ:PL ./. ./. 1/1:0,2:2:5.99:53,6,0 1/1:0,1:1:3:28,3,0
vcf population genome allele • 3.7k views
ADD COMMENT
0
Entering edit mode

can you please provide an example of input please ? some lines of your VCF matching your criteria ?

ADD REPLY
0
Entering edit mode

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Agg50 Agg51 Agg52 Agg53 Agg54 Agg55 Agg56 Agg57 Agg59 Agg61 Agg62 Agg63 Agg64 Agg65 Agg66 Agg70 Agg73 Agg77 Agg78 Agg79 PhiX Tar11 Tar12 Tar13 Tar14 Tar15 Tar16 Tar17 Tar18 Tar19 Tar20 Tar21 Tar22 Tar23 Tar24 Tar25 Tar26 Tar27 Tar28 Tar39 Tar40 Will41 Will42 Will43 Will44 Will45 Will46 Will47 Will48 Will49 Will50 conflict unknown wrong

chr10 48 . G T 432.37 PASS AC=30;AF=1.000;AN=30;DP=16;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.1893;MQ=38.84;MQ0=0;QD=27.02 GT:AD:DP:GQ:PL ./. ./. 1/1:0,2:2:5.99:53,6,0 1/1:0,1:1:3:28,3,0 ./

ADD REPLY
0
Entering edit mode

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Agg50 Agg51 Agg52 Agg53 Agg54 Agg55 Agg56 Agg57 Agg59 Agg61 Agg62 Agg63 Agg64 Agg65 Agg66 Agg70 Agg73 Agg77 Agg78 Agg79 PhiX Tar11 Tar12 Tar13 Tar14 Tar15 Tar16 Tar17 Tar18 Tar19 Tar20 Tar21 Tar22 Tar23 Tar24 Tar25 Tar26 Tar27 Tar28 Tar39 Tar40 Will41 Will42 Will43 Will44 Will45 Will46 Will47 Will48 Will49 Will50 conflict unknown wrong

chr10 48 . G T 432.37 PASS AC=30;AF=1.000;AN=30;DP=16;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.1893;MQ=38.84;MQ0=0;QD=27.02 GT:AD:DP:GQ:PL ./. ./. 1/1:0,2:2:5.99:53,6,0 1/1:0,1:1:3:28,3,0

ADD REPLY
0
Entering edit mode

That's a few lines from the vcf lsiting sample names (3 popns, Agg, Tar and Will). But I don;t have example lines where there are substitutions, that's what I want to find. Let me know if you wanted a different part of the input. Thanks for the interest.

ADD REPLY
0
Entering edit mode

thanks Pierre Lindenbaum!

ADD REPLY
0
Entering edit mode

Thanks for editing input lines Pierre Lindenbaum!

ADD REPLY
0
Entering edit mode

Hi all, I'm trying Snpsift, related to the snpEFF package, but not sure if it's exactly what I need and any suggestions still very welcome

ADD REPLY

Login before adding your answer.

Traffic: 1052 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6