Count variants between two samples within a multi-sample VCF
2
0
Entering edit mode
3.6 years ago
rc16955 ▴ 80

Hi all,

I have a vcf file containing variant data for 52 samples.

• sample 1
• sample 2
• sample 3
• etc., etc.

What I would like to do is perform pairwise comparisons where I count the number of variants (SNPs and small INDELs) between each sample and each other sample.

• number of variants between sample 1 and sample 2
• number of variants between sample 1 and sample 3
• and so on for every pairwise comparison possible.

I'm not looking to count the number of variants across all samples, nor the number of variants between each sample and the reference assembly, as I already have these.

I had been hoping that VCFTools would have a function for this, but from checking the manual, it seems not? If I have missed something in VCFTools, please let me know. Otherwise, I would really appreciate links to python, perl or bash scripts that can do what I need, or recommendations for other software that might help.

vcf VCFTools • 2.1k views
3
Entering edit mode
3.6 years ago

for eac variant V a loop 'i' over sample1, another loop 'j' over sample 2. If this is the same genotype increase the hashmap 'count'. at the end, dump 'count.

java -jar dist/bioalcidaejdk.jar -e 'final Map<String,Long> count=new HashMap<>(); stream().forEach(V->{for(int i=0;i +1 < V.getNSamples();i++) for(int j=i+1;j < V.getNSamples();j++) {final Genotype gi= V.getGenotype(i);final Genotype gj= V.getGenotype(j); if(gi.sameGenotype(gj)) {String key=gi.getSampleName()+"_"+gj.getSampleName(); long n = 1L + (count.containsKey(key)?count.get(key):0L); count.put(key,n); };} });  count.forEach((K,V)->println(K+" "+V)); ' in.vcf
S1_S3 21
S2_S4 18
S1_S2 21
S2_S3 45
S3_S4 18
S1_S5 28
S4_S5 23
S1_S4 22
S2_S5 22
S3_S5 22

0
Entering edit mode

Thanks very much for the suggestion, I have given it a try and it seems to work well!

Can I just confirm that the number output for each pair is the number of positions where the two samples undergoing comparison have the same genotype, so that more similar samples have higher numbers in the output?

Thanks again!

0
Entering edit mode

Can I just confirm that the number output for each pair is the number of positions where the two samples undergoing comparison have the same genotype, so that more similar samples have higher numbers in the output?

yes

0
Entering edit mode

Hello rc16955 ,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

0
Entering edit mode

Sorry to double post. You seem familiar with BioAlcidaeJdk, do you think it could be used for the next step in my analysis? I'd like to be able to see where in the genome variants within pairwise comparisons are falling. Would it be possible, using BioAlcidaeJdk, to get a list of each genomic position (i.e. scaffold + position on that scaffold) that is variant within a particular pairwise comparison? Something like

Chr   Pos   S1    S2

1     324   A/A   A/T

1     725   T/G   G/G

2     183    C/C  T/T


Obviously the exact format of the output wouldn't be important, as long as it contained the above information. I only include that to give an idea of what I mean

0
Entering edit mode

You seem familiar with BioAlcidaeJdk

a little bit

Would it be possible, using BioAlcidaeJdk, to get a list of each genomic position (i.e. scaffold + position on that scaffold) that is variant within a particular pairwise comparison?

bad example , none of your genotypes is the same for each variant.

0
Entering edit mode
3.6 years ago

With plink 1.9, plink --vcf my.vcf --genome full should provide the necessary counts; add the IBS0 and IBS1 columns to get the total number of differences between two samples.