Question: Count variants between two samples within a multi-sample VCF
0
gravatar for rc16955
2.3 years ago by
rc1695570
rc1695570 wrote:

Hi all,

I have a vcf file containing variant data for 52 samples.

  • sample 1
  • sample 2
  • sample 3
  • etc., etc.

What I would like to do is perform pairwise comparisons where I count the number of variants (SNPs and small INDELs) between each sample and each other sample.

  • number of variants between sample 1 and sample 2
  • number of variants between sample 1 and sample 3
  • and so on for every pairwise comparison possible.

I'm not looking to count the number of variants across all samples, nor the number of variants between each sample and the reference assembly, as I already have these.

I had been hoping that VCFTools would have a function for this, but from checking the manual, it seems not? If I have missed something in VCFTools, please let me know. Otherwise, I would really appreciate links to python, perl or bash scripts that can do what I need, or recommendations for other software that might help.

Many many thanks in advance.

vcftools vcf • 1.0k views
ADD COMMENTlink modified 2.3 years ago by chrchang5237.7k • written 2.3 years ago by rc1695570
3
gravatar for Pierre Lindenbaum
2.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum134k wrote:

using http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html:

for eac variant V a loop 'i' over sample1, another loop 'j' over sample 2. If this is the same genotype increase the hashmap 'count'. at the end, dump 'count.

java -jar dist/bioalcidaejdk.jar -e 'final Map<String,Long> count=new HashMap<>(); stream().forEach(V->{for(int i=0;i +1 < V.getNSamples();i++) for(int j=i+1;j < V.getNSamples();j++) {final Genotype gi= V.getGenotype(i);final Genotype gj= V.getGenotype(j); if(gi.sameGenotype(gj)) {String key=gi.getSampleName()+"_"+gj.getSampleName(); long n = 1L + (count.containsKey(key)?count.get(key):0L); count.put(key,n); };} });  count.forEach((K,V)->println(K+" "+V)); ' in.vcf
S1_S3 21
S2_S4 18
S1_S2 21
S2_S3 45
S3_S4 18
S1_S5 28
S4_S5 23
S1_S4 22
S2_S5 22
S3_S5 22
ADD COMMENTlink written 2.3 years ago by Pierre Lindenbaum134k

Thanks very much for the suggestion, I have given it a try and it seems to work well!

Can I just confirm that the number output for each pair is the number of positions where the two samples undergoing comparison have the same genotype, so that more similar samples have higher numbers in the output?

Thanks again!

ADD REPLYlink written 2.3 years ago by rc1695570

Can I just confirm that the number output for each pair is the number of positions where the two samples undergoing comparison have the same genotype, so that more similar samples have higher numbers in the output?

yes

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum134k

Hello rc16955 ,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLYlink written 2.3 years ago by finswimmer14k

Sorry to double post. You seem familiar with BioAlcidaeJdk, do you think it could be used for the next step in my analysis? I'd like to be able to see where in the genome variants within pairwise comparisons are falling. Would it be possible, using BioAlcidaeJdk, to get a list of each genomic position (i.e. scaffold + position on that scaffold) that is variant within a particular pairwise comparison? Something like

Chr   Pos   S1    S2

1     324   A/A   A/T

1     725   T/G   G/G

2     183    C/C  T/T

Obviously the exact format of the output wouldn't be important, as long as it contained the above information. I only include that to give an idea of what I mean

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by rc1695570

You seem familiar with BioAlcidaeJdk

a little bit

Would it be possible, using BioAlcidaeJdk, to get a list of each genomic position (i.e. scaffold + position on that scaffold) that is variant within a particular pairwise comparison?

bad example , none of your genotypes is the same for each variant.

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum134k
0
gravatar for chrchang523
2.3 years ago by
chrchang5237.7k
United States
chrchang5237.7k wrote:

With plink 1.9, plink --vcf my.vcf --genome full should provide the necessary counts; add the IBS0 and IBS1 columns to get the total number of differences between two samples.

ADD COMMENTlink written 2.3 years ago by chrchang5237.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2698 users visited in the last hour
_