Question: Count variants between two samples within a multi-sample VCF
0
gravatar for rc16955
3 months ago by
rc1695550
rc1695550 wrote:

Hi all,

I have a vcf file containing variant data for 52 samples.

  • sample 1
  • sample 2
  • sample 3
  • etc., etc.

What I would like to do is perform pairwise comparisons where I count the number of variants (SNPs and small INDELs) between each sample and each other sample.

  • number of variants between sample 1 and sample 2
  • number of variants between sample 1 and sample 3
  • and so on for every pairwise comparison possible.

I'm not looking to count the number of variants across all samples, nor the number of variants between each sample and the reference assembly, as I already have these.

I had been hoping that VCFTools would have a function for this, but from checking the manual, it seems not? If I have missed something in VCFTools, please let me know. Otherwise, I would really appreciate links to python, perl or bash scripts that can do what I need, or recommendations for other software that might help.

Many many thanks in advance.

vcftools vcf • 189 views
ADD COMMENTlink modified 3 months ago by chrchang5234.8k • written 3 months ago by rc1695550
3
gravatar for Pierre Lindenbaum
3 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum117k wrote:

using http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html:

for eac variant V a loop 'i' over sample1, another loop 'j' over sample 2. If this is the same genotype increase the hashmap 'count'. at the end, dump 'count.

java -jar dist/bioalcidaejdk.jar -e 'final Map<String,Long> count=new HashMap<>(); stream().forEach(V->{for(int i=0;i +1 < V.getNSamples();i++) for(int j=i+1;j < V.getNSamples();j++) {final Genotype gi= V.getGenotype(i);final Genotype gj= V.getGenotype(j); if(gi.sameGenotype(gj)) {String key=gi.getSampleName()+"_"+gj.getSampleName(); long n = 1L + (count.containsKey(key)?count.get(key):0L); count.put(key,n); };} });  count.forEach((K,V)->println(K+" "+V)); ' in.vcf
S1_S3 21
S2_S4 18
S1_S2 21
S2_S3 45
S3_S4 18
S1_S5 28
S4_S5 23
S1_S4 22
S2_S5 22
S3_S5 22
ADD COMMENTlink written 3 months ago by Pierre Lindenbaum117k

Thanks very much for the suggestion, I have given it a try and it seems to work well!

Can I just confirm that the number output for each pair is the number of positions where the two samples undergoing comparison have the same genotype, so that more similar samples have higher numbers in the output?

Thanks again!

ADD REPLYlink written 3 months ago by rc1695550

Can I just confirm that the number output for each pair is the number of positions where the two samples undergoing comparison have the same genotype, so that more similar samples have higher numbers in the output?

yes

ADD REPLYlink written 3 months ago by Pierre Lindenbaum117k

Hello rc16955 ,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLYlink written 3 months ago by finswimmer11k

Sorry to double post. You seem familiar with BioAlcidaeJdk, do you think it could be used for the next step in my analysis? I'd like to be able to see where in the genome variants within pairwise comparisons are falling. Would it be possible, using BioAlcidaeJdk, to get a list of each genomic position (i.e. scaffold + position on that scaffold) that is variant within a particular pairwise comparison? Something like

Chr   Pos   S1    S2

1     324   A/A   A/T

1     725   T/G   G/G

2     183    C/C  T/T

Obviously the exact format of the output wouldn't be important, as long as it contained the above information. I only include that to give an idea of what I mean

ADD REPLYlink modified 3 months ago • written 3 months ago by rc1695550

You seem familiar with BioAlcidaeJdk

a little bit

Would it be possible, using BioAlcidaeJdk, to get a list of each genomic position (i.e. scaffold + position on that scaffold) that is variant within a particular pairwise comparison?

bad example , none of your genotypes is the same for each variant.

ADD REPLYlink written 3 months ago by Pierre Lindenbaum117k
0
gravatar for chrchang523
3 months ago by
chrchang5234.8k
United States
chrchang5234.8k wrote:

With plink 1.9, plink --vcf my.vcf --genome full should provide the necessary counts; add the IBS0 and IBS1 columns to get the total number of differences between two samples.

ADD COMMENTlink written 3 months ago by chrchang5234.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2524 users visited in the last hour