Vcf file with 3 samples and trying to count how many mutations for each sample
1
0
Entering edit mode
3.7 years ago
kadamek49 • 0

Hello,

I have a vcf file and want to find the counts for how many variants are occurring for each of the 3 samples independently. I have a vcf that shows the different variants between the samples (ex. 0/1 0/0 1/1 vs 0/1 0/1 0/1). I have tried some of the code and help from other posts but nothing is properly working. I have tried the codes below and they did not work or were incorrectly counting.

Any help would be greatly appreciated. Thank you!

$ java -jar dist/bioalcidaejdk.jar  -e 'stream().flatMap(V->V.getGenotypes().stream().filter(G->G.isCalled()&&!G.isHomRef()).map(G->V.getContig()+"\t"+G.getSampleName())).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((K,C)->println(K+":"+C));'  example.vcf

$ java -jar dist/bioalcidaejdk.jar -e 'final Map<String,Long> count=new HashMap<>(); stream().forEach(V->{for(int i=0;i +1 < V.getNSamples();i++) for(int j=i+1;j < V.getNSamples();j++) {final Genotype gi= V.getGenotype(i);final Genotype gj= V.getGenotype(j); if(gi.sameGenotype(gj)) {String key=gi.getSampleName()+"_"+gj.getSampleName(); long n = 1L + (count.containsKey(key)?count.get(key):0L); count.put(key,n); };} });  count.forEach((K,V)->println(K+" "+V)); ' example.vcf

Below are 2 examples from the codes used above. The first code didn't seem to work as I went through a .hmp file of the variants on excel and manually went through a couple of chromosomes counting for a sample and the numbers didn't add up for that chromosome. The second code gave back counts that were the same for all 3 samples and I know that isn't correct.

[DEBUG][BioAlcidaeJdk] Compiling :
         1  import java.util.*;
         2  import java.util.stream.*;
         3  import java.util.regex.*;
         4  import java.util.function.*;
         5  import htsjdk.samtools.*;
         6  import htsjdk.samtools.util.*;
         7  import htsjdk.variant.variantcontext.*;
         8  import htsjdk.variant.vcf.*;
         9  import com.github.lindenb.jvarkit.pedigree.*;
        10  import com.github.lindenb.jvarkit.util.bio.fasta.FastaSequence;
        11  import com.github.lindenb.jvarkit.math.RangeOfIntegers;
        12  import com.github.lindenb.jvarkit.math.RangeOfDoubles;
        13  import com.github.lindenb.jvarkit.util.Counter;
        14  /** begin user's packages */
        15  /** end user's packages */
        16  @javax.annotation.processing.Generated(value="BioAlcidaeJdk",date="2020-07-13T19:35:26-0400")
        17  public class BioAlcidaeJdkCustom190253546 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.VcfHandler {
        18    public BioAlcidaeJdkCustom190253546() {
        19    }
        20    @Override
        21    public void execute() throws Exception {
        22     // user's code starts here
        23     stream().flatMap(V->V.getGenotypes().stream().filter(G->G.isCalled()&&!G.isHomRef()).map(G->V.getContig()+"\t"+G.getSampleName())).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((K,C)->println(K+":"+C));
        24      //user's code ends here
        25     }
        26  }
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by htsjdk.samtools.util.CloserUtil (file:/home/kris/jvarkit/dist/bioalcidaejdk.jar) to method com.sun.xml.internal.stream.XMLEventReaderImpl.close()
WARNING: Please consider reporting this to the maintainers of htsjdk.samtools.util.CloserUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[WARN][SnpEffPredictionParser]no INFO[EFF] or no description. This VCF was probably NOT annotated with SnpEff (old version). But it's not a problem if this tool doesn't need to access SnpEff Annotations.
[WARN][VepPredictionParser]NO INFO[CSQ] found in header. This VCF was probably NOT annotated with VEP. But it's not a problem if this tool doesn't need to access VEP Annotations.
[WARN][AnnPredictionParser]no INFO[ANN] or no description This VCF was probably NOT annotated with SnpEff(ANN version) . But it's not a problem if this tool doesn't need to access SnpEff Annotations.
NC_044378.1 Can_Top.sort:280
NC_044374.1 Can_Bottom.sort:532
NC_044375.1 Can_Mid.sort:406
NW_022060378.1  Can_Top.sort:2
NW_022060402.1  Can_Bottom.sort:1
NW_022060485.1  Can_Top.sort:1
NC_044375.1 Can_Bottom.sort:410
NW_022060494.1  Can_Top.sort:1
NW_022060455.1  Can_Bottom.sort:1
NC_044378.1 Can_Mid.sort:307

Second code example:

[DEBUG][BioAlcidaeJdk] Compiling :
         1  import java.util.*;
         2  import java.util.stream.*;
         3  import java.util.regex.*;
         4  import java.util.function.*;
         5  import htsjdk.samtools.*;
         6  import htsjdk.samtools.util.*;
         7  import htsjdk.variant.variantcontext.*;
         8  import htsjdk.variant.vcf.*;
         9  import com.github.lindenb.jvarkit.pedigree.*;
        10  import com.github.lindenb.jvarkit.util.bio.fasta.FastaSequence;
        11  import com.github.lindenb.jvarkit.math.RangeOfIntegers;
        12  import com.github.lindenb.jvarkit.math.RangeOfDoubles;
        13  import com.github.lindenb.jvarkit.util.Counter;
        14  /** begin user's packages */
        15  /** end user's packages */
        16  @javax.annotation.processing.Generated(value="BioAlcidaeJdk",date="2020-07-13T19:16:03-0400")
        17  public class BioAlcidaeJdkCustom2039591783 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.VcfHandler {
        18    public BioAlcidaeJdkCustom2039591783() {
        19    }
        20    @Override
        21    public void execute() throws Exception {
        22     // user's code starts here
        23     final Map<String,Long> count=new HashMap<>(); stream().forEach(V->{for(int i=0;i +1 < V.getNSamples();i++) for(int j=i+1;j < V.getNSamples();j++) {final Genotype gi= V.getGenotype(i);final Genotype gj= V.getGenotype(j); if(gi.sameGenotype(gj)) {String key=gi.getSampleName()+"_"+gj.getSampleName(); long n = 1L + (count.containsKey(key)?count.get(key):0L); count.put(key,n); };} });  count.forEach((K,V)->println(K+" "+V));
        24      //user's code ends here
        25     }
        26  }
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by htsjdk.samtools.util.CloserUtil (file:/home/kris/jvarkit/dist/bioalcidaejdk.jar) to method com.sun.xml.internal.stream.XMLEventReaderImpl.close()
WARNING: Please consider reporting this to the maintainers of htsjdk.samtools.util.CloserUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[WARN][SnpEffPredictionParser]no INFO[EFF] or no description. This VCF was probably NOT annotated with SnpEff (old version). But it's not a problem if this tool doesn't need to access SnpEff Annotations.
[WARN][VepPredictionParser]NO INFO[CSQ] found in header. This VCF was probably NOT annotated with VEP. But it's not a problem if this tool doesn't need to access VEP Annotations.
[WARN][AnnPredictionParser]no INFO[ANN] or no description This VCF was probably NOT annotated with SnpEff(ANN version) . But it's not a problem if this tool doesn't need to access SnpEff Annotations.
Can_Bottom.sort_Can_Mid.sort 2563
Can_Mid.sort_Can_Top.sort 2563
Can_Bottom.sort_Can_Top.sort 2563
vcf bcftools snpEff • 1.1k views
ADD COMMENT
1
Entering edit mode

the first code counts the genotypes carrying a ALT variant per chromosome per sample

the second code count the nomber of time the genotype for a sample is the same that for another sample.

ADD REPLY
2
Entering edit mode
3.6 years ago

When dealing with vcf files, bcftools should be the first thing to try, so here's how I'd do it:

1) bgzip file, 2) tabix index it, 3) loop through samples, 4) extract each sample's private variants, 5) count them

bgzip -f example.vcf; tabix -fp vcf example.vcf.gz
for sample in `bcftools query -l example.vcf.gz`; do
 echo -n "$sample "
 bcftools view -c1 -s $sample example.vcf.gz | grep -cv ^#
done

Having said that, you could try coding it yourself, so here's how I'd do it:

1) loop through the 3 vcf sample columns, 2) extract sample name, 3) count all non-0-second-gt-allele variants

for i in 10 11 12; do
 sample=$(grep -m1 ^#CHROM example.vcf | cut -f$i)
 echo -n "$sample "
 grep -v ^# example.vcf | cut -f$i | grep -cP "^\d[/|][1-9]"
done

If you don't know the number of samples in advance, you could go for a more generic for loop:

for i in $(seq 10 `grep -m1 ^#CHROM example.vcf | awk '{print NF}'`); do
ADD COMMENT

Login before adding your answer.

Traffic: 2826 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6