Question: filtering SNPs by samples?
gravatar for jingjin2203
7 days ago by
jingjin220330 wrote:

Hi all,

I have a bcf file containing SNPs called across 12 different samples. I am specifically interested in the SNPs that occurred in the first 6 samples but are absent in the second 6 samples.

I was wondering if there is any tool that would allow me to do this?

Any help is appreciated! Thanks a lot!

sequencing snp bcftools • 86 views
ADD COMMENTlink modified 7 days ago by guillaume.rbt560 • written 7 days ago by jingjin220330
gravatar for guillaume.rbt
7 days ago by
guillaume.rbt560 wrote:

You can use SnpSift to filter a vcf ( ).

For your problem it would look something like :

cat snps.vcf | java -Xmx20000m -jar SnpSift.jar filter "GEN[0].GT = 1 & GEN[1].GT = 1 & GEN[2].GT = 1 & GEN[3].GT = 1 & GEN[4].GT = 1 & GEN[5].GT = 1 & GEN[6].GT = 1 & GEN[7].GT = 0 & GEN[8].GT = 0 & GEN[9].GT = 0 & GEN[10].GT = 0 & GEN[11].GT = 0 & GEN[12].GT = 0"  > filtered_snps.vcf

(it would keep SNPs that are all genotyped as variant for all first 6 samples, and genotyped as reference for the last 6 samples)

ADD COMMENTlink modified 7 days ago • written 7 days ago by guillaume.rbt560

Thanks a lot! Really appreciated it! I am currently trying your recommendation, but having some trouble. Hope you could further help me out. 1. The SnpSift manual says I can create an expression using sample names instead of genotype numbers, so I put sample1name in GEN[], am I doing it right? 2. The organism I am working on is diploid, so I changed the genotype type to 1/1 instead of 1.

cat snps.vcf | java -Xmx20000m -jar SnpSift.jar filter "GEN[sample1name].GT = 1/1"

It gives me the following output. Could you please correct me if I did anything wrong? Thanks again!

Exception in thread "main" java.lang.RuntimeException: INFO field 'H' not found in VCF header
at org.snpsift.lang.expression.Field.getReturnType(
at org.snpsift.lang.expression.Field.eval(
at org.snpsift.lang.expression.ExpressionBinary.eval(
at org.snpsift.lang.expression.ExpressionBinary.eval(
at org.snpsift.lang.expression.FieldSub.evalIndex(
at org.snpsift.lang.expression.FieldSub.evalIndex(
at org.snpsift.lang.expression.FieldGenotype.evalGenotype(
at org.snpsift.lang.expression.FieldGenotype.getFieldString(
at org.snpsift.lang.expression.Field.eval(
at org.snpsift.lang.expression.ExpressionBinary.eval(
at org.snpsift.SnpSiftCmdFilter.evaluate(
at org.snpsift.SnpSiftCmdFilter.annotate(
at org.snpsift.SnpSift.main(
ADD REPLYlink modified 7 days ago • written 7 days ago by jingjin220330

You're welcome! It seems that there is something wrong with the header of your vcf.

To use samples names instead of their position you have to be sure to have a properly formated header in your vcf.

It must be like described here :

With the samples names like this (it's a tabulation between each fields):


Also the syntax of diploid genotypes requires simples quotes in the SnpSift command :

cat cancer.vcf | java -jar SnpSift.jar filter "GEN[Somatic].GT = '2/1'"
ADD REPLYlink modified 6 days ago • written 6 days ago by guillaume.rbt560

Hi guillaume.rbt, thank you for your kind reply!! After modifying the code, I had the same error as I posted above. I am not aware of anything wrong in the header of my vcf file. I attached a small section of my vcf file, it would be great if you could help me take a look at it. Thanks again!


#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT H3-H-3.sorted.bam H3-H-5.sorted.bam H3-H-6.sorted.bam H3-WZ-1.sorted.bam H3-WZ-2.sorted.bam H3-WZ-5.sorted.bam WZ3-H-1.sorted.bam WZ3-H-2.sorted.bam WZ3-H-6.sorted.bam WZ3-WZ-1.sorted.bam WZ3-WZ-2.sorted.bam WZ3-WZ-3.sorted.bam

7000000185249020 4947 . C T 47.8636 . DP=133;VDB=0.109869;SGB=3.0073;RPB=0.997443;MQB=0.493166;BQB=0.964275;MQ0F=0.0225564;ICB=0.0072327;HOB=0.00347222;AC=1;AN=24;DP4=125,0,6,0;MQ=43 GT:PL 0/0:0,93,183 0/0:0,12,62 0/0:0,18,120 0/0:0,9,18 0/0:0,12,40 0/0:0,36,164 0/0:0,3,36 0/0:0,27,157 0/0:0,51,170 0/0:0,18,63 0/0:0,60,138 0/1:87,0,146

ADD REPLYlink modified 4 days ago • written 4 days ago by jingjin220330

Indeed your VCF header seems ok, have you tried without the samples names but with their position number to see if it would work?

(for example, a test for just your first sample, H3-H-3.sorted.bam, being alternative homozygote : SnpSift.jar filter "GEN[0].GT=='1/1'" )

ADD REPLYlink written 3 days ago by guillaume.rbt560

It actually worked when I tried "GEN[0].GT=='1/1'" instead of using sample name! I guess I will go with that. Sorry for keeping bugging you, I just had another question. Does the 0 in GEN[0] refer to the first sample in VCF file? In my case, I have 12 samples, so I would put GEN[0] to GEN[11] in the code, is that correct? I guess it is a little bit counter intuitive to me using 0 to refer to the first sample.

Thanks a lot! I really appreciated your time and help!

ADD REPLYlink written 2 days ago by jingjin220330

Sorry I couldnt't help you with the sample name issue! Yes indeed the 0 correspond to the first sample, and 11 would be the 12th sample. Best of luck with your analysis

ADD REPLYlink written 2 days ago by guillaume.rbt560
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1654 users visited in the last hour