Question: Extracting Specific Reads From Vcf File
gravatar for Bnfoguy
8.5 years ago by
New Jersey, USA
Bnfoguy70 wrote:


I am trying to extract indels with coverage ~1 and they are in a VCF file. How do I write a python script or a shell commnad to extract those indels with coverage 1 only?

This is the data:

chr1    1548545 .       T       TAGATCG 22.5    .       INDEL;DP=2;VDB=0.0251;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-37.5    GT:PL:GQ        1/1:60,3,0:6
chr1    1573532 .       C       CT      4.42    .       INDEL;DP=1;AF1=1;AC1=2;DP4=0,0,0,1;MQ=44;FQ=-37.5       GT:PL:GQ        0/1:40,3,0:3
chr1    1577221 .       AAC     AACAAAC 81.2    .       INDEL;DP=2;VDB=0.0201;AF1=1;AC1=2;DP4=0,0,1,1;MQ=60;FQ=-40.5    GT:PL:GQ        1/1:120,6,0:10
chr1    1577223 .       C       CAAAC   81.2    .       INDEL;DP=2;VDB=0.0185;AF1=1;AC1=2;DP4=0,0,1,1;MQ=60;FQ=-40.5    GT:PL:GQ        1/1:120,6,0:10
chr1    1584063 .       T       TT      3.25    .       INDEL;DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-37.5       GT:PL:GQ        0/1:38,3,0:4
chr1    1635699 .       AA      A       3.66    .       INDEL;DP=2;VDB=0.0231;AF1=1;AC1=2;DP4=0,0,1,1;MQ=20;FQ=-40.5    GT:PL:GQ        1/1:40,6,0:4
chr1    1637760 .       T       TTTCTTT 22.5    .       INDEL;DP=1;AF1=1;AC1=2;DP4=0,0,0,1;MQ=60;FQ=-37.5       GT:PL:GQ        1/1:60,3,0:6

Thank you for your help!

vcf indel script file python • 3.7k views
ADD COMMENTlink modified 3.6 years ago by Begali0 • written 8.5 years ago by Bnfoguy70
gravatar for Pierre Lindenbaum
8.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum133k wrote:

something like

awk -F '      '       '(!($4 ~ /[ATGC]/ && $5 ~/[ATGC]/ ))' < your.vcf | grep "DP=1;"

or better

grep INDEL < your.vcf | grep "DP=1;"
ADD COMMENTlink modified 8.5 years ago • written 8.5 years ago by Pierre Lindenbaum133k

Thanks Mr. Lindenbaum. You are a lifesaver!

ADD REPLYlink written 8.5 years ago by Bnfoguy70
gravatar for tommy.carstensen
8.4 years ago by
tommy.carstensen30 wrote:

Alternatively: awk '/INDEL/ && /DP=1/' your.vcf

ADD COMMENTlink written 8.4 years ago by tommy.carstensen30
gravatar for bioinfo
8.4 years ago by
bioinfo790 wrote:

I prefer ( I assume the vcf file contains just Indels not SNPs)

more indel.vcf | grep "DP=1" &gt > filtered.indel.vcf

If you have SNPs and indel together in the same vcf file then

more indel.vcf | grep "INDEL" | grep "DP=1" &gt > filtered.indel.vcf
ADD COMMENTlink modified 3.5 years ago • written 8.4 years ago by bioinfo790
gravatar for Begali
3.6 years ago by
Begali0 wrote:


I am new for tools such as samtools, bcftools, BWA and GATK and I am working on RADSeq I reach for point call variants which include ( INDEL and SNP) now time to filter them I could not do that by GATK with VQSR which reuired resources bundle folder which are not available for my organisms plant ( 5 samples for Cardamine hirsuta and 197 samples for Arabidopsis thaliana ) so only can do Hard-filtering however I need to plot my data to see my original distribution so I can determine thresholds which need to be filtered by considering lower est thresholds ... so I need to grep only INFO col first then how to select DP ,VDB ;AF1;MQ and so one so I can obtain file for each one which will be easier to plot then in R ...

Thanks in advance

ADD COMMENTlink written 3.6 years ago by Begali0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1516 users visited in the last hour