Removing indels in VCF file
1
0
Entering edit mode
19 months ago
Colaptes ▴ 100

Hello,

I am trying to do something very simple, but running into confusing behaviour. I have a VCF file of multiple samples and want to remove all indels so that I can generate sequences with identical coordinates with bcftools consensus. I removed indels by specifying

bcftools view --include 'TYPE="snp" || TYPE="ref"' -O z -o newfile.vcf.gz filename.vcf.gz

I expected this to keep only SNPs or invariant sites. However, there are still some indels left in my file, for example, this position still has an indel, which breaks my downstream analyses:

chr_24  805548  .       A       .       14168.2 .       DP=1074;MQ0F=0;AN=102;DP4=550,524,0,0;MQ=59;F_MISSING=0.271429  GT:DP:SP:AD     0/0:7:0:7       0/0:10:0:10     ./.:1:0:1       ./.:1:0:1       ./.:1:0:1       ./.:2:0:2       ./.:1:0:1       0/0:8:0:8       0/0:11:0:11     0/0:29:0:29     0/0:8:0:8       ./.:0:0:0       0/0:57:0:57     0/0:14:0:14     0/0:7:0:7       0/0:6:0:6       0/0:38:0:38     0/0:21:0:21     0/0:9:0:9       0/0:8:0:8       0/0:17:0:17     ./.:2:0:2       0/0:35:0:35     0/0:23:0:23     0/0:13:0:13     ./.:4:0:4       0/0:10:0:10     ./.:3:0:3       0/0:5:0:5       0/0:53:0:53     ./.:3:0:3       ./.:2:0:2       0/0:33:0:33     0/0:48:0:48     0/0:37:0:37     0/0:42:0:42     0/0:35:0:35     0/0:26:0:26     0/0:28:0:28     0/0:40:0:40     ./.:3:0:3       0/0:7:0:7       0/0:27:0:27     0/0:9:0:9       0/0:11:0:11     0/0:10:0:10     ./.:1:0:1       0/0:13:0:13     0/0:6:0:6       0/0:10:0:10     0/0:9:0:9
       ./.:2:0:2       ./.:1:0:1       ./.:0:0:0       0/0:32:0:32     0/0:12:0:12     0/0:9:0:9       0/0:8:0:8       ./.:2:0:2       0/0:6:0:6       ./.:4:0:4       0/0:7:0:7       0/0:15:0:15     0/0:35:0:35     0/0:29:0:29     0/0:7:0:7       0/0:24:0:24     0/0:10:0:10     ./.:4:0:4       0/0:11:0:11
chr_24  805548  .       ATCT    .       35.6358 .       INDEL;IDV=21;IMF=1;DP=1074;VDB=2.39482e-42;SGB=133.525;RPBZ=2.03077;MQBZ=-3.57201;MQSBZ=-1.77777;BQBZ=-1.89605;SCBZ=-1.51717;MQ0F=0;AN=100;DP4=512,464,36,51;MQ=59;F_MISSING=0.285714   GT:DP:SP:AD     0/0:7:0:7       0/0:10:0:3      ./.:1:0:1       ./.:1:0:1       ./.:1:0:0       ./.:2:0:2       ./.:1:0:1       0/0:8:0:6
       0/0:11:0:11     0/0:29:0:29     0/0:8:0:8       ./.:0:0:0       0/0:57:0:57     0/0:14:16:9     0/0:7:0:7       0/0:6:0:6       0/0:38:0:38     0/0:21:0:0      0/0:9:0:9       0/0:8:3:4       0/0:17:0:17     ./.:2:0:2       0/0:35:0:35     0/0:21:0:8      0/0:13:0:13     ./.:4:0:4       ./.:4:0:4       ./.:3:0:3       0/0:5:0:5       0/0:53:0:53     ./.:3:0:3       ./.:2:0:2       0/0:33:0:33     0/0:48:0:48     0/0:37:0:37     0/0:42:0:42     0/0:35:0:35     0/0:26:0:26     0/0:28:0:28     0/0:40:0:40     ./.:3:0:3       0/0:7:0:7       0/0:26:6:10     0/0:9:0:9       0/0:11:0:11     0/0:10:0:10     ./.:1:0:1       0/0:13:0:13     0/0:6:0:6       0/0:10:0:10     0/0:9:0:9       ./.:2:0:2       ./.:1:0:1       ./.:0:0:0       0/0:32:0:32     0/0:12:5:7      0/0:9:0:9       0/0:8:0:8       ./.:2:0:2       0/0:6:0:6       ./.:4:0:4       0/0:7:0:7       0/0:15:0:15     0/0:35:0:35     0/0:29:0:29     0/0:7:0:7       0/0:24:0:13
     0/0:9:0:7       ./.:4:0:4       0/0:11:0:11

Testing TYPE="snp" and TYPE="ref" separately, it seems that it is kept via TYPE="ref". Does anyone know why this indel made it through the --include 'TYPE="ref"' filter, or what I can change to stop it from making it through the filter?

Thank you!

bcftools • 1.5k views
ADD COMMENT
1
Entering edit mode
19 months ago

The problem is that your VCF does not respect the VCF specification. From the spec ( https://samtools.github.io/hts-specs/VCFv4.2.pdf ) , the dot is not allowed for the ALT allele.

you could use awk to remove those variants:

awk -F '\t' '($0 ~ /^#/ || $5!=".")' in.vcf
ADD COMMENT
1
Entering edit mode

Thank you! Strange that the formatting is not proper, I generated the VCF with bcftools and haven't modified it with any other program.

ADD REPLY
0
Entering edit mode

generated the VCF with bcftools

so i may be wrong on that point.

ADD REPLY
0
Entering edit mode

I'm finding the same issue in my data. If Pierre is right, this might be a bug in the program. I've posted an 'issue' request on the developer's GitHub: https://github.com/samtools/bcftools/issues/2299

ADD REPLY
0
Entering edit mode

Maybe it is related with indels left-alignment? I mean the "." in your alt alleles.

ADD REPLY

Login before adding your answer.

Traffic: 2211 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6