Question: How To Separate Snp Variants From Indel Variants In The Same Vcf File
9.0 years ago by
Jianfengmao310 wrote:

Background: Genomic variants are usually grouped into types such as SNPs, insertions, deletions, and transposable elements.

I would like to get some basic statistics, such as frequencies or counts, for each type of variant using vcftools. I also want to export the variant data for population genetic studies, for example exporting allele frequency data for SNPs only or for indels only.

My question: are there tools or strategies to separate SNP variants from indel variants in the same VCF file (it contains only SNPs and indels) and save them to different VCF files? My work relies on VCF and VCFtools.

Your suggestions would be really valuable to me and to others who depend on the VCF format. Thanks in advance.

This question has already been asked on the VCFtools-help mailing list, but I have not received any replies.

vcf vcftools • 9.7k views
modified 10 months ago by rodd90 • written 9.0 years ago by Jianfengmao310
9.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum127k wrote:


Four years later: this awk script would not work with multiple ALT alleles. Now I would use bcftools filter with TYPE=snp, or my program VCFFilterJS.
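As a rough sketch of the bcftools route mentioned above, the variant-type selection of `bcftools view` (`-v snps`, `-v indels`) can write the two classes to separate files. The input name `file.vcf`, the output names, and the helper function `split_with_bcftools` are assumptions for illustration only; sites with mixed allele types may need extra handling, so check the bcftools documentation for your version.

```shell
# Hypothetical input name; replace it with your own VCF.
in_vcf="file.vcf"

# Illustrative helper (not part of bcftools): write SNPs and indels
# to two uncompressed VCF files using bcftools view's type selection.
split_with_bcftools() {
    bcftools view -v snps   -O v -o snps.bcftools.vcf   "$1"
    bcftools view -v indels -O v -o indels.bcftools.vcf "$1"
}

# Only run when bcftools is installed and the input file exists.
if command -v bcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
    split_with_bcftools "$in_vcf"
else
    echo "bcftools or $in_vcf not found; skipping" >&2
fi
```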

You could use the following AWK script:

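# Header lines go to both output files.
/^#/    {
    print $0 > "snv.vcf";
    print $0 > "indels.vcf";
    next;
}
# Single-base REF and single-letter ALT: save as an SNV.
/^[^\t]+\t[0-9]+\t[^\t]*\t[atgcATGC]\t[a-zA-Z]\t/   {
    print $0 > "snv.vcf";
    next;
}
# Every other line is saved as an indel.
{
    print $0 > "indels.vcf";
}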

The script saves the SNVs and the indels in two distinct files snv.vcf and indels.vcf.

The headers are saved in both files.

If the line has a reference base and an alternate base that are each a single nucleotide, the line is saved to snv.vcf; otherwise it is saved to indels.vcf.

awk -f file.awk file.vcf
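The script above matches only a single-letter ALT column, so a record such as `A  G,T` ends up in indels.vcf. A minimal sketch of a variant that splits ALT on commas is shown here. The example data and the file name `example.vcf` are made up for illustration, and the decision to send mixed sites (one SNV allele plus one indel allele) and symbolic alleles to indels.vcf is an assumption, not something stated in the answer.

```shell
# Build a tiny example VCF (hypothetical, tab-separated data).
printf '##fileformat=VCFv4.2\n' > example.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> example.vcf
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\n' >> example.vcf
printf 'chr1\t200\t.\tA\tG,T\t.\tPASS\t.\n' >> example.vcf
printf 'chr1\t300\t.\tAT\tA\t.\tPASS\t.\n' >> example.vcf
printf 'chr1\t400\t.\tA\tG,AT\t.\tPASS\t.\n' >> example.vcf

awk -F '\t' '
# Header lines go to both output files.
/^#/ { print > "snv.vcf"; print > "indels.vcf"; next }
{
    # A record counts as an SNV only if REF and every ALT allele
    # are a single nucleotide; anything else goes to indels.vcf.
    n = split($5, alts, ",")
    is_snv = ($4 ~ /^[ACGTNacgtn]$/)
    for (i = 1; i <= n; i++)
        if (alts[i] !~ /^[ACGTNacgtn]$/) is_snv = 0
    if (is_snv) print > "snv.vcf"
    else        print > "indels.vcf"
}' example.vcf
```

With this example input, the records at positions 100 and 200 are written to snv.vcf and those at 300 and 400 to indels.vcf.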
modified 4 months ago by RamRS26k • written 9.0 years ago by Pierre Lindenbaum127k

Pierre Lindenbaum, I learned a lot from your script. I am now learning sed and awk, and your scripts enlightened me. Thanks a lot.

written 9.0 years ago by Jianfengmao310

Precisely. The script I was thinking of looks exactly like this one.

written 9.0 years ago by Jorge Amigo11k
9.0 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

From the VCF specs, I would say that you only have to look for single-base changes to detect SNPs and treat the rest as indels. As described at the bottom of this page, SNPs are the only variants with a single base in both the REF and ALT columns, because indels have a multi-base string in either the REF or the ALT column. If you just check the string lengths in those columns, you should have a sufficient filter. I haven't tried this through vcftools, but it should be straightforward to script such a filter and split your VCF file in two.
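The length check described above can also give the counts asked for in the question. The following is only a sketch under that simple rule: the file names `lengths_example.vcf` and `variant_counts.txt` and the example records are invented, and a multi-allelic ALT such as `G,T` would be counted as an indel because the ALT string is longer than one character.

```shell
# Tiny hypothetical VCF: two single-base changes and one deletion.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > lengths_example.vcf
printf 'chr1\t10\t.\tA\tG\t.\tPASS\t.\n' >> lengths_example.vcf
printf 'chr1\t20\t.\tC\tT\t.\tPASS\t.\n' >> lengths_example.vcf
printf 'chr1\t30\t.\tAT\tA\t.\tPASS\t.\n' >> lengths_example.vcf

# Count records by the lengths of the REF ($4) and ALT ($5) columns.
awk -F '\t' '
!/^#/ {
    if (length($4) == 1 && length($5) == 1) snp++
    else indel++
}
END { printf "SNP\t%d\nINDEL\t%d\n", snp, indel }' lengths_example.vcf > variant_counts.txt

cat variant_counts.txt
```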

written 9.0 years ago by Jorge Amigo11k

Dear Jorge Amigo, thanks a lot. I am not good at programming, which is why I asked such a simple question here, but I have begun to learn. Thanks for your kind directions.

written 9.0 years ago by Jianfengmao310

For this matter you can use Pierre's awk script directly, and you will get the desired results. For your future work, I would suggest that you keep learning some awk basics, which will surely help you parse large files with almost no hassle.

written 9.0 years ago by Jorge Amigo11k

Yes, I have benefited a lot from learning sed and awk. Pierre's script enlightened me, and I think I have made a great leap by following it. Thank you all.

written 9.0 years ago by Jianfengmao310
10 months ago by
London, United Kingdom
rodd90 wrote:

You can separate single nucleotide variants from indels using vcftools and the flags --keep-only-indels or --remove-indels:

vcftools --vcf input_file_containing_all_variants.vcf --remove-indels --recode --recode-INFO-all --out output_file_with_indels_removed.vcf
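To get both files asked for in the question, the same command can be run twice, once with each flag. This is a sketch only: the helper function `split_with_vcftools` and the output prefixes `snps_only` and `indels_only` are made-up names. Note that `--out` takes a prefix, so with `--recode` vcftools writes `<prefix>.recode.vcf`.

```shell
# Input name taken from the command above; replace it with your own file.
in_vcf="input_file_containing_all_variants.vcf"

# Illustrative helper (not part of vcftools): one run per variant class.
split_with_vcftools() {
    # Writes snps_only.recode.vcf (indels removed).
    vcftools --vcf "$1" --remove-indels --recode --recode-INFO-all --out snps_only
    # Writes indels_only.recode.vcf (only indels kept).
    vcftools --vcf "$1" --keep-only-indels --recode --recode-INFO-all --out indels_only
}

# Only run when vcftools is installed and the input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
    split_with_vcftools "$in_vcf"
else
    echo "vcftools or $in_vcf not found; skipping" >&2
fi
```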
modified 10 months ago • written 10 months ago by rodd90