How to separate SNP variants from indel variants in the same VCF file
11.3 years ago
Jianfengmao ▴ 310

Background: Genomic variants are usually grouped into types such as SNPs, insertions, deletions, and transposable elements.

I would like to get some basic statistics, such as frequencies or counts, for each type of variant using vcftools. I also want to export the variant data for population genetic studies, for example exporting allele frequency data for SNPs only or for indels only.

My question: are there tools or strategies to separate SNP variants from indel variants in the same VCF file (it contains only SNPs and indels) and write them to different VCF files? My work relies on VCF and VCFtools.

Your suggestions would be valuable to me and to others who depend on the VCF format. Thanks in advance.

This question has already been asked on the VCFtools-help mailing list, but I have not received any replies.

vcf vcftools • 13k views
19
11.3 years ago

EDIT

Four years later: this awk script does not work with multiple ALT alleles. Now I would use bcftools filter with TYPE=snp, or my program VCFFilterJS.

you could use the following AWK script:

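# Header lines go to both output files.
/^#/ {
    print $0 > "snv.vcf"
    print $0 > "indels.vcf"
    next
}

# REF is a single nucleotide and ALT is a single letter: SNV.
/^[^\t]+\t[0-9]+\t[^\t]*\t[atgcATGC]\t[a-zA-Z]\t/ {
    print $0 > "snv.vcf"
    next
}

# Everything else is treated as an indel.
{
    print $0 > "indels.vcf"
    next
}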


The script saves the SNVs and the indels in two distinct files snv.vcf and indels.vcf.

The headers are saved in both files.

If a line has a reference allele and an alternate allele that are each a single nucleotide, it is saved to snv.vcf; otherwise it is saved to indels.vcf.

awk -f file.awk file.vcf

2

Pierre Lindenbaum, I learned a lot from your script. I am now learning sed and awk, and your scripts have been enlightening. Thanks a lot.

0

Precisely. The script I was thinking of looks exactly like this one.

2
11.3 years ago

From the VCF specs I would say that you only have to look for single-base changes to detect SNPs, and can consider the rest indels. As described at the bottom of this page, SNPs are the only variants with a single base in both the REF and ALT columns, because indels have a multi-base string in either the REF or the ALT column. If you just check the string lengths in those columns you should have a sufficient filter. Although I haven't tried this through vcftools, it should be straightforward to script such a filter and split your VCF file in two.
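The length check described above can be sketched in awk in a way that also copes with comma-separated ALT alleles. This is only a minimal sketch under some assumptions: demo.vcf is a tiny made-up input, the output names snv.vcf and indels.vcf are just examples, and a site is counted as an SNV only when REF and every ALT allele are a single base. Anything else (mixed sites, symbolic alleles such as <DEL>, multi-base substitutions) ends up in indels.vcf, following the "rest as indels" rule.

```shell
# Hypothetical four-record input so the sketch runs as-is; use your own VCF instead.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\n'
  printf 'chr1\t200\t.\tA\tC,T\t.\tPASS\t.\n'
  printf 'chr1\t300\t.\tAT\tA\t.\tPASS\t.\n'
  printf 'chr1\t400\t.\tG\tGTT,C\t.\tPASS\t.\n'
} > demo.vcf

awk -F'\t' '
  # Header lines are copied to both output files.
  /^#/ { print > "snv.vcf"; print > "indels.vcf"; next }
  {
    # REF must be exactly one base.
    is_snv = ($4 ~ /^[ACGTNacgtn]$/)
    # Every comma-separated ALT allele must be exactly one base as well.
    n = split($5, alt, ",")
    for (i = 1; i <= n; i++)
      if (alt[i] !~ /^[ACGTNacgtn]$/) is_snv = 0
    if (is_snv) { print > "snv.vcf" } else { print > "indels.vcf" }
  }
' demo.vcf
```

With this input, the records at positions 100 and 200 go to snv.vcf and those at 300 and 400 go to indels.vcf. Sending the mixed site at position 400 to indels.vcf is a design choice of this sketch; you may prefer to split multi-allelic records first.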

0

Dear Jorge Amigo, thanks a lot. I am not good at programming, so I asked such a simple question here, but I have begun to learn. Thanks for your kind directions.

0

For this you can use Pierre's awk script directly and you will get the desired result. For your future work I would suggest you keep learning some awk basics, which will help you parse very large files with almost no hassle.

0

Yes, I have benefited a lot from learning sed and awk. Pierre's script enlightened me, and I think I have made a great leap by following it. Thank you all.

2
3.2 years ago
rodd ▴ 160

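You can separate single nucleotide variants from indels using vcftools with the flags --keep-only-indels or --remove-indels. The first keeps only the indel sites and the second drops them, so running the command once with each flag gives the two files. For example, to remove the indels: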

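vcftools --vcf input_file_containing_all_variants.vcf --remove-indels --recode --recode-INFO-all --out output_file_with_indels_removed

Note that --out takes a prefix, so the result is written to output_file_with_indels_removed.recode.vcf.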