Question: How To Separate SNP Variants From Indel Variants In The Same VCF File
8.0 years ago by
Jianfengmao310
Jianfengmao310 wrote:

Background: Genomic variants are usually grouped into different types, such as SNPs, insertions, deletions, and transposable elements.

I would like to use vcftools to get some basic statistics, such as frequencies or counts, for each type of variant. I also want to export these data for population genetic studies, for example allele frequency data for SNPs only or for indels only.

My question: are there tools or strategies to separate SNP variants from indel variants in the same VCF file (which contains only SNPs and indels) and save them in different VCF files? My work depends on VCF and VCFtools.

Your suggestions would be really valuable for me and for others who depend on the VCF format. Thanks in advance.

This question has already been asked on the VCFtools-help mailing list, but I have not received any replies.

vcf vcftools • 8.8k views
ADD COMMENTlink modified 8.0 years ago by Pierre Lindenbaum118k • written 8.0 years ago by Jianfengmao310
8.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

 

EDIT

Four years later: this awk script does not work with multiple ALT alleles. Now I would use bcftools filter with TYPE=snp, or my program VCFFilterJS: https://github.com/lindenb/jvarkit/wiki/VCFFilterJS

You could use the following AWK script:

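# Header lines: copy them to both output files.
/^#/    {
    print $0 > "snv.vcf";
    print $0 > "indels.vcf";
    next;
    }

# REF (column 4) is a single base and ALT (column 5) a single letter: treat as an SNV.
/^[^\t]+\t[0-9]+\t[^\t]*\t[atgcATGC]\t[a-zA-Z]\t/   {
    print $0 > "snv.vcf";
    next;
    }

# Every remaining record is written to the indel file.
    {
    print $0 > "indels.vcf";
    next;
    }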

The script saves the SNVs and the indels in two distinct files "snv.vcf" and "indels.vcf".

The headers are saved in both files.

If the line has a reference and an alternate allele that are each a single nucleotide, the line is saved to "snv.vcf"; otherwise it is saved to "indels.vcf".

  awk -f file.awk file.vcf
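For records with several comma-separated ALT alleles, a variant of the same idea could split the ALT column and require every allele to be a single base. This is only a minimal sketch, not a replacement for bcftools: the input name input.vcf and the sample records are made up for illustration, and treating N as a valid base is an assumption. Sites that mix an SNV allele with an indel allele, and any symbolic ALT, end up in "indels.vcf".

```shell
# Tiny hypothetical input so the sketch can be run as-is.
printf '##fileformat=VCFv4.2\n' > input.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> input.vcf
printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n' >> input.vcf
printf 'chr1\t200\t.\tA\tG,T\t.\t.\t.\n' >> input.vcf
printf 'chr1\t300\t.\tAT\tA\t.\t.\t.\n' >> input.vcf
printf 'chr1\t400\t.\tA\tG,AT\t.\t.\t.\n' >> input.vcf

awk -F'\t' '
# Header lines go to both output files.
/^#/ { print > "snv.vcf"; print > "indels.vcf"; next }
{
    # SNV only if REF and every comma-separated ALT allele is one base.
    # Assumption: N counts as a base.
    is_snv = ($4 ~ /^[ACGTNacgtn]$/)
    n = split($5, alt, ",")
    for (i = 1; i <= n; i++)
        if (alt[i] !~ /^[ACGTNacgtn]$/) is_snv = 0
    print > (is_snv ? "snv.vcf" : "indels.vcf")
}' input.vcf
```

With this input, the records at positions 100 and 200 go to "snv.vcf", and those at 300 and 400 go to "indels.vcf".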
ADD COMMENTlink modified 4.2 years ago • written 8.0 years ago by Pierre Lindenbaum118k
2

Pierre Lindenbaum, I learned a lot from your script. I am now learning sed and awk, and your scripts enlightened me. Thanks a lot.

ADD REPLYlink written 8.0 years ago by Jianfengmao310

Precisely. The script I was thinking of looks exactly like this one.

ADD REPLYlink written 8.0 years ago by Jorge Amigo11k


 

ADD REPLYlink modified 3.9 years ago • written 4.2 years ago by modernsynthesis20
8.0 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

From the VCF specs, I would say that you only have to look for single-base changes to detect SNPs and consider the rest INDELs. As described at the bottom of this page, SNPs are the only variants with a single base in both the REF and ALT columns, because INDELs have a multi-base string in either the REF or the ALT column. I think that checking the string lengths in those columns is a sufficient filter. Although I haven't tried this through vcftools, it should be straightforward to script such a filter and split your VCF file in two.
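The length check described above can be sketched in awk, here used to count the two classes rather than to split the file. This is only an illustration: example.vcf, counts.txt and the sample records are hypothetical, and the check assumes a single ALT allele written as a plain base sequence (a comma-separated or symbolic ALT would be misclassified).

```shell
# Tiny hypothetical input so the sketch can be run as-is.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > example.vcf
printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n' >> example.vcf
printf 'chr1\t300\t.\tAT\tA\t.\t.\t.\n' >> example.vcf
printf 'chr1\t500\t.\tC\tCGG\t.\t.\t.\n' >> example.vcf

awk -F'\t' '
# Skip header lines; classify each record by the lengths of REF ($4) and ALT ($5).
!/^#/ {
    if (length($4) == 1 && length($5) == 1) snp++
    else indel++
}
END { printf "SNP\t%d\nINDEL\t%d\n", snp, indel }
' example.vcf > counts.txt

cat counts.txt
```

For this input, counts.txt reports 1 SNP and 2 INDELs.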

ADD COMMENTlink written 8.0 years ago by Jorge Amigo11k

Dear Jorge Amigo, thanks a lot. I am not good at programming, so I asked such a simple question here, but I have begun to learn programming. Thanks for your kind directions.

ADD REPLYlink written 8.0 years ago by Jianfengmao310

For this matter you can use Pierre's awk script directly, and you will get the desired results. For your future work, I would suggest that you keep learning some awk basics, which will surely help you parse large files with almost no hassle.

ADD REPLYlink written 8.0 years ago by Jorge Amigo11k

Yes, I have benefited a lot from learning sed and awk. Pierre's script enlightened me; I think I have made a great leap by following Pierre's scripts. Thank you all.

ADD REPLYlink written 8.0 years ago by Jianfengmao310