Summarize SNP and indel information in a VCF file
7.0 years ago
Kurban ▴ 200

Through vcftools I got a file (my.var-final.vcf, 27 MB) that contains information on SNPs and indels:

##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT my-sorted.bam
comp904_c0_seq1 30 . G T 73.5 . DP=4;VDB=0.0014;AF1=1;AC1=2;DP4=0,0,4,0;MQ=60;FQ=-39 GT:PL:GQ 1/1:106,12,0:21
comp904_c0_seq1 37 . C T 52 . DP=4;VDB=0.0014;AF1=1;AC1=2;DP4=0,0,3,0;MQ=60;FQ=-36 GT:PL:GQ 1/1:84,9,0:16
comp904_c0_seq1 41 . A T 64.3 . DP=6;VDB=0.0020;AF1=1;AC1=2;DP4=0,0,5,0;MQ=60;FQ=-42 GT:PL:GQ 1/1:97,15,0:27
comp904_c0_seq1 74 . A G 4.77 . DP=21;VDB=0.0147;AF1=0.4999;AC1=1;DP4=10,5,3,1;MQ=60;FQ=6.99;PV4=1,1.2e-06,1,1 GT:PL:GQ 0/1:33,0,255:33
comp904_c0_seq1 133 . G T 137 . DP=36;VDB=0.0404;AF1=0.5;AC1=1;DP4=2,3,19,10;MQ=60;FQ=33;PV4=0.35,1.6e-09,1,1 GT:PL:GQ 0/1:167,0,60:63

Is there any way to summarize this variation information, for example with existing tools or scripts?

snp • 7.0k views

Thank you, guys.

7.0 years ago

There are tons of tools that will give you what you want. Here is one: http://vcftools.sourceforge.net/documentation.html#file
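
For a quick count without installing anything, a minimal Python sketch along these lines may be enough. It assumes an uncompressed, tab-separated VCF and classifies each ALT allele simply by comparing the lengths of REF and ALT. The function names (classify, summarize_vcf) are made up for illustration and are not part of vcftools; the first two sample records come from the question (INFO shortened to DP), and the third (a deletion) is hypothetical.

```python
from collections import Counter

BASES = set("ACGT")

def classify(ref, alt):
    """Return 'snp', 'indel', or 'other' for a single REF/ALT pair."""
    ref, alt = ref.upper(), alt.upper()
    if len(ref) == 1 and len(alt) == 1 and ref in BASES and alt in BASES:
        return "snp"
    # Symbolic alleles such as <DEL> and the missing value '.' are not counted.
    if alt.startswith("<") or alt == ".":
        return "other"
    if len(ref) != len(alt):
        return "indel"
    return "other"

def summarize_vcf(lines):
    """Count variant classes over an iterable of VCF text lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        for alt in alts.split(","):  # multi-allelic sites list several ALTs
            counts[classify(ref, alt)] += 1
    return counts

# First two records are from the question (INFO shortened);
# the last one is a hypothetical deletion added for illustration.
SAMPLE = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "comp904_c0_seq1\t30\t.\tG\tT\t73.5\t.\tDP=4",
    "comp904_c0_seq1\t74\t.\tA\tG\t4.77\t.\tDP=21",
    "comp904_c0_seq1\t90\t.\tGA\tG\t50\t.\tINDEL;DP=10",
]

print(dict(summarize_vcf(SAMPLE)))
```

To run it on a real file, pass an open file handle instead of the list, e.g. summarize_vcf(open("my.var-final.vcf")).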


Dear sir. Kronenberg!

The data set I have analyzed is transcriptome data, and I checked the tools you recommended: http://vcftools.sourceforge.net/perl_module.html

The command I used is this:

kurban@kurban-X550VC:~/Desktop/SNPs/CD$ /home/kurban/Downloads/vcftools_0.1.12b/bin/vcf-indel-stats < my.var-final.vcf > out.txt

and the terminal shows this:

Use of uninitialized value in pattern match (m//) at /home/kurban/Downloads/vcftools_0.1.12b/bin/vcf-indel-stats line 49.
Use of uninitialized value in concatenation (.) or string at /home/kurban/Downloads/vcftools_0.1.12b/bin/vcf-indel-stats line 49.
<: No such file or directory at /home/kurban/Downloads/vcftools_0.1.12b/bin/vcf-indel-stats line 18
main::error('<: No such file or directory') called at /home/kurban/Downloads/vcftools_0.1.12b/bin/vcf-indel-stats line 50
main::init_regions('HASH(0x84c998)') called at /home/kurban/Downloads/vcftools_0.1.12b/bin/vcf-indel-stats line 71
main::do_stats('HASH(0x84c998)') called at /home/kurban/Downloads/vcftools_0.1.12b/bin/vcf-indel-stats line 9

I located vcf-indel-stats before running the command, and this is the location it gives:

kurban@kurban-X550VC:~/Desktop/SNPs/CD$ locate vcf-indel-stats
/home/kurban/.local/share/Trash/files/vcftools_0.1.12b/bin/vcf-indel-stats
/home/kurban/.local/share/Trash/files/vcftools_0.1.12b/perl/vcf-indel-stats
/home/kurban/.local/share/Trash/files/vcftools_0.2.1.12b/bin/vcf-indel-stats
/home/kurban/.local/share/Trash/files/vcftools_0.2.1.12b/perl/vcf-indel-stats


I do not know where I went wrong. Could you please take a look at the command I posted and give me some corrections?

best regards

kurban

7.0 years ago
EagleEye 7.0k

This post might be helpful to you:

Capturing clusters having T to C mutation

You can get a summary of SNPs using this script, which generates a graph and a table for each VCF 4.0 file (there must be one VCF per sample).

https://github.com/santhilalsubhash/TransExtract_betaV1.2 (the wiki still needs improvement).
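
To give an idea of what such a per-sample SNP table can contain, here is a rough Python sketch that tallies substitution types (e.g. T>C) and a transition/transversion ratio from one VCF. This is not the TransExtract script and does not reproduce its output; the function names are hypothetical, the input is assumed to be uncompressed and tab-separated, and the sample records are the five from the question with INFO shortened to DP.

```python
from collections import Counter

BASES = set("ACGT")
# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def substitution_spectrum(lines):
    """Count single-base substitutions (REF>ALT) over VCF text lines."""
    spectrum = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        ref = fields[3].upper()
        for alt in fields[4].upper().split(","):
            is_snp = (len(ref) == 1 and len(alt) == 1
                      and ref in BASES and alt in BASES and ref != alt)
            if is_snp:
                spectrum[f"{ref}>{alt}"] += 1
    return spectrum

def ts_tv_ratio(spectrum):
    """Transitions divided by transversions; NaN if there are no transversions."""
    ts = sum(n for sub, n in spectrum.items() if (sub[0], sub[2]) in TRANSITIONS)
    tv = sum(spectrum.values()) - ts
    return ts / tv if tv else float("nan")

# The five records shown in the question, INFO shortened to DP only.
SAMPLE = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "comp904_c0_seq1\t30\t.\tG\tT\t73.5\t.\tDP=4",
    "comp904_c0_seq1\t37\t.\tC\tT\t52\t.\tDP=4",
    "comp904_c0_seq1\t41\t.\tA\tT\t64.3\t.\tDP=6",
    "comp904_c0_seq1\t74\t.\tA\tG\t4.77\t.\tDP=21",
    "comp904_c0_seq1\t133\t.\tG\tT\t137\t.\tDP=36",
]

spectrum = substitution_spectrum(SAMPLE)
for sub, n in sorted(spectrum.items()):
    print(sub, n)
print("Ts/Tv:", ts_tv_ratio(spectrum))
```

With one VCF per sample, the same function can be called once per file and the resulting counters written out as rows of a table.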

7.0 years ago

Have you had a look at Variant Effect Predictor? (http://www.ensembl.org/info/docs/tools/vep/index.html)

I have used it for SNP calling, I believe it reports indels as well. If you have genomic coordinates it will simply match these against the reference genome for your species and report differences, while predicting the effect of these variations at the mRNA/protein level, etc.


Thank you, Natasha. I searched the net for information on the tool you recommended, and it sounds like a pretty good tool. It may sound weird to you, but I am in Urumqi, China, where I sometimes cannot open certain sites. The link you provided (maybe the home page?) also could not be viewed. I do not know why, but sometimes that happens.

best regards

5.3 years ago

The SNiPlay online pipeline implements VCFtools and can summarize statistics from a VCF file: http://sniplay.southgreen.fr/cgi-bin/analysis_v3.cgi

13 months ago

If you have a large number of VCFs that you're looking to summarize, it might be worthwhile to check out our tool: https://github.com/czbiohub/cerebra