Question: How do we distinguish between SNP/INDEL/SV in 1000 genomes Phase 3 data?
4.6 years ago by
jxiang1510
United States
jxiang1510 wrote:

Hello, 

I'm trying to switch from the Phase I 1000 genomes data to Phase 3. In Phase I, there was an indicator giving the variant type, so you could, for example, filter out SNPs easily with a grep command. However, the tags below were removed in Phase 3.

VT=SNP, indicates the variant is a snp.
VT=INDEL, indicates the variant is an indel,
VT=SV, indicates the variant is a deletion.

Anyone know if there's an easy way to filter out the SNPs? Is there another indicator in the file that I'm missing? 

Thanks!

1000 genomes vcf • 3.1k views
ADD COMMENTlink modified 4.6 years ago by Jorge Amigo11k • written 4.6 years ago by jxiang1510

Help us out: where is the VCF, please?

ADD REPLYlink written 4.6 years ago by Pierre Lindenbaum120k

Here's phase I 

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521

and here's phase 3

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

Thanks

ADD REPLYlink written 4.6 years ago by jxiang1510

Look for "TYPE" tag. 

##INFO=<ID=TYPE,Number=A,Type=String,Description="Type of variant">

 

ADD REPLYlink written 4.6 years ago by Ashutosh Pandey11k

I don't think there's a tag like that in these files. 

ADD REPLYlink written 4.6 years ago by jxiang1510

Strangely, there is no such "TYPE" tag in the latest 1000 genomes Phase 3 data (well, there is one on the X chromosome).

if you are still willing to build a grep-like query:

for file in ALL.chr*.vcf.gz; do zcat "$file" | grep -P '^#|\t[ACGT]\t[ACGT]\t' | gzip > "${file/.vcf.gz/.snps.vcf.gz}"; done

I would go for perl though:

for file in ALL.chr*.vcf.gz; do zcat "$file" | perl -lane 'print if /^#/ or /\t[ACGT]\t[ACGT]\t/' | gzip > "${file/.vcf.gz/.snps.vcf.gz}"; done
ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Jorge Amigo11k
4.6 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

bcftools lets you filter variants by type with the option -v, --types snps|indels|mnps|other (a comma-separated list of variant types to select), and it writes well-formed VCF output. For that reason, and for its performance (the latest HTSlib 1.1 core works like a charm), I would definitely recommend it over grep for parsing VCF files. It is as easy as this command:

bcftools view -v snps all.variants.vcf > snps.only.vcf
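The Phase 3 files are split per chromosome and compressed, so the same command could be wrapped in a loop like the grep one above. This is only a sketch, assuming bcftools 1.x is on the PATH; the helper names snps_name and extract_snps are made up for illustration.

```shell
# snps_name: hypothetical helper, derives the output name from the input name.
snps_name() {
  printf '%s\n' "${1%.vcf.gz}.snps.vcf.gz"
}

# extract_snps: hypothetical wrapper around "bcftools view -v snps".
# -Oz writes bgzip-compressed VCF, -o names the output file,
# and "bcftools index -t" builds a tabix index for the result.
extract_snps() {
  in_vcf="$1"
  out_vcf="$(snps_name "$in_vcf")"
  bcftools view -v snps -Oz -o "$out_vcf" "$in_vcf" && bcftools index -t "$out_vcf"
}

# Usage (not run here):
#   for f in ALL.chr*.vcf.gz; do extract_snps "$f"; done
```

Writing bgzipped output keeps the result usable by tabix and by later bcftools calls, which plain gzip output would not be.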
ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Jorge Amigo11k

Thanks, I'll try that. However, bcftools is probably looking for a tag, just like I am. It would be great to know what it looks for when doing the filtering. Also, is there a particular reason you prefer bcftools to vcftools? Just curious.

ADD REPLYlink written 4.6 years ago by jxiang1510

bcftools is faster. This is even stated on the vcftools Perl tools and API page, and roughly described in a small section of the vcftools site.

ADD REPLYlink written 4.6 years ago by Jorge Amigo11k
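On what bcftools looks for: as far as I know it does not depend on an INFO tag such as VT or TYPE, but works the type out from the REF and ALT alleles of each record. The awk sketch below illustrates that idea only; it is a simplification and not bcftools' actual logic. The function name classify_vcf and the rule "the most complex ALT allele decides the type" are assumptions made for the example.

```shell
# classify_vcf: read VCF text on stdin, print CHROM, POS, REF, ALT and a rough type.
classify_vcf() {
  awk -F'\t' '
    /^#/ { next }                        # skip header lines
    {
      ref = $4
      n = split($5, alts, ",")           # multi-allelic sites list several ALT alleles
      type = "SNP"
      for (i = 1; i <= n; i++) {
        a = alts[i]
        if (a ~ /^<.*>$/) {              # symbolic allele such as <CN0>
          type = "SV"; break
        } else if (length(a) != length(ref)) {
          type = "INDEL"                 # allele lengths differ
        } else if (length(ref) > 1 && type == "SNP") {
          type = "MNP"                   # same length, more than one base
        }
      }
      print $1 "\t" $2 "\t" ref "\t" $5 "\t" type
    }'
}

# Usage (not run here):
#   zcat ALL.chr22.vcf.gz | classify_vcf | cut -f5 | sort | uniq -c
```

Unlike the single-base regex, this also counts a multi-allelic site such as REF=A, ALT=C,T as a SNP.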