Question

Problems Extracting Non-Snps From A Vcf File

0

Entering edit mode

12.5 years ago

GPR ▴ 390

Hello,

In an SNP analysis, I am trying to extract those editing sites no found in the dbSNPs vcf file I have downloaded a couple of files (All SNPs and Common/Medical SNPs) from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF.

Following this, I have compared my VarScan *.vcf outputs with the SNP.vcf ones using 3 different approaches:

VarScan compare input.vcf SNP.vcf unique1 input-SNPvcf 

bedtools intersect -v -a input.vcf -b SNP.vcf > input-SNP.vcf 

bedops --not-element-of -1  input-sorted.bed SNP-sorted.bed > inputs-sorted-SNP.bed

In all 3 cases, the SNP-output is identical to the input.vcf/bed.

These command-lines however work when I use an alu.bed or a repeat-masker-bed.

Is it just that my analysis contains no known SNPs? I have discarded for obvious reasons.

Can somebody point a the reason/solution to this problem?

Thanks, G.

varscan bedtools dbsnp • 3.5k views

ADD COMMENT • link updated 12.5 years ago by Istvan Albert 102k • written 12.5 years ago by GPR ▴ 390

2

Entering edit mode

in general I recommend a "defensive style" of data analysis - first try to verify that the commands as invoked do what you think they should be doing. For example add a new, artificial entry to your input file and see wether you can find it via your commands.

ADD REPLY • link 12.5 years ago by Istvan Albert 102k

2

Entering edit mode

Do the chromosome names match? In particular, is one file using 'chr' prefix and the other not?

ADD REPLY • link 12.5 years ago by Sean Davis 27k

1

Entering edit mode

That's actually the reason of my despair. The dbSNPs file doesn't have the 'chr' prefix. Is there a way of adding the 'chr' prefix to the SNP.vcf?

ADD REPLY • link 12.5 years ago by GPR ▴ 390

2

Entering edit mode

Just use awk or perl or another scripting tool to edit the first field of the non-comment lines in your VCF file.

ADD REPLY • link 12.5 years ago by Alex Reynolds 36k

score 2 · Answer 1 · 2013-01-30

2

Entering edit mode

12.5 years ago

dankoboldt ▴ 140

GPR, thanks for the question. Sean and Alex, thank you (again) for answering!

First of all, here's a quick command to strip a leading "chr" from a VCF file (this actually edits the file): perl -pi -e 's/chr//' my.vcf

Alternatively you could do so while creating a new file perl -pe 's/chr//' my.vcf >new.vcf

I'd actually recommend a different tool for comparing variant calls to dbSNP using VCF files. That tool (also developed here) is called joinx, and it's available on GitHub:

http://gmt.genome.wustl.edu/joinx/1.6/index.html

The joinx "vcf-annotate" command will use the information in dbSNP to update the ID and INFO fields of your variant VCF... then you have a new VCF in which known SNPs are marked with dbSNP information, which is broadly useful.

ADD COMMENT • link 12.5 years ago by dankoboldt ▴ 140

0

Entering edit mode

Wow, you went on an major answering blitz - no VarScan question left behind! Awesome. And thanks a lot.

ADD REPLY • link 12.5 years ago by Istvan Albert 102k

0

Entering edit mode

Just saw this. Thanks for the recommendation. I actually ended up using awk to change the Chr IDs. I used bedtools to intersect difference and it seems fine. I will however try your suggestion.

ADD REPLY • link 12.4 years ago by GPR ▴ 390