Hello,
I have a VCF file with SNPs and BED file with genes. I would like to extract only SNPs that reside on the genes. With the script below I could achieve this goal. However, in the output file some SNPs get doubled then and some are not:
chr1B 812345 . G A 214.635
chr1B 812345 . G A 214.635
chr1B 812413 . A T 601.948
chr1B 812413 . A T 601.948
chr1B 824119 . T C 349.381
chr1B 824169 . T C 241.202
chr1B 824175 . C T 225.089
chr1B 824186 . C G 1480.12
chr1B 824591 . T C 164.335
This is very strange, do you think there is some bug there? How to solve this? I am using bcftools version: 1.10.2-27-g9d66868 and bedtools version: v2.29.2
(SAMPLES,) =glob_wildcards('/path/to/{sample}.vcf.gz')
rule all:
input:
expand("{sample}.flanking.vcf", sample=SAMPLES)
rule bedtools:
input:
invcf="/path/to/{sample}.vcf.gz",
bedgz="/path/to/input.bed.gz"
output:
outvcf="{sample}.flanking.vcf"
shell:
"""
/Tools/bedtools2/bin/bedtools intersect -a {input.invcf} -b {input.bedgz} -header -wa | /Tools/bcftools/bcftools annotate -a {input.bedgz} -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') > {output.outvcf}
"""
oh, I use annotate only to append the gene name in the INFO field...
The gene is exactly the same, the scores everything
I can see that the gene is the same after you use bcftools annotate, but is it the same using bedtools interesct -wa -wb ? It shouldn't be the same then.
you are right, there are overlapping genes, basically -wa -wb did not show anything, but I checked the positions on duplicated SNPs manually and see that the genes are overlapping (because I am using flanking +/- 1kb). Any ideas how to solve this? :/
Sorry, I just realized it is the last column showing the genes! and it is showing the right genes, but I need this information (the name of the gene) to be inside the INFO field......how can I do that?