Bedtools intersect/bcftools annotate dublicate the SNPs
3
0
Entering edit mode
4.1 years ago
User000 ▴ 690

Hello,

I have a VCF file with SNPs and BED file with genes. I would like to extract only SNPs that reside on the genes. With the script below I could achieve this goal. However, in the output file some SNPs get doubled then and some are not:

chr1B   812345  .   G   A   214.635
chr1B   812345  .   G   A   214.635
chr1B   812413  .   A   T   601.948
chr1B   812413  .   A   T   601.948
chr1B   824119  .   T   C   349.381
chr1B   824169  .   T   C   241.202
chr1B   824175  .   C   T   225.089
chr1B   824186  .   C   G   1480.12
chr1B   824591  .   T   C   164.335

This is very strange, do you think there is some bug there? How to solve this? I am using bcftools version: 1.10.2-27-g9d66868 and bedtools version: v2.29.2

(SAMPLES,) =glob_wildcards('/path/to/{sample}.vcf.gz')
rule all:
    input: 
        expand("{sample}.flanking.vcf", sample=SAMPLES)

rule bedtools:
    input:
        invcf="/path/to/{sample}.vcf.gz",
        bedgz="/path/to/input.bed.gz"
    output:
        outvcf="{sample}.flanking.vcf"
    shell:
        """
        /Tools/bedtools2/bin/bedtools intersect -a {input.invcf} -b {input.bedgz} -header -wa | /Tools/bcftools/bcftools annotate -a {input.bedgz} -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') > {output.outvcf}
        """
bcftools bedtools SNP • 1.7k views
ADD COMMENT
1
Entering edit mode
4.1 years ago
User000 ▴ 690

I found a solution here: link. Hope could be useful for the community. So I modified my shell like this:

/Tools/bedtools2/bin/bedtools intersect -a {input.invcf} -b {input.bedgz} -u -header -wa | /Tools/bcftools/bcftools annotate -a {input.bedgz} -c CHROM,FROM,TO,GENE -l GENE:append -h <(echo '##INFO=<ID=GENE,Number=.,Type=String,Description="Gene name">') > {output.outvcf}

This would give just one single snp with all the genes that overlap for that snp.

ADD COMMENT
1
Entering edit mode
4.1 years ago

Could it be that some SNP fall onto multiple genes ? Because of overlapping genes or sense/antisense gene pairs. To check this, I would just look at the output of

intersect -a {input.invcf} -b {input.bedgz} -header -wa -wb > out.bed

where the option -wb ensure that the full gene entry of input.bedgz is outputed for every match. Btw I'm not sure to see the point of using bcftools annotate here, when you already matched the SNP with the genes using bedtools.

ADD COMMENT
0
Entering edit mode

oh, I use annotate only to append the gene name in the INFO field...

ADD REPLY
0
Entering edit mode

The gene is exactly the same, the scores everything

ADD REPLY
1
Entering edit mode

I can see that the gene is the same after you use bcftools annotate, but is it the same using bedtools interesct -wa -wb ? It shouldn't be the same then.

ADD REPLY
0
Entering edit mode

you are right, there are overlapping genes, basically -wa -wb did not show anything, but I checked the positions on duplicated SNPs manually and see that the genes are overlapping (because I am using flanking +/- 1kb). Any ideas how to solve this? :/

ADD REPLY
0
Entering edit mode

Sorry, I just realized it is the last column showing the genes! and it is showing the right genes, but I need this information (the name of the gene) to be inside the INFO field......how can I do that?

ADD REPLY
0
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 1499 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6