Question

Annotating genes from the coordinates of variants of a VCF file

2.3 years ago

ManuelDB ▴ 80

I need to annotate genes from the coordinates of the CNVs I have in a VCF file. I am fairly new to this, so I explain my approach and my problem here. I am happy to receive feedback on both.

I have downloaded a GTF file from http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/. The file, Homo_sapiens.GRCh38.105.chr.gtf, contains all gene names and coordinates.

Following the gffutils documentation, I need to convert this file into a database. This is where I am stuck.

The first lines of my file look like this:

1       ensembl_havana  gene    1211340 1214153 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
1       ensembl_havana  transcript      1211340 1214153 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000379236"; transcript_version "4"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS11"; tag "basic"; transcript_support_level "1 (assigned to previous version 3)";

The gene name is what I need, and it is in the last column (attributes). The documentation says:

`
>>> fn = gffutils.example_filename('ensembl_gtf.txt')
>>> db = gffutils.create_db(fn, ":memory:",
... id_spec={'gene': 'gene_id', 'transcript': "transcript_id"},
... merge_strategy="create_unique",
... transform=transform_func,
... keep_order=True)

`

I don't get any error, but when I try to find the gene by name:

db["TNFRSF4"]

I get this error

FeatureNotFoundError                      Traceback (most recent call last)
<ipython-input-49-5186f0b4b314> in <module>
----> 1 db["TNFRSF4"]

~/opt/anaconda3/envs/RP/lib/python3.6/site-packages/gffutils/interface.py in __getitem__(self, key)
    278         # TODO: raise error if more than one key is found
    279         if results is None:
--> 280             raise FeatureNotFoundError(key)
    281         return self._feature_returner(**results)
    282

FeatureNotFoundError: TNFRSF4

score 1 · Answer 1 · 2021-12-31

Using BEDOPS convert2bed and bedmap, the following command-line approach reports the names of the genes overlapping each CNV in a file called cnvs.vcf:

wget -qO- http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.chr.gtf \
    | awk -v FS='\t' -v OFS='\t' '($3 == "gene")' \
    | convert2bed --input="gtf" --attribute-key="gene_name" - \
    | bedmap --echo --echo-map-id-uniq <(convert2bed --input="vcf" < cnvs.vcf) - \
    > answer.bed

If you instead want the identifiers of the CNVs that overlap each gene, reverse the order of the arguments to bedmap:

wget -qO- http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.chr.gtf \
    | awk -v FS='\t' -v OFS='\t' '($3 == "gene")' \
    | convert2bed --input="gtf" --attribute-key="gene_name" - \
    | bedmap --echo --echo-map-id - <(convert2bed --input="vcf" < cnvs.vcf) \
    > answer.bed

If you need to do this in Python, subprocess could be an option.
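
As a rough sketch of that subprocess route, the first pipeline above could be wrapped roughly as follows. The function names build_pipeline and annotate_cnvs are made up for illustration, and the sketch assumes that bash, wget, awk and the BEDOPS tools are all available on your PATH. It calls bash explicitly because the `<( ... )` process substitution is not available in a plain sh.

```python
import shlex
import subprocess

# Same Ensembl release 105 GTF as in the shell commands above.
GTF_URL = (
    "http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/"
    "Homo_sapiens.GRCh38.105.chr.gtf"
)


def build_pipeline(vcf_path, out_path, gtf_url=GTF_URL):
    """Return the gene-per-CNV shell pipeline as a single string.

    Hypothetical helper: it only assembles the command text, quoting the
    user-supplied paths so that spaces in file names do not break the shell.
    """
    vcf = shlex.quote(vcf_path)
    out = shlex.quote(out_path)
    url = shlex.quote(gtf_url)
    return (
        f"wget -qO- {url}"
        " | awk -v FS='\\t' -v OFS='\\t' '($3 == \"gene\")'"
        " | convert2bed --input=gtf --attribute-key=gene_name -"
        f" | bedmap --echo --echo-map-id-uniq <(convert2bed --input=vcf < {vcf}) -"
        f" > {out}"
    )


def annotate_cnvs(vcf_path, out_path="answer.bed"):
    """Run the pipeline with bash and return the path of the output file."""
    # pipefail makes a failure in any stage (e.g. the download) raise an error
    # instead of silently producing an empty output file.
    cmd = "set -o pipefail; " + build_pipeline(vcf_path, out_path)
    subprocess.run(["bash", "-c", cmd], check=True)
    return out_path
```

Calling annotate_cnvs("cnvs.vcf") should then write answer.bed in the current directory, exactly as the shell version does. Printing build_pipeline("cnvs.vcf", "answer.bed") first is a cheap way to check the command before running it.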