Annotating genes from the coordinates of variants of a VCF file
1
0
Entering edit mode
2.3 years ago
ManuelDB ▴ 80

I will need to annotate genes from the coordinates of CNV I have in a VCF file. I am a bit new doing this so I explain here my approach and problem. Happy to receive feedback from both things

I have downloaded a gff file from here http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/ Homo_sapiens.GRCh38.105.chr.gtf contains all gene names and coordinates.

Following the gffutils documentation , I need to convert this file into a db. Here is where I am blocked.

The first lines of my line are like this

1       ensembl_havana  gene    1211340 1214153 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
1       ensembl_havana  transcript      1211340 1214153 .       -       .       gene_id "ENSG00000186827"; gene_version "11"; transcript_id "ENST00000379236"; transcript_version "4"; gene_name "TNFRSF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "TNFRSF4-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS11"; tag "basic"; transcript_support_level "1 (assigned to previous version 3);

Gene name is what I need and it is in the last column (attributes). Documentation says

`
>>> fn = gffutils.example_filename('ensembl_gtf.txt')
>>> db = gffutils.create_db(fn, ":memory:",
... id_spec={'gene': 'gene_id', 'transcript': "transcript_id"},
... merge_strategy="create_unique",
... transform=transform_func,
... keep_order=True)

`

I don't get any error but when I try to find the gene by the name

db["TNFRSF4"]

I get this error


FeatureNotFoundError Traceback (most recent call last)

<ipython-input-49-5186f0b4b314> in <module> ----> 1 db["TNFRSF4"]

~/opt/anaconda3/envs/RP/lib/python3.6/site-packages/gffutils/interface.py in __getitem__(self, key) 278 # TODO: raise error if more than one key is found 279 if results is None: --> 280 raise FeatureNotFoundError(key) 281 return self._feature_returner(**results) 282

FeatureNotFoundError: TNFRSF4

gffutils • 1.2k views
ADD COMMENT
1
Entering edit mode
2.3 years ago

Via BEDOPS convert2bed and bedmap, the following command-line approach will tell you the names of genes overlapping each CNV, contained in some file called cnvs.vcf:

wget -qO- http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.chr.gtf \
    | awk -v FS='\t' -v OFS='\t' '($3 == "gene")' \
    | convert2bed --input="gtf" --attribute-key="gene_name" - \
    | bedmap --echo --echo-map-id-uniq <(convert2bed --input="vcf" < cnvs.vcf) - \
    > answer.bed

If you want identifiers of CNVs that overlap each gene, you would simply reverse the order of arguments to bedmap:

wget -qO- http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.chr.gtf \
    | awk -v FS='\t' -v OFS='\t' '($3 == "gene")' \
    | convert2bed --input="gtf" --attribute-key="gene_name" - \
    | bedmap --echo --echo-map-id - <(convert2bed --input="vcf" < cnvs.vcf) \
    > answer.bed

If you need to do this in Python, subprocess could be an option.

ADD COMMENT
0
Entering edit mode

Many thanks for your help. I haven't tried it yet because (as I have asked here) thinking better what I want, this will be implemented in Windows machines and your approach will complicate the application I want to develop. I will use it for validation and I will let you know.

By that way, I didn't know convert2bed, I have been reading your git repo a bit and looks great! I will use it in future code

ADD REPLY

Login before adding your answer.

Traffic: 2450 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6