Question

SpliceAI custom gene annotation file

3 months ago

ClkElf ▴ 50

Dear all, I would like to run SpliceAI on a set of variants, but its default gene annotation file (grch38.txt) was created from the GENCODE v24 canonical annotation files. Since GENCODE v47 is the latest release, I would like to build an annotation file from it in the same format as the SpliceAI default file. My problem is that GENCODE GTF/GFF3 files are transcript-based, whereas grch38.txt is gene-based. As an example, the structure of gencode.v47.basic.annotation.gtf and of grch38.txt is shown below, in that order.

gencode.v47.basic.annotation.gtf

##description: evidence-based annotation of the human genome (GRCh38), version 47 (Ensembl 113)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2024-07-19
chr1    HAVANA  gene    11121   24894   .       +       .       gene_id "ENSG00000290825.2"; gene_type "lncRNA"; gene_name "DDX11L16"; level 2; tag "overlaps_pseudogene";
chr1    HAVANA  transcript      11426   14409   .       +       .       gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; level 2; tag "basic"; tag "TAGENE";
chr1    HAVANA  exon    11426   11671   .       +       .       gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 1; exon_id "ENSE00004248702.1"; le>
chr1    HAVANA  exon    12010   12227   .       +       .       gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 2; exon_id "ENSE00004248735.1"; le>
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 3; exon_id "ENSE00003582793.1"; le>
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 4; exon_id "ENSE00004248703.1"; le>
chr1    HAVANA  gene    12010   13670   .       +       .       gene_id "ENSG00000223972.6"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";

grch38.txt

#NAME   CHROM   STRAND  TX_START        TX_END  EXON_START      EXON_END
OR4F5   1       +       69090   70008   69090,  70008,
OR4F16  1       -       685715  686654  685715, 686654,
SAMD11  1       +       925737  944575  925737,925921,930154,931038,935771,939039,939274,941143,942135,942409,942558,943252,943697,943907,      925800,926013,930336,931089,935896,939129,939460,941306,942251,942488,943058,943377,943808,944575,
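
To make the target layout concrete, here is a minimal sketch that parses one row of this format into exon intervals. The names `GeneRow` and `parse_annotation_row` are hypothetical, and splitting on any whitespace is an assumption based on the excerpt above. In the rows shown, TX_START equals the first EXON_START and TX_END equals the last EXON_END, and both exon lists carry a trailing comma.

```python
from typing import List, NamedTuple, Tuple


class GeneRow(NamedTuple):
    """One row of a grch38.txt-style annotation file (hypothetical helper type)."""
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    exons: List[Tuple[int, int]]  # (EXON_START, EXON_END) pairs, in file order


def parse_annotation_row(line: str) -> GeneRow:
    """Parse a single non-header row; assumes seven whitespace-separated columns."""
    name, chrom, strand, tx_start, tx_end, starts, ends = line.split()
    # The exon columns are comma-separated and end with a trailing comma,
    # so empty pieces are dropped before converting to integers.
    exon_starts = [int(x) for x in starts.split(",") if x]
    exon_ends = [int(x) for x in ends.split(",") if x]
    if len(exon_starts) != len(exon_ends):
        raise ValueError(f"unequal exon start/end counts for {name}")
    return GeneRow(name, chrom, strand, int(tx_start), int(tx_end),
                   list(zip(exon_starts, exon_ends)))
```

Any file generated from GENCODE v47 would presumably need to round-trip through a parser like this: one row per gene name, with matching numbers of exon starts and ends.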

My question is: is there a way to create a gene-based file from a transcript-based GTF/GFF3 so that each gene appears only once? I need to collapse transcripts into genes somehow.

Thank you very much in advance for your help.

GENCODE SpliceAI • 425 views


score 1 · Answer 1 · 2025-02-25

3 months ago

benformatics 4.1k

https://github.com/broadinstitute/SpliceAI-lookup/blob/master/annotations/convert_gtf_to_SpliceAI_annotation_input_format.py


Dear benformatics, thank you very much for your comment. I tried that script previously, but its output is transcript-level, not gene-level. Also, when I replaced the transcript IDs with gene names, there were many duplicate entries, unlike in the SpliceAI default file.

Here is the output of the script above:

#NAME   CHROM   STRAND  TX_START    TX_END  EXON_START  EXON_END
ENST00000450305.2   chr1    +   12009   13670   12009,12178,12612,12974,13220,13452,    12057,12227,12697,13052,13374,13670,
ENST00000488147.2   chr1    -   14695   24886   14695,14969,15795,16606,16857,17232,17605,17914,18267,24737,    14829,15038,15947,16765,17055,17368,17742,18061,18366,24886,

Best

Reply by ClkElf ▴ 50, 3 months ago
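
One possible way to get from transcript-level records to one row per gene is to keep a single representative transcript for each gene rather than merging transcripts. The sketch below is not the linked script and not SpliceAI's own procedure. It assumes the GTF is tab-delimited, that recent GENCODE releases mark one transcript per gene with `tag "Ensembl_canonical"`, and that this tag is present in the file being used. The function names and the `strip_chr` option are hypothetical. Start coordinates are written as GTF start minus 1, a convention inferred from the script output shown in this thread (12009 for a feature that starts at 12010 in the GTF).

```python
import re
from collections import defaultdict

# Matches GTF attributes such as: gene_id "ENSG..."; level 2; tag "basic";
ATTR_RE = re.compile(r'(\S+) "?([^";]+)"?;')


def parse_attributes(text):
    """Return {key: [values]} for a GTF attribute column (keys like `tag` repeat)."""
    attrs = defaultdict(list)
    for key, value in ATTR_RE.findall(text):
        attrs[key].append(value)
    return attrs


def gtf_to_gene_rows(gtf_lines, tag="Ensembl_canonical", strip_chr=False):
    """Build one grch38.txt-style row per gene name from tagged transcripts."""
    chosen = {}                # gene_name -> (chrom, strand, transcript_id)
    exons = defaultdict(list)  # transcript_id -> [(start minus 1, end)]
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] not in ("transcript", "exon"):
            continue
        chrom, strand = fields[0], fields[6]
        attrs = parse_attributes(fields[8])
        tx_id = attrs["transcript_id"][0]
        if fields[2] == "exon":
            # Exons are collected for every transcript; selection happens
            # on the transcript lines only.
            exons[tx_id].append((int(fields[3]) - 1, int(fields[4])))
        elif tag in attrs["tag"]:
            if strip_chr and chrom.startswith("chr"):
                chrom = chrom[3:]
            # Gene names are not guaranteed unique across gene IDs;
            # the first tagged transcript seen for a name is kept.
            chosen.setdefault(attrs["gene_name"][0], (chrom, strand, tx_id))
    rows = []
    for name, (chrom, strand, tx_id) in chosen.items():
        ordered = sorted(exons[tx_id])
        if not ordered:
            continue
        starts = ",".join(str(s) for s, _ in ordered) + ","
        ends = ",".join(str(e) for _, e in ordered) + ","
        rows.append("\t".join([name, chrom, strand, str(ordered[0][0]),
                               str(ordered[-1][1]), starts, ends]))
    return rows
```

In use, the open GTF file handle would be passed as `gtf_lines`, and the returned rows written under the `#NAME CHROM STRAND TX_START TX_END EXON_START EXON_END` header line (tab-separated). Whether the result behaves like the default file in SpliceAI would still need checking, for example by comparing a few genes against grch38.txt.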