Dear all,
I would like to use SpliceAI for a set of variants, but its default gene annotation file (grch38.txt) was created from the GENCODE v24 canonical annotation. Since GENCODE v47 is the latest release, I would like to create a gene annotation file from it in the same format as the SpliceAI default file. My problem is that GENCODE gtf/gff3 files are transcript-based, unlike grch38.txt, which is gene-based. As an example, you can see below the structure of gencode.v47.basic.annotation.gtf and grch38.txt, respectively.
gencode.v47.basic.annotation.gtf
##description: evidence-based annotation of the human genome (GRCh38), version 47 (Ensembl 113)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2024-07-19
chr1 HAVANA gene 11121 24894 . + . gene_id "ENSG00000290825.2"; gene_type "lncRNA"; gene_name "DDX11L16"; level 2; tag "overlaps_pseudogene";
chr1 HAVANA transcript 11426 14409 . + . gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; level 2; tag "basic"; tag "TAGENE";
chr1 HAVANA exon 11426 11671 . + . gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 1; exon_id "ENSE00004248702.1"; le>
chr1 HAVANA exon 12010 12227 . + . gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 2; exon_id "ENSE00004248735.1"; le>
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 3; exon_id "ENSE00003582793.1"; le>
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 4; exon_id "ENSE00004248703.1"; le>
chr1 HAVANA gene 12010 13670 . + . gene_id "ENSG00000223972.6"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
grch38.txt
#NAME CHROM STRAND TX_START TX_END EXON_START EXON_END
OR4F5 1 + 69090 70008 69090, 70008,
OR4F16 1 - 685715 686654 685715, 686654,
SAMD11 1 + 925737 944575 925737,925921,930154,931038,935771,939039,939274,941143,942135,942409,942558,943252,943697,943907, 925800,926013,930336,931089,935896,939129,939460,941306,942251,942488,943058,943377,943808,944575,
My question is: Is there any way to create a gene-based file from transcript-based gtf/gff3 by keeping unique genes? I need to collapse transcripts into genes somehow...
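One possible way to do this collapsing is sketched below in Python, under several assumptions that should be checked against the actual files. It keeps a single transcript per gene name, chosen by the GENCODE tag "Ensembl_canonical" (if the file does not carry that tag, the `tag` argument would need changing), and keeps only the first occurrence of each gene name so that no name is duplicated. It also assumes that the starts in grch38.txt are 0-based (GTF start minus 1), that chromosome names are written without the "chr" prefix, and that the file is tab-separated. The function names `parse_attributes`, `collapse_to_genes` and `write_annotation` are made up for this illustration and are not part of SpliceAI.

```python
import re

# Matches GTF attributes such as: gene_name "SAMD11";  or  level 2;
ATTR_RE = re.compile(r'(\S+) "?([^";]+)"?;')


def parse_attributes(field):
    """Return (first value per key, set of all tag values) for a GTF attribute column."""
    attrs, tags = {}, set()
    for key, value in ATTR_RE.findall(field):
        if key == "tag":
            tags.add(value)  # a record can carry several tags
        else:
            attrs.setdefault(key, value)
    return attrs, tags


def collapse_to_genes(gtf_lines, tag="Ensembl_canonical"):
    """Build one row per gene name from the transcript carrying `tag` (an assumption)."""
    chosen = {}  # gene_name -> (chrom, strand, transcript_id); first occurrence wins
    exons = {}   # transcript_id -> list of (start, end), 1-based as in the GTF
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in ("transcript", "exon"):
            continue
        attrs, tags = parse_attributes(cols[8])
        tx_id = attrs["transcript_id"]
        if cols[2] == "exon":
            exons.setdefault(tx_id, []).append((int(cols[3]), int(cols[4])))
        elif tag in tags:
            chrom = cols[0][3:] if cols[0].startswith("chr") else cols[0]
            chosen.setdefault(attrs["gene_name"], (chrom, cols[6], tx_id))

    rows = []
    for name, (chrom, strand, tx_id) in chosen.items():
        spans = sorted(exons.get(tx_id, []))
        if not spans:
            continue
        starts = [s - 1 for s, _ in spans]  # assumed 0-based starts in grch38.txt
        ends = [e for _, e in spans]
        rows.append((name, chrom, strand, starts[0], max(ends),
                     ",".join(map(str, starts)) + ",",
                     ",".join(map(str, ends)) + ","))
    return rows


def write_annotation(rows, path):
    """Write rows with the header used by grch38.txt (tab-separated, assumed)."""
    with open(path, "w") as out:
        out.write("#NAME\tCHROM\tSTRAND\tTX_START\tTX_END\tEXON_START\tEXON_END\n")
        for row in rows:
            out.write("\t".join(map(str, row)) + "\n")


# Example use (the output file name is hypothetical):
# with open("gencode.v47.basic.annotation.gtf") as fh:
#     write_annotation(collapse_to_genes(fh), "gencode_v47_spliceai.txt")
```

Picking one transcript per gene, rather than merging the exons of all transcripts, is a design choice meant to mirror the "canonical" nature of the default file. Genes without a tagged transcript are dropped, so it would be worth comparing the resulting gene count with the default grch38.txt, and perhaps filtering on gene_type if only protein-coding genes are wanted.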
Thank you very much in advance for your help.
Dear benformatics, Thank you very much for your comment. I tried that script previously, but its output is transcript-level, not gene-level. Also, when I replaced the transcript names with gene names, there were many duplicates, unlike in the SpliceAI default file.
Here is the output of the script above:
Best