SpliceAI custom gene annotation file
3 months ago
ClkElf ▴ 50

Dear all, I would like to run SpliceAI on a set of variants, but its default gene annotation file (grch38.txt) was built from the GENCODE v24 canonical annotation. Since GENCODE v47 is the latest release, I would like to build an equivalent annotation file from it, using the SpliceAI default file as a template. My problem is that GENCODE GTF/GFF3 files are transcript-based, whereas grch38.txt is gene-based, with one row per gene. As an example, the structure of gencode.v47.basic.annotation.gtf and of grch38.txt is shown below, in that order.

gencode.v47.basic.annotation.gtf

##description: evidence-based annotation of the human genome (GRCh38), version 47 (Ensembl 113)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2024-07-19
chr1    HAVANA  gene    11121   24894   .       +       .       gene_id "ENSG00000290825.2"; gene_type "lncRNA"; gene_name "DDX11L16"; level 2; tag "overlaps_pseudogene";
chr1    HAVANA  transcript      11426   14409   .       +       .       gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; level 2; tag "basic"; tag "TAGENE";
chr1    HAVANA  exon    11426   11671   .       +       .       gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 1; exon_id "ENSE00004248702.1"; le>
chr1    HAVANA  exon    12010   12227   .       +       .       gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 2; exon_id "ENSE00004248735.1"; le>
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 3; exon_id "ENSE00003582793.1"; le>
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000290825.2"; transcript_id "ENST00000832828.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-264"; exon_number 4; exon_id "ENSE00004248703.1"; le>
chr1    HAVANA  gene    12010   13670   .       +       .       gene_id "ENSG00000223972.6"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";

grch38.txt

#NAME   CHROM   STRAND  TX_START        TX_END  EXON_START      EXON_END
OR4F5   1       +       69090   70008   69090,  70008,
OR4F16  1       -       685715  686654  685715, 686654,
SAMD11  1       +       925737  944575  925737,925921,930154,931038,935771,939039,939274,941143,942135,942409,942558,943252,943697,943907,      925800,926013,930336,931089,935896,939129,939460,941306,942251,942488,943058,943377,943808,944575,

My question is: is there a way to create a gene-based file from a transcript-based GTF/GFF3, keeping exactly one entry per gene? I need to collapse transcripts into genes somehow.
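One possible way to collapse transcripts into genes is to keep a single representative transcript per gene and write its exons in the grch38.txt layout. The following is a minimal sketch, not a tested pipeline, and it rests on several assumptions: that transcript lines in the GTF carry a tag "Ensembl_canonical" marking one transcript per gene, that start coordinates in grch38.txt are 0-based (GTF start minus 1, as the coordinates quoted in this thread suggest), and that the "chr" prefix should be dropped to match the default file. The function names `parse_attributes` and `collapse_to_genes` are hypothetical.

```python
from collections import defaultdict

def parse_attributes(field):
    """Split a GTF attribute column into (first value per key, set of tags)."""
    attrs, tags = {}, set()
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        value = value.strip('"')
        if key == "tag":
            tags.add(value)          # a line may carry several tags
        else:
            attrs.setdefault(key, value)
    return attrs, tags

def collapse_to_genes(gtf_lines, tag="Ensembl_canonical"):
    """Return one grch38.txt-style row per gene, built from the tagged transcript.

    Assumption: the tag identifies one representative transcript per gene.
    Genes without a tagged transcript are skipped.
    """
    chosen = {}                  # gene_id -> chosen transcript_id
    info = {}                    # transcript_id -> (name, chrom, strand, start, end)
    exons = defaultdict(list)    # transcript_id -> [(start, end), ...]
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in ("transcript", "exon"):
            continue
        chrom, feature, start, end, strand = cols[0], cols[2], cols[3], cols[4], cols[6]
        attrs, tags = parse_attributes(cols[8])
        tx = attrs["transcript_id"]
        if feature == "exon":
            exons[tx].append((int(start), int(end)))
        elif tag in tags and attrs["gene_id"] not in chosen:
            chosen[attrs["gene_id"]] = tx
            if chrom.startswith("chr"):
                chrom = chrom[3:]    # assumed: match the "1" style of grch38.txt
            info[tx] = (attrs["gene_name"], chrom, strand, int(start), int(end))
    rows = []
    for tx in chosen.values():
        name, chrom, strand, tx_start, tx_end = info[tx]
        ex = sorted(exons[tx])
        # Assumed convention: starts are 0-based (GTF start - 1), ends unchanged.
        starts = "".join(f"{s - 1}," for s, _ in ex)
        ends = "".join(f"{e}," for _, e in ex)
        rows.append("\t".join([name, chrom, strand, str(tx_start - 1),
                               str(tx_end), starts, ends]))
    return rows
```

Passing an open file handle for the GTF and writing the `#NAME ...` header line followed by the returned rows would give a file in the same layout as grch38.txt. Whether SpliceAI behaves well with every gene type in the newer annotation is not something this sketch checks.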

Thank you very much in advance for your help.

GENCODE SpliceAI • 358 views

Dear benformatics, thank you very much for your comment. I tried that script previously, but its output is transcript-level, not gene-level. Also, when I replaced the transcript names with gene names, there were many duplicates, unlike in the SpliceAI default file.

Here is the output of that script:

#NAME   CHROM   STRAND  TX_START    TX_END  EXON_START  EXON_END
ENST00000450305.2   chr1    +   12009   13670   12009,12178,12612,12974,13220,13452,    12057,12227,12697,13052,13374,13670,
ENST00000488147.2   chr1    -   14695   24886   14695,14969,15795,16606,16857,17232,17605,17914,18267,24737,    14829,15038,15947,16765,17055,17368,17742,18061,18366,24886,
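If the transcript-level table is already at hand, the duplicates that appear after renaming could be removed by keeping one row per gene. The sketch below assumes a tab-separated table in the layout shown above and a user-supplied dictionary mapping transcript IDs to gene names (for instance, built from the GTF). It keeps the transcript with the longest TX_START to TX_END span, which is only a heuristic chosen for illustration and not necessarily how the SpliceAI default file was built. The function name `dedupe_by_gene` is hypothetical.

```python
def dedupe_by_gene(rows, tx_to_gene):
    """Keep one row per gene from a transcript-level table.

    rows: iterable of tab-separated lines (NAME, CHROM, STRAND, TX_START,
          TX_END, EXON_START, EXON_END); lines starting with '#' are skipped.
    tx_to_gene: dict mapping transcript ID -> gene name (user-supplied).
    Heuristic (an assumption): the transcript with the longest span wins.
    """
    best = {}  # gene name -> (span, fields)
    for line in rows:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        gene = tx_to_gene.get(fields[0])
        if gene is None:
            continue                      # transcript missing from the mapping
        span = int(fields[4]) - int(fields[3])
        if gene not in best or span > best[gene][0]:
            fields[0] = gene              # replace transcript ID with gene name
            best[gene] = (span, fields)
    return ["\t".join(fields) for _, fields in best.values()]
```

The "chr" prefix in the CHROM column is left untouched here; it would need to be stripped separately if the file has to match the chromosome naming of grch38.txt.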

Best
