Question

_1 in gene_ids of human T2T assembly gtf file


22 months ago

grant.hovhannisyan ★ 2.6k

Hi all,

Any idea why the gene_ids in NCBI's GTF file for the T2T human genome assembly end with "_1"?

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz

NC_060925.1     BestRefSeq      gene    52979   54612   .       -      .       gene_id "LOC101928626_1"; transcript_id ""; db_xref "GeneID:101928626"; description "uncharacterized LOC101928626"; gbkey "Gene"; gene "LOC101928626"; gene_biotype "lncRNA"; 
NC_060925.1     BestRefSeq      transcript      52979   54612   .       -       .      gene_id "LOC101928626_1"; transcript_id "NR_125957.1"; db_xref "GeneID:101928626"; exception "annotated by transcript or proteomic data"; gbkey "ncRNA"; gene "LOC101928626"; inference "similar to RNA sequence (same species):RefSeq:NR_125957.1"; note "The RefSeq transcript has 2 substitutions, 1 non-frameshifting indel compared to this genomic sequence"; product "uncharacterized LOC101928626"; transcript_biotype "lnc_RNA"; 
NC_060925.1     BestRefSeq      exon    54522   54612   .       -       .       gene_id "LOC101928626_1"; transcript_id "NR_125957.1"; db_xref "GeneID:101928626"; exception "annotated by transcript or proteomic data"; gene "LOC101928626"; inference "similar to RNA sequence (same species):RefSeq:NR_125957.1"; note "The RefSeq transcript has 2 substitutions, 1 non-frameshifting indel compared to this genomic sequence"; product "uncharacterized LOC101928626"; transcript_biotype "lnc_RNA"; exon_number "1"; 
NC_060925.1     BestRefSeq      gene    111940  112877  .       -      .       gene_id "OR4F29_1"; transcript_id ""; db_xref "GeneID:729759"; db_xref "HGNC:HGNC:31275"; description "olfactory receptor family 4 subfamily F member 29"; gbkey "Gene"; gene "OR4F29"; gene_biotype "protein_coding"; gene_synonym "OR7-21"; 
NC_060925.1     BestRefSeq   transcript      111940  112877  .       -       .       gene_id "OR4F29_1"; transcript_id "NM_001005221.2"; db_xref "GeneID:729759"; exception "annotated by transcript or proteomic data"; gbkey "mRNA"; gene "OR4F29"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001005221.2"; note "The RefSeq transcript has 9 substitutions, 1 frameshift compared to this genomic sequence"; product "olfactory receptor family 4 subfamily F member 29"; tag "RefSeq Select"; transcript_biotype "mRNA";

This breaks some GO enrichment/GSEA analyses. Is it safe to simply remove these "_1" suffixes?

cheers

gtf T2T • 964 views

updated 21 months ago by vkkodali_ncbi ★ 3.7k • written 22 months ago by grant.hovhannisyan ★ 2.6k

score 3 · Accepted Answer · 2022-06-23

Thank you for bringing this up! The _# suffix is a counter added to keep gene_ids unique, but an unanticipated side effect of how our pipeline processed the data caused the counter to be applied more often than intended, and we’re working on a fix. Unfortunately, it’s not as simple as dropping every _# suffix, because some genes are intentionally annotated in multiple locations (e.g. genes in the PAR on chrX and chrY). Dropping only the _1 suffixes is largely OK. The most reliable option, though, is to leave the GTF file as-is and either rely on the gene=ABCD attribute or drop the _# suffix in a post-processing step before running the GO enrichment/GSEA analysis.
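For readers who want to try the post-processing route, here is a minimal Python sketch, not an official NCBI tool. It assumes a standard tab-separated GTF whose ninth column holds attributes such as gene_id "OR4F29_1" and gene "OR4F29", as in the records quoted in the question. The function names gene_symbol and build_id_map are hypothetical, chosen only for illustration. It prefers the gene attribute and only falls back to stripping a trailing _# counter when that attribute is missing.

```python
import re
from typing import Dict, Iterable, Optional

# Patterns for the GTF attribute column. `gene "` cannot match `gene_id "`
# because the attribute name must be followed directly by a space and a quote.
GENE_ATTR = re.compile(r'(?:^|;\s*)gene "([^"]*)"')
GENE_ID_ATTR = re.compile(r'(?:^|;\s*)gene_id "([^"]*)"')
COUNTER_SUFFIX = re.compile(r'_\d+$')  # assumed shape of the _# counter


def gene_symbol(attributes: str) -> Optional[str]:
    """Return the gene symbol for one GTF attribute string (hypothetical helper)."""
    match = GENE_ATTR.search(attributes)
    if match:
        return match.group(1)
    # Fallback: strip a trailing _<number> from gene_id. This merges copies
    # that are intentionally annotated at several locations.
    match = GENE_ID_ATTR.search(attributes)
    if match:
        return COUNTER_SUFFIX.sub("", match.group(1))
    return None


def build_id_map(lines: Iterable[str]) -> Dict[str, str]:
    """Map each suffixed gene_id to its symbol, using only 'gene' records."""
    mapping: Dict[str, str] = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        gene_id = GENE_ID_ATTR.search(fields[8])
        symbol = gene_symbol(fields[8])
        if gene_id and symbol:
            mapping[gene_id.group(1)] = symbol
    return mapping
```

To use it on the compressed file, open it with gzip.open(path, "rt") and pass the file handle to build_id_map, then translate the IDs in your count table before running the enrichment step. Note that the mapping is many-to-one for genes annotated at multiple locations, so decide how to combine their values before the enrichment analysis.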

UPDATE (07-13-2022): The files on the FTP are now fixed.