_1 in gene_ids of human T2T assembly GTF file
2.1 years ago

Hi all,

Any idea why the gene_ids in NCBI's GTF file for the T2T human genome assembly end in "_1"?


NC_060925.1     BestRefSeq      gene    52979   54612   .       -      .       gene_id "LOC101928626_1"; transcript_id ""; db_xref "GeneID:101928626"; description "uncharacterized LOC101928626"; gbkey "Gene"; gene "LOC101928626"; gene_biotype "lncRNA"; 
NC_060925.1     BestRefSeq      transcript      52979   54612   .       -       .      gene_id "LOC101928626_1"; transcript_id "NR_125957.1"; db_xref "GeneID:101928626"; exception "annotated by transcript or proteomic data"; gbkey "ncRNA"; gene "LOC101928626"; inference "similar to RNA sequence (same species):RefSeq:NR_125957.1"; note "The RefSeq transcript has 2 substitutions, 1 non-frameshifting indel compared to this genomic sequence"; product "uncharacterized LOC101928626"; transcript_biotype "lnc_RNA"; 
NC_060925.1     BestRefSeq      exon    54522   54612   .       -       .       gene_id "LOC101928626_1"; transcript_id "NR_125957.1"; db_xref "GeneID:101928626"; exception "annotated by transcript or proteomic data"; gene "LOC101928626"; inference "similar to RNA sequence (same species):RefSeq:NR_125957.1"; note "The RefSeq transcript has 2 substitutions, 1 non-frameshifting indel compared to this genomic sequence"; product "uncharacterized LOC101928626"; transcript_biotype "lnc_RNA"; exon_number "1"; 
NC_060925.1     BestRefSeq      gene    111940  112877  .       -      .       gene_id "OR4F29_1"; transcript_id ""; db_xref "GeneID:729759"; db_xref "HGNC:HGNC:31275"; description "olfactory receptor family 4 subfamily F member 29"; gbkey "Gene"; gene "OR4F29"; gene_biotype "protein_coding"; gene_synonym "OR7-21"; 
NC_060925.1     BestRefSeq   transcript      111940  112877  .       -       .       gene_id "OR4F29_1"; transcript_id "NM_001005221.2"; db_xref "GeneID:729759"; exception "annotated by transcript or proteomic data"; gbkey "mRNA"; gene "OR4F29"; inference "similar to RNA sequence, mRNA (same species):RefSeq:NM_001005221.2"; note "The RefSeq transcript has 9 substitutions, 1 frameshift compared to this genomic sequence"; product "olfactory receptor family 4 subfamily F member 29"; tag "RefSeq Select"; transcript_biotype "mRNA";

This breaks some GO enrichment/GSEA analyses. Is it safe to just remove the "_1" suffixes?


gtf T2T • 1.1k views
2.1 years ago
vkkodali_ncbi ★ 3.7k

Thank you for bringing this up! The _# suffix is a counter added to ensure uniqueness, but because of how our pipeline processed the data, the counter was applied far more widely than intended, and we're working on a fix. Unfortunately, it's not as simple as dropping every _# suffix, because some genes are intentionally annotated in multiple locations (e.g. in the PAR regions of chrX and chrY). Dropping only the _1 suffixes is largely OK. The most reliable option is to leave the GTF file as-is and either rely on the gene attribute (gene "ABCD") or drop the _# suffix in a post-processing step before running the GO enrichment/GSEA analysis.
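For anyone who wants to do this in a post-processing step, here is a minimal Python sketch of the two options above. It assumes a tab-delimited GTF with `key "value";` attributes like the records in the question. The helper names (`parse_attributes`, `strip_first_copy_suffix`, `gene_symbol`) are made up for illustration and are not part of any NCBI tool. Only a trailing `_1` is removed, so higher counters such as `_2` stay intact for genes that are annotated in more than one location.

```python
import re

# Matches key "value" pairs in the 9th (attribute) column of a GTF record.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(attr_field):
    """Return GTF attributes as a dict; for repeated keys the first value wins."""
    attrs = {}
    for key, value in ATTR_RE.findall(attr_field):
        attrs.setdefault(key, value)
    return attrs


def strip_first_copy_suffix(gene_id):
    """Drop a trailing '_1' only; other counters (e.g. '_2') are left untouched."""
    return re.sub(r"_1$", "", gene_id)


def gene_symbol(gtf_line):
    """Prefer the 'gene' attribute; otherwise fall back to gene_id without '_1'."""
    fields = gtf_line.rstrip("\n").split("\t", 8)
    attrs = parse_attributes(fields[8])
    if attrs.get("gene"):
        return attrs["gene"]
    return strip_first_copy_suffix(attrs.get("gene_id", ""))


if __name__ == "__main__":
    record = (
        "NC_060925.1\tBestRefSeq\tgene\t111940\t112877\t.\t-\t.\t"
        'gene_id "OR4F29_1"; transcript_id ""; db_xref "GeneID:729759"; '
        'gene "OR4F29"; gene_biotype "protein_coding";'
    )
    print(gene_symbol(record))
```

Using the `gene` attribute where it exists avoids guessing whether a suffix is a counter or part of a real symbol. The `_1` fallback is only a heuristic and could, in principle, clip a symbol that legitimately ends in `_1`.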

UPDATE (07-13-2022): The files on the FTP are now fixed.


awesome, thanks!

