Question

How to download gene annotation for NCBI Gene IDs

11 weeks ago

Bioinfonext ▴ 480

Hi,

I have a text file of NCBI gene IDs like the ones below, and I would like to extract the gene names for these IDs. Could anyone please help me with how to do this?

LOC139835107
LOC127318759
LOC139835425
LOC139830140
LOC127318761
LOC127318766
LOC127318756
LOC127318755
LOC127318765
LOC127318767
LOC127340762

Many thanks,

R entrez NCBI

updated 11 weeks ago by josev.die ▴ 70 • written 11 weeks ago by Bioinfonext ▴ 480

score 2 · Answer 1 · 2025-08-15

11 weeks ago

Gordon Smyth ★ 8.5k

The names given in your question are already official NCBI gene symbols, as can be seen from the NCBI website: https://www.ncbi.nlm.nih.gov/gene/?term=LOC139835107+LOC127318759+LOC139835425

The Entrez Gene IDs are the same but with the "LOC" prefix removed.

You can download detailed gene-level annotation, such as it is, for these genes (and for all plant genes) from this file:

https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Plants/All_Plants.gene_info.gz

which is a gzipped tab-delimited text file.
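
Once the file is downloaded (for example with `curl -O` and the URL above), the rows for a list of LOC identifiers could be pulled out along these lines. This is only a sketch: `lookup_loc` is a hypothetical helper name, and it assumes one identifier per line in the ID file and the column order shown in the gene_info header (GeneID in column 2, Symbol in 3, description in 9, type_of_gene in 10).

```shell
# Sketch: look up LOC identifiers in a gzipped NCBI gene_info file.
# lookup_loc is a hypothetical helper name, not part of any NCBI tool.
#   $1 = text file with one LOC identifier per line
#   $2 = gzipped gene_info file (e.g. All_Plants.gene_info.gz)
# Prints GeneID, Symbol, description and type_of_gene (columns 2, 3, 9, 10).
lookup_loc() {
  gzip -cd "$2" | awk -F'\t' '
    # First input (the ID list): drop stray whitespace and the LOC prefix,
    # then remember the remaining numeric Entrez Gene ID.
    NR == FNR {
      gsub(/[ \t\r]/, "", $1)
      sub(/^LOC/, "", $1)
      if ($1 != "") want[$1] = 1
      next
    }
    # Second input (gene_info on stdin): keep rows whose GeneID was requested.
    ($2 in want) { print $2 "\t" $3 "\t" $9 "\t" $10 }
  ' "$1" -
}
```

With the IDs saved in a file called `list`, this would be run as `lookup_loc list All_Plants.gene_info.gz`. Rows come out in the order they appear in the gene_info file, not in the order of the ID list, and IDs that are absent from the file are silently skipped.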

For some of the IDs included in the original question, here is the information in the file linked above:

#tax_id GeneID  Symbol  LocusTag        Synonyms        dbXrefs chromosome      map_location    description     type_of_gene    Symbol_from_nomenclature_authority      Full_name_from_nomenclature_authority   Nomenclature_status     Other_designations    Modification_date       Feature_type
4522    139835425       LOC139835425    -       -       -       1       -       uncharacterized LOC139835425    ncRNA   -       -       -       -       20250306        -
4522    139830140       LOC139830140    -       -       -       1       -       uncharacterized LOC139830140    ncRNA   -       -       -       -       20250306        -
4522    127318761       LOC127318761    -       -       -       1       -       protein MEI2-like 3     protein-coding  -       -       -       protein MEI2-like 3     20250306        -
4522    127318766       LOC127318766    -       -       -       1       -       uncharacterized LOC127318766    protein-coding  -       -       -       uncharacterized protein LOC127318766    20250306        -
4522    127318756       LOC127318756    -       -       -       1       -       uncharacterized LOC127318756    ncRNA   -       -       -       -       20250306        -
4522    127318755       LOC127318755    -       -       -       1       -       protein NEN4    protein-coding  -       -       -       protein NEN4    20250306        -

Reply • 11 weeks ago by GenoMax 154k

score 1 · Answer 2 · 2025-08-15

LOC accessions represent uncharacterized genes, so you are not going to find gene names or annotations for some or all of these. The best you can do with these IDs is what is noted in this thread: Gene starts with "LOC" prefix ?

$ more list
LOC139835425
LOC139830140
LOC127318761
LOC127318766
LOC127318756
LOC127318755

$ for i in `cat list`; do esearch -db gene -query ${i} | esummary | xtract -pattern DocumentSummary -element Id,Name,Description,NomenclatureSymbol; done

139835425   LOC139835425    uncharacterized LOC139835425
139830140   LOC139830140    uncharacterized LOC139830140
127318761   LOC127318761    protein MEI2-like 3
127318766   LOC127318766    uncharacterized LOC127318766
127318756   LOC127318756    uncharacterized LOC127318756
127318755   LOC127318755    protein NEN4

score 0 · Answer 3 · 2025-08-15

The number after LOC corresponds to the Entrez ID of a gene. You can read more about the conventions here.

library(rentrez)

loc_list <- list( loc_ids = c("LOC139835107",
                              "LOC127318759",
                              "LOC139835425",
                              "LOC139830140",
                              "LOC127318761",
                              "LOC127318766",
                              "LOC127318756",
                              "LOC127318755",
                              "LOC127318765",
                              "LOC127318767",
                              "LOC127340762"))

df <- as.data.frame(loc_list)

# extract the gene ids
df$e_ids <- gsub("^LOC", "", df$loc_ids)

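# set up empty character columns
df$gene_desc <- NA_character_
df$genetic_src <- NA_character_
df$chr <- NA_character_
df$organism <- NA_character_

# query the Entrez Gene summary for each ID and fill in the columns
for(i in seq_along(df$e_ids)){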
  e_id <- df$e_ids[i]
  esummary <- entrez_summary(db = "gene", id = e_id)
  df$gene_desc[i] <- esummary$description
  df$genetic_src[i] <- esummary$geneticsource
  df$chr[i] <- esummary$chromosome
  df$organism[i] <- esummary$organism$scientificname
}
head(df)

score 0 · Answer 4 · 2025-08-19

Another option is the refseq_description function from the refseqR package:

# Load library
library(refseqR)

# Get gene description 
loc <- "LOC139835107" 
refseq_description(loc)
[1] "uncharacterized LOC139835107"

loc <- "LOC127318761"
refseq_description(loc)
[1] "protein MEI2-like 3"

The apply family of functions can be used to extract all results in a single step.

loc <- c("LOC139835107", "LOC127318759", "LOC139835425", "LOC139830140",
 "LOC127318761", "LOC127318766", "LOC127318756", "LOC127318755",
 "LOC127318765", "LOC127318767", "LOC127340762")

sapply(loc, function(i) refseq_description(i), USE.NAMES = FALSE)
[1] "uncharacterized LOC139835107" "uncharacterized LOC127318759"
[3] "uncharacterized LOC139835425" "uncharacterized LOC139830140"
[5] "protein MEI2-like 3"          "uncharacterized LOC127318766"
[7] "uncharacterized LOC127318756" "protein NEN4"                
[9] "uncharacterized LOC127318765" "uncharacterized LOC127318767"
[11] "uncharacterized LOC127340762"