How to download gene annotation for NCBI gene IDs
4 weeks ago
Bioinfonext ▴ 470

Hi,

I have a text file of NCBI gene IDs like the ones below. I would like to extract the gene name for each of these IDs. Could anyone please explain how to do this?

LOC139835107
LOC127318759
LOC139835425
LOC139830140
LOC127318761
LOC127318766
LOC127318756
LOC127318755
LOC127318765
LOC127318767
LOC127340762

Many thanks,

R entrez NCBI • 11k views
4 weeks ago
Gordon Smyth ★ 8.3k

The names given in your question are already official NCBI gene symbols, as can be seen from the NCBI website: https://www.ncbi.nlm.nih.gov/gene/?term=LOC139835107+LOC127318759+LOC139835425

The Entrez Gene IDs are the same but with the "LOC" prefix removed.

You can download detailed gene-level annotation, such as it is, for these genes (and for all plant genes) from this file:

https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Plants/All_Plants.gene_info.gz

which is a gzipped tab-delimited text file.
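
As a rough sketch of how that file could be filtered for a list of LOC symbols (not part of the original answer): the function name `filter_gene_info` is made up for illustration, the list file is assumed to hold one LOC symbol per line, and the column positions (GeneID, Symbol, description and type_of_gene in columns 2, 3, 9 and 10) are assumed from the gene_info header line.

```shell
# Hypothetical helper: print GeneID, Symbol, description and type_of_gene
# for every gene_info row whose GeneID matches a LOC symbol in a list file.
#   $1 = list file, one LOC symbol per line (e.g. LOC127318761)
#   $2 = uncompressed gene_info file, or "-" to read it from stdin
filter_gene_info() {
  awk -F'\t' -v OFS='\t' '
    # First operand (the list): strip the LOC prefix to get the Entrez Gene ID.
    FILENAME == ARGV[1] {
      id = $1
      sub(/\r$/, "", id)
      sub(/^LOC/, "", id)
      if (id != "") want[id] = 1
      next
    }
    # gene_info rows: column 2 holds the GeneID.
    ($2 in want) { print $2, $3, $9, $10 }
  ' "$1" "$2"
}

# Example usage (assumes the file linked above has already been downloaded):
# gzip -dc All_Plants.gene_info.gz | filter_gene_info list -
```

Matching on the numeric GeneID column rather than on the Symbol column is a design choice here; it relies on the rule that the Entrez Gene ID is the LOC symbol with the prefix removed.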


For some of the IDs included in the original question, this is the information in the file linked above:

#tax_id GeneID  Symbol  LocusTag        Synonyms        dbXrefs chromosome      map_location    description     type_of_gene    Symbol_from_nomenclature_authority      Full_name_from_nomenclature_authority   Nomenclature_status     Other_designations    Modification_date       Feature_type
4522    139835425       LOC139835425    -       -       -       1       -       uncharacterized LOC139835425    ncRNA   -       -       -       -       20250306        -
4522    139830140       LOC139830140    -       -       -       1       -       uncharacterized LOC139830140    ncRNA   -       -       -       -       20250306        -
4522    127318761       LOC127318761    -       -       -       1       -       protein MEI2-like 3     protein-coding  -       -       -       protein MEI2-like 3     20250306        -
4522    127318766       LOC127318766    -       -       -       1       -       uncharacterized LOC127318766    protein-coding  -       -       -       uncharacterized protein LOC127318766    20250306        -
4522    127318756       LOC127318756    -       -       -       1       -       uncharacterized LOC127318756    ncRNA   -       -       -       -       20250306        -
4522    127318755       LOC127318755    -       -       -       1       -       protein NEN4    protein-coding  -       -       -       protein NEN4    20250306        -
4 weeks ago
GenoMax 153k

LOC accessions represent uncharacterized genes, so you are not going to find gene names or annotations for some or all of these. The best you can do with these IDs is described in the thread titled: Gene starts with "LOC" prefix ?

$ more list
LOC139835425
LOC139830140
LOC127318761
LOC127318766
LOC127318756
LOC127318755

$ for i in $(cat list); do esearch -db gene -query "${i}" | esummary | xtract -pattern DocumentSummary -element Id,Name,Description,NomenclatureSymbol; done

139835425   LOC139835425    uncharacterized LOC139835425
139830140   LOC139830140    uncharacterized LOC139830140
127318761   LOC127318761    protein MEI2-like 3
127318766   LOC127318766    uncharacterized LOC127318766
127318756   LOC127318756    uncharacterized LOC127318756
127318755   LOC127318755    protein NEN4
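
The loop above sends one query per ID. A possible batched variant is sketched here; it is not taken from the thread. The helper name `loc_to_id_list` is hypothetical: it strips the LOC prefix and joins the numeric Entrez Gene IDs with commas. The commented usage line assumes EDirect is installed and that `esummary` accepts `-db` together with a comma-separated `-id` list, so check it against your EDirect version before relying on it.

```shell
# Hypothetical helper: turn a file of LOC symbols (one per line) into a
# comma-separated list of numeric Entrez Gene IDs, skipping blank lines.
loc_to_id_list() {
  tr -d '\r' < "$1" | sed -e 's/^LOC//' -e '/^$/d' | paste -sd, -
}

# Example usage (assumption: needs EDirect and network access):
# esummary -db gene -id "$(loc_to_id_list list)" |
#   xtract -pattern DocumentSummary -element Id,Name,Description,NomenclatureSymbol
```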
4 weeks ago

The number after LOC corresponds to the Entrez ID of a gene. You can read more about the conventions here.

library(rentrez)

loc_list <- list( loc_ids = c("LOC139835107",
                              "LOC127318759",
                              "LOC139835425",
                              "LOC139830140",
                              "LOC127318761",
                              "LOC127318766",
                              "LOC127318756",
                              "LOC127318755",
                              "LOC127318765",
                              "LOC127318767",
                              "LOC127340762"))

df <- as.data.frame(loc_list)

# extract the gene ids
df$e_ids <- gsub("^LOC", "", df$loc_ids)

# set up empty columns to be filled in by the loop below
df$gene_desc <- NA
df$genetic_src <- NA
df$chr <- NA
df$organism <- NA

for (i in seq_along(df$e_ids)) {
  e_id <- df$e_ids[i]
  esummary <- entrez_summary(db = "gene", id = e_id)
  df$gene_desc[i] <- esummary$description
  df$genetic_src[i] <- esummary$geneticsource
  df$chr[i] <- esummary$chromosome
  df$organism[i] <- esummary$organism$scientificname
}
head(df)

The LOC IDs in the original question are not human, so this solution will not work in this case. They appear to be mostly for perennial ryegrass (Lolium perenne).


Sorry, I didn't check for the organism. I have updated the code excerpt accordingly.

28 days ago
josev.die ▴ 70

Using the refseq_description function with the refseqR package:

# Load library
library(refseqR)

# Get gene description 
loc <- "LOC139835107" 
refseq_description(loc)
[1] "uncharacterized LOC139835107"

loc <- "LOC127318761"
refseq_description(loc)
[1] "protein MEI2-like 3"

The apply family of functions can be used to extract all results in a single step.

loc <- c("LOC139835107", "LOC127318759", "LOC139835425", "LOC139830140",
 "LOC127318761", "LOC127318766", "LOC127318756", "LOC127318755",
 "LOC127318765", "LOC127318767", "LOC127340762")

sapply(loc, refseq_description, USE.NAMES = FALSE)
[1] "uncharacterized LOC139835107" "uncharacterized LOC127318759"
[3] "uncharacterized LOC139835425" "uncharacterized LOC139830140"
[5] "protein MEI2-like 3"          "uncharacterized LOC127318766"
[7] "uncharacterized LOC127318756" "protein NEN4"                
[9] "uncharacterized LOC127318765" "uncharacterized LOC127318767"
[11] "uncharacterized LOC127340762"