What is the best way to get a set of all of the genes for a genome via entrez?
3.7 years ago
DNAlias ▴ 40

What is the best way to get a set of all of the genes for a genome via entrez?

I tried using the term "GCF_009858895[AACC]" for the "gene" database, but got no hits.

entrez • 846 views
3.7 years ago
GenoMax 141k

Using Entrez Direct (EDirect):

A. If you just need names

$ esearch -db assembly -query "GCF_009858895" | elink -target genome | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description
ORF1ab  ORF1a polyprotein;ORF1ab polyprotein
S   surface glycoprotein
N   nucleocapsid phosphoprotein
ORF7a   ORF7a protein
ORF6    ORF6 protein
ORF3a   ORF3a protein
ORF7b   ORF7b
ORF10   ORF10 protein
M   membrane glycoprotein
E   envelope protein
ORF8    ORF8 protein

B. If you want additional information

$ esearch -db assembly -query "GCF_009858895" | elink -target genome | elink -target gene | efetch -format tabular
tax_id  Org_name    GeneID  CurrentID   Status  Symbol  Aliases description other_designations  map_location    chromosome  genomic_nucleotide_accession.version    start_position_on_the_genomic_accession end_position_on_the_genomic_accession   orientation exon_count  OMIM
2697049 Severe acute respiratory syndrome coronavirus 2 43740578    0   live    ORF1ab  GU280_gp01  ORF1a polyprotein;ORF1ab polyprotein    ORF1a polyprotein;ORF1ab polyprotein            NC_045512.2 266 21555   plus    0
2697049 Severe acute respiratory syndrome coronavirus 2 43740568    0   live    S   GU280_gp02, spike glycoprotein  surface glycoprotein    surface glycoprotein        NC_045512.2 21563   25384   plus    0
2697049 Severe acute respiratory syndrome coronavirus 2 43740575    0   live    N   GU280_gp10  nucleocapsid phosphoprotein nucleocapsid phosphoprotein     NC_045512.2 28274   29533   plus    0
2697049 Severe acute respiratory syndrome coronavirus 2 43740573    0   live    ORF7a   GU280_gp07  ORF7a protein   ORF7a protein           NC_045512.2 27394   27759   plus    0
2697049 Severe acute respiratory syndrome coronavirus 2 43740572    0   live    ORF6    GU280_gp06  ORF6 protein    ORF6 protein            NC_045512.2 27202   27387   plus    0
2697049 Severe acute respiratory syndrome coronavirus 2 43740569    0   live    ORF3a   GU280_gp03  ORF3a protein   ORF3a protein           NC_045512.2 25393   26220   plus    0
2697049 Severe acute respiratory syndrome coronavirus 2 43740574    0   live    ORF7b   GU280_gp08  ORF7b   ORF7b           NC_045512.2 27756   27887   plus    0
2697049 Severe acute respiratory syndrome coronavirus 2 43740576    0   live    ORF10   GU280_gp11  ORF10 protein   ORF10 protein           NC_045512.2 29558   29674   plus    0
2697049 Severe acute respiratory syndrome coronavirus 2 43740571    0   live    M   GU280_gp05  membrane glycoprotein   membrane glycoprotein           NC_045512.2 26523   27191   plus    0
2697049 Severe acute respiratory syndrome coronavirus 2 43740570    0   live    E   GU280_gp04  envelope protein    envelope protein            NC_045512.2 26245   26472   plus    0
2697049 Severe acute respiratory syndrome coronavirus 2 43740577    0   live    ORF8    GU280_gp09  ORF8 protein    ORF8 protein            NC_045512.2 27894   28259   plus    0
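
If you only need a few columns from the tabular output (say symbol, accession, start and end), something along these lines should work. This is a rough sketch, not part of EDirect: `gene_coords` and `genes.tsv` are names made up for illustration, and it assumes the output is tab-delimited with the header row shown above. The two-row file is a stand-in so the snippet runs offline; in practice you would redirect the `efetch -format tabular` output into the file instead.

```shell
# Stand-in for: esearch ... | elink ... | efetch -format tabular > genes.tsv
# (hypothetical file name; header and two rows copied from the output above)
printf 'tax_id\tOrg_name\tGeneID\tCurrentID\tStatus\tSymbol\tAliases\tdescription\tother_designations\tmap_location\tchromosome\tgenomic_nucleotide_accession.version\tstart_position_on_the_genomic_accession\tend_position_on_the_genomic_accession\torientation\texon_count\tOMIM\n' > genes.tsv
printf '2697049\tSevere acute respiratory syndrome coronavirus 2\t43740568\t0\tlive\tS\tGU280_gp02, spike glycoprotein\tsurface glycoprotein\tsurface glycoprotein\t\t\tNC_045512.2\t21563\t25384\tplus\t0\t\n' >> genes.tsv
printf '2697049\tSevere acute respiratory syndrome coronavirus 2\t43740573\t0\tlive\tORF7a\tGU280_gp07\tORF7a protein\tORF7a protein\t\t\tNC_045512.2\t27394\t27759\tplus\t0\t\n' >> genes.tsv

# gene_coords (hypothetical helper): print symbol, accession, start, end.
# Columns are looked up by header name, so the order of columns does not matter.
gene_coords() {
  awk '
    BEGIN { FS = OFS = "\t" }
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
      print $col["Symbol"],
            $col["genomic_nucleotide_accession.version"],
            $col["start_position_on_the_genomic_accession"],
            $col["end_position_on_the_genomic_accession"]
    }
  ' "$1"
}

gene_coords genes.tsv
```

Looking columns up by name rather than by position is a deliberate choice here, since it keeps working if the column order ever differs from what is shown in this post.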

So does this mean that you are extracting the genes from the genome entry instead of the assembly entry? There are some assemblies that don't seem to have links to genomes.


In this case the accession you provided is for an assembly, so linking it to a genome and then to the gene database worked. If an assembly has no genome link and/or no annotation, you can't get gene names for it this way.
