Question

Trying to retrieve set of protein sequences based on gene IDs

19 months ago

JONATHAN • 0

Hello all,

I have a large dataset of NCBI gene IDs, and need to retrieve their corresponding protein sequences. Are there any data mining tools that can easily do this? Thanks in advance.

gene NCBI protein • 1.6k views

updated 16 months ago by josev.die ▴ 60 • written 19 months ago by JONATHAN • 0

Two quick points:

Point 1: Ensembl is pretty good for this as well, and you can also use the UCSC Table Browser. Point 2: keep in mind that a gene ID does not identify a single protein, because alternative splicing can produce several isoforms per gene. To describe one unique sequence, you need to specify both the gene and the transcript isoform you have in mind.

19 months ago by LauferVA 4.2k

score 1 · Answer 1 · 2022-09-15

Hi,
You can use NCBI Datasets. For example, let's say you have a text file with five NCBI Gene IDs (gene_ids.txt):
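The contents of gene_ids.txt are not shown in the post. Here is a minimal sketch for creating such a file, assuming the same five Gene IDs that the R answer further down uses (672, 7157, 7124, 348, 7422), one ID per line. The IDs are borrowed from that answer and may not be the exact ones the author had in the file.

```shell
# Write one NCBI Gene ID per line into gene_ids.txt.
# Assumption: these are the five IDs from the R answer in this thread.
printf '%s\n' 672 7157 7124 348 7422 > gene_ids.txt

# Show the file to confirm the layout.
cat gene_ids.txt
```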

You can use this list as input for datasets and download a gene data package. By default, the package includes the protein sequences as well as the transcript and gene sequences (plus metadata). If you want to restrict it to proteins only, you can use this command:

datasets download gene gene-id --inputfile gene_ids.txt --exclude-rna --exclude-gene --filename genes.zip

(Newer releases of the datasets command-line tool appear to have replaced the --exclude-* flags with --include, e.g. --include protein. Check datasets download gene gene-id --help for the version you have installed.)

After you unzip the file (I unzipped it to the folder gene_list), you can find all protein isoforms in the file protein.faa. Here's the folder structure:

gene_list/
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- data_report.jsonl
        |-- data_table.tsv
        |-- dataset_catalog.json
        `-- protein.faa

2 directories, 5 files
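
A quick sanity check after unzipping could look like the sketch below. It assumes the file and folder names used above (genes.zip, gene_list); the two helper functions are just illustrative names, not part of the datasets tool. Since every isoform gets its own record, expect more records than gene IDs.

```shell
# Count FASTA records (header lines start with ">") in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# List the accession of each record (first word of the header, without ">").
list_fasta_accessions() {
    grep '^>' "$1" | cut -c2- | cut -d' ' -f1
}

# Paths follow the answer above: genes.zip unzipped into gene_list/.
zip_file="genes.zip"
faa="gene_list/ncbi_dataset/data/protein.faa"

# Unzip only if the package is present in the current directory.
if [ -f "$zip_file" ]; then
    unzip -o -q "$zip_file" -d gene_list
fi

# Report the number of protein records and show the first few accessions.
if [ -f "$faa" ]; then
    echo "protein records: $(count_fasta_records "$faa")"
    list_fasta_accessions "$faa" | head
fi
```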

Let me know if you have any questions. :)

score 0 · Answer 2 · 2022-09-08

Using Entrez Direct (EDirect) this is simple to do:

If you have numeric gene IDs (output truncated for space):

$ esearch -db gene -query 945768 | elink -target protein | efetch -format fasta
>NP_415777.1 tryptophan synthase subunit beta [Escherichia coli str. K-12 substr. MG1655]
MTTLLNPYFGEFGGMYVPQILMPALRQLEEAFVSAQKDPEFQAQFNDLLKNYAGRPTALTKCQNITAGTN
>AAC74343.1 tryptophan synthase subunit beta [Escherichia coli str. K-12 substr. MG1655]
MTTLLNPYFGEFGGMYVPQILMPALRQLEEAFVSAQKDPEFQAQFNDLLKNYAGRPTALTKCQNITAGTN
TTLYLKREDLLHGGAHKTNQVLGQALLAKRMGKTEIIAETGAGQHGVASALASALLGLKCRIYMGAKDVE

In case you have accession numbers of proteins (put one ID per line in a file):

$  more id
ABA43103.1 
ABA43104.1
ABA43105.1  

$ epost -db protein -input id | efetch -format fasta
>ABA43105.1 nonstructural protein, partial [Norovirus Hu/GI/N9/2003/Irl]
DRNLLPEFVNDDGV
>ABA43104.1 TrpB, partial [Kitasatospora aureofaciens]
NNVLGQALLTRRMGKTRIIAETGAGQHGVATATACALFGFDCTIYMGEVDTERQALNVARMRMLGAEVIA
VKSGSRTLKDAINEAFRDWVANVDSTHYLFGTVAGPHPFPMMVRDFHRIIGVEARQQVLDRTGRLPDAVV
ACVGGGSNAIG
>ABA43103.1 TrpB, partial [Streptomyces lydicus]
NNVLGKALLTKRMGKTRVIAETGAGQHGVATATACALFGLECTIYMGEIDTQRQALNVARMRMLGAEVIA
VKSGSRTLKDAINEAFRDWVANVDRTHYLFGTVAGPHPFPALVRDFHRVIGVEARRQLLERAGRLPDAAL
ACVGGGSNAIG
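
For a large list of gene IDs, the two patterns above can probably be combined: post the IDs from a file with epost -db gene, then link to proteins and fetch. The sketch below is untested against NCBI. The function name and the batch size of 200 are my own assumptions, and it splits the input into batches only to keep each request small.

```shell
# Fetch protein FASTA for all gene IDs in a file (one numeric ID per line).
# Hypothetical helper; requires EDirect (epost, elink, efetch) on the PATH.
# $1 = file with gene IDs, $2 = IDs per batch (assumed default: 200)
fetch_proteins_for_genes() {
    ids_file="$1"
    chunk_size="${2:-200}"
    tmpdir=$(mktemp -d)

    # Split the ID list into batches of at most chunk_size lines.
    split -l "$chunk_size" "$ids_file" "$tmpdir/chunk_"

    for chunk in "$tmpdir"/chunk_*; do
        # Skip the unexpanded glob if the input file was empty.
        [ -e "$chunk" ] || continue
        epost -db gene -input "$chunk" | elink -target protein | efetch -format fasta
    done

    rm -rf "$tmpdir"
}

# Usage (not run here):
#   fetch_proteins_for_genes gene_ids.txt > proteins.fasta
```

As in the single-ID example, this returns every linked protein record, so the same sequence may show up under both a RefSeq and a GenBank accession.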

score 0 · Answer 3 · 2022-11-26

For R users:

#Dependencies
library(rentrez)

Define some functions:

get_protids_from_geneids <- function(gene_ids) {
  # Link the gene IDs to records in the protein database.
  protein_elink <- rentrez::entrez_link(dbfrom = "gene", id = gene_ids, db = "protein")

  # Keep the protein IDs from the RefSeq link set.
  protein_ids <- protein_elink$links$gene_protein_refseq
  protein_ids
}

make_aa_fasta <- function(prot_ids, nameFile) {
  # Make a multi-FASTA file with one record per protein ID.
  sapply(prot_ids, function(x) {
    protein_esummary <- rentrez::entrez_summary(db = "protein", id = x)
    protein_fasta <- rentrez::entrez_fetch(db = "protein", id = protein_esummary$uid, rettype = "fasta")
    # Append the amino acid sequence to "<nameFile>.fasta".
    write(protein_fasta, file = paste(nameFile, ".fasta", sep = ""), append = TRUE)
  })
}

Following the MirianT_NCBI example, load the NCBI Gene IDs into a vector:

# Define a vector with gene IDs
gene_ids <- c('672', '7157', '7124', '348', '7422')

# Get the protein ids 
prot_ids <- get_protids_from_geneids(gene_ids)
#length(prot_ids)

# Make the amino acid fasta file 
make_aa_fasta(prot_ids, "my_proteins")

Keep length(prot_ids) below 450 and it will work. For larger sets, split the gene IDs into batches and run each batch separately.