Question

I need to retrieve a set of protein and mRNA sequences

7 months ago

george • 0

Hello, I have a file listing proteins and mRNAs, and I need a way to retrieve the sequences for all of them (about 13,000 proteins and 12,000 mRNAs). I have the Entrez Gene ID for each one.

Protein mRNA sequences • 574 views

updated 7 months ago by Ram 43k • written 7 months ago by george • 0

What have you tried?

7 months ago by Ram 43k

I tried using curl because I saw a similar post here, but that didn't go well, so now I am trying GenoMax's answer below.

7 months ago by george • 0

I tried using curl

curl is a command-line utility that can perform a wide range of operations. What exactly did you do, what did you expect, and what errors did you run into? Always include these details when you ask for help.

7 months ago by Ram 43k

score 0 · Answer 1 · 2023-09-20

Using Entrez Direct (EDirect), for RefSeq RNAs (shown here for gene ID 7157, TP53):

$ esearch -db gene -query '7157' | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta | grep ">"
>NR_176326.1 Homo sapiens tumor protein p53 (TP53), transcript variant 14, non-coding RNA
>NM_001407264.1 Homo sapiens tumor protein p53 (TP53), transcript variant 10, mRNA
>NM_001407263.1 Homo sapiens tumor protein p53 (TP53), transcript variant 9, mRNA
>NM_001407268.1 Homo sapiens tumor protein p53 (TP53), transcript variant 12, mRNA
>NM_001407269.1 Homo sapiens tumor protein p53 (TP53), transcript variant 12, mRNA
>NM_001407271.1 Homo sapiens tumor protein p53 (TP53), transcript variant 13, mRNA
>NM_001407270.1 Homo sapiens tumor protein p53 (TP53), transcript variant 13, mRNA

For proteins:

$ esearch -db gene -query '7157' | elink -db gene -target protein -name gene_protein_refseq | efetch -format fasta | grep ">"
>NP_001394193.1 cellular tumor antigen p53 isoform a [Homo sapiens]
>NP_001394192.1 cellular tumor antigen p53 isoform g [Homo sapiens]
>NP_001394197.1 cellular tumor antigen p53 isoform b [Homo sapiens]
>NP_001394198.1 cellular tumor antigen p53 isoform i [Homo sapiens]
>NP_001394200.1 cellular tumor antigen p53 isoform i [Homo sapiens]
>NP_001394199.1 cellular tumor antigen p53 isoform b [Homo sapiens]
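
To run the same pipeline over a whole list of gene IDs, a loop along these lines might work. It is only a sketch: the function name fetch_linked_fasta and the file names are made up for illustration, the [UID] field tag is an addition to the query shown above (meant to restrict the match to the Gene ID itself), and the one-second pause is a conservative guess rather than an NCBI requirement. It assumes EDirect is installed and on your PATH.

```shell
#!/usr/bin/env bash
# Sketch: fetch FASTA records linked to each Entrez Gene ID listed in a file.
# fetch_linked_fasta is a hypothetical helper, not part of EDirect.

fetch_linked_fasta() {
  local ids_file=$1   # text file with one Entrez Gene ID per line
  local target=$2     # target database, e.g. nuccore or protein
  local link=$3       # link name, e.g. gene_nuccore_refseqrna
  local out=$4        # output FASTA file

  : > "$out"          # start from an empty output file
  local gene_id
  while IFS= read -r gene_id; do
    [ -z "$gene_id" ] && continue   # skip blank lines
    # </dev/null stops esearch from swallowing the rest of ids_file via stdin
    esearch -db gene -query "${gene_id}[UID]" </dev/null |
      elink -db gene -target "$target" -name "$link" |
      efetch -format fasta >> "$out"
    sleep 1   # assumed pause between genes to go easy on the NCBI servers
  done < "$ids_file"
}
```

With the IDs in a file called gene_ids.txt (an assumed name), the two runs would be:

fetch_linked_fasta gene_ids.txt nuccore gene_nuccore_refseqrna mrna.fasta
fetch_linked_fasta gene_ids.txt protein gene_protein_refseq protein.fasta

For roughly 13,000 genes this will take several hours, so the datasets route in Option 2 is probably the faster one.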

Option 2:

Use NCBI Datasets: https://www.ncbi.nlm.nih.gov/datasets/gene/

Click the "Search by identifiers" tab and paste in your identifiers (with this many IDs you may need to do it in batches).

On the results page, select the genes, then click Downloads --> Download package --> choose the gene/protein/transcript sequences --> Download data package.
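
If the web form will not take the whole list at once, the ID file can be cut into smaller pieces first. A minimal sketch using the standard split utility; the batch size of 1000 and the names used here are arbitrary assumptions, not a documented limit of the site.

```shell
#!/usr/bin/env bash
# Sketch: split a list of gene IDs into fixed-size batches for pasting.
# split_gene_ids is a hypothetical helper name.

split_gene_ids() {
  local ids_file=$1                    # one Entrez Gene ID per line
  local batch_size=${2:-1000}          # lines per batch (assumed default)
  local prefix=${3:-gene_ids_batch_}   # output files: <prefix>aa, <prefix>ab, ...
  split -l "$batch_size" "$ids_file" "$prefix"
}
```

For example, split_gene_ids gene_ids.txt 1000 writes gene_ids_batch_aa, gene_ids_batch_ab and so on, each of which can be pasted into the form separately.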

This can also be done with the NCBI datasets command-line tool.
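
A rough sketch of the command-line route, assuming a recent version of the NCBI datasets CLI is installed. The wrapper name download_gene_sequences and the file names are made up for illustration, and the flags are given as I understand them, so please check them against datasets download gene gene-id --help for your version.

```shell
#!/usr/bin/env bash
# Sketch: download transcript and protein sequences for a list of Gene IDs
# in one request with the NCBI datasets CLI.
# download_gene_sequences is a hypothetical wrapper, not part of the tool.

download_gene_sequences() {
  local ids_file=$1                       # one Entrez Gene ID per line
  local zip_out=${2:-gene_sequences.zip}  # name of the downloaded package
  datasets download gene gene-id \
    --inputfile "$ids_file" \
    --include rna,protein \
    --filename "$zip_out"
}
```

After download_gene_sequences gene_ids.txt, unzip the package; the FASTA files should sit under ncbi_dataset/data/ inside it, though the exact layout may differ between versions.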