I need to retrieve a set of protein and mRNA sequences
7 months ago
george • 0

Hello, I have a file listing proteins and mRNAs (about 13,000 proteins and 12,000 mRNAs), and I need a way to retrieve the sequences for all of them. I have the Entrez Gene ID for each one.


What have you tried?


I tried using curl because I saw a similar post here, but that didn't go well, so now I am trying GenoMax's answer below.


I tried using curl

curl is a command-line utility that can perform a wide range of operations. What exactly did you do, what did you expect, and what errors did you run into? Always include these details when you ask for help.

7 months ago
GenoMax 141k

Option 1: use Entrez Direct (EDirect). For mRNAs, shown here for Entrez Gene ID 7157 (TP53):

$ esearch -db gene -query '7157' | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta | grep ">"
>NR_176326.1 Homo sapiens tumor protein p53 (TP53), transcript variant 14, non-coding RNA
>NM_001407264.1 Homo sapiens tumor protein p53 (TP53), transcript variant 10, mRNA
>NM_001407263.1 Homo sapiens tumor protein p53 (TP53), transcript variant 9, mRNA
>NM_001407268.1 Homo sapiens tumor protein p53 (TP53), transcript variant 12, mRNA
>NM_001407269.1 Homo sapiens tumor protein p53 (TP53), transcript variant 12, mRNA
>NM_001407271.1 Homo sapiens tumor protein p53 (TP53), transcript variant 13, mRNA
>NM_001407270.1 Homo sapiens tumor protein p53 (TP53), transcript variant 13, mRNA

For proteins:

$ esearch -db gene -query '7157' | elink -db gene -target protein -name gene_protein_refseq | efetch -format fasta | grep ">"
>NP_001394193.1 cellular tumor antigen p53 isoform a [Homo sapiens]
>NP_001394192.1 cellular tumor antigen p53 isoform g [Homo sapiens]
>NP_001394197.1 cellular tumor antigen p53 isoform b [Homo sapiens]
>NP_001394198.1 cellular tumor antigen p53 isoform i [Homo sapiens]
>NP_001394200.1 cellular tumor antigen p53 isoform i [Homo sapiens]
>NP_001394199.1 cellular tumor antigen p53 isoform b [Homo sapiens]
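
To run the two commands above over a whole list of Entrez Gene IDs, a loop along the following lines may work. This is only a sketch: `fetch_linked_fasta`, `gene_ids.txt` and `PAUSE_SECONDS` are hypothetical names chosen for illustration, the ID file is assumed to hold one Gene ID per line, and the `grep ">"` from the examples is dropped so the full sequences are kept. Standard input is redirected from /dev/null because EDirect commands inside a `while read` loop can otherwise swallow the rest of the ID list.

```shell
#!/usr/bin/env bash
# Sketch: fetch FASTA records linked to each Entrez Gene ID in a file.
# fetch_linked_fasta is a hypothetical helper, not part of EDirect.
fetch_linked_fasta() {
    local id_file="$1"    # one Entrez Gene ID per line (assumed layout)
    local target_db="$2"  # nuccore (mRNA) or protein
    local link_name="$3"  # gene_nuccore_refseqrna or gene_protein_refseq
    local id
    while IFS= read -r id || [ -n "$id" ]; do
        id="${id%$'\r'}"              # tolerate Windows line endings
        [ -z "$id" ] && continue      # skip blank lines
        # Same pipeline as in the answer, with stdin detached from the ID list
        esearch -db gene -query "$id" < /dev/null |
            elink -db gene -target "$target_db" -name "$link_name" |
            efetch -format fasta
        # Assumption: a short pause between genes to go easy on NCBI servers
        sleep "${PAUSE_SECONDS:-1}"
    done < "$id_file"
}

# Example usage (file names are placeholders):
#   fetch_linked_fasta gene_ids.txt nuccore gene_nuccore_refseqrna > mrna.fasta
#   fetch_linked_fasta gene_ids.txt protein gene_protein_refseq   > protein.fasta
```

With about 13,000 IDs a one-gene-at-a-time loop like this will be slow, so the datasets route may be the more practical choice for the full list.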

Option 2:

Use datasets: https://www.ncbi.nlm.nih.gov/datasets/gene/

Click the "search by identifiers" tab and paste in your identifiers (you may need to do this in batches).

On the results page, select the genes, then click Downloads --> Download package --> choose gene/protein/transcript sequence --> Download data package.

This can also be done with the command-line datasets tool.
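
For the command-line route, something like the following might serve as a starting point. It is a sketch under assumptions rather than a tested recipe: `download_gene_package` and the file names are hypothetical, the ID file is assumed to contain one Entrez Gene ID per line, and the `--inputfile`, `--include` and `--filename` options are assumed to behave as in recent releases of the NCBI datasets CLI (check `datasets download gene gene-id --help` for your installed version).

```shell
#!/usr/bin/env bash
# Sketch: download transcript and protein sequences for a list of Gene IDs.
# download_gene_package is a hypothetical wrapper around the datasets CLI.
download_gene_package() {
    local id_file="$1"                     # one Entrez Gene ID per line (assumed)
    local out_zip="${2:-gene_package.zip}" # output archive name (placeholder)
    # Ask for RNA and protein sequences for every ID listed in the file
    datasets download gene gene-id \
        --inputfile "$id_file" \
        --include rna,protein \
        --filename "$out_zip"
}

# Example usage (file names are placeholders):
#   download_gene_package gene_ids.txt genes.zip
#   unzip genes.zip -d genes
```

The sequences are then found inside the unzipped package.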
