Question: Get Fasta File With Protein Sequences Given Entrez Gene Ids
4
gravatar for Stephen
5.8 years ago by
Stephen2.6k
Charlottesville Virginia
Stephen2.6k wrote:

I would like to get a FASTA file with protein sequences given a list of Entrez Gene IDs, e.g.:

19084
112407
18113
...etc.

For example, gene id 19084 links to protein ID 54036156, with fasta sequence

>gi|54036156|sp|Q9DBC7.3|KAP0_MOUSE RecName: Full=cAMP-dependent protein kinase type I-alpha regulatory subunit
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
TGIRTDSREDEISPPPPNPVVKGRRRRGAISAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSH
LDDNERSDIFDAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSFGELALIYGTP
RAATVKAKTNVKLWGIDRDSYRRILMGSTLRKRKMYEEFLSKVSILESLDKWERLTVADALEPVQFEDGQ
KIVVQGEPGDEFFIILEGTAAVLQRRSENEEFVEVGRLGPSDYFGEIALLMNRPRAATVVARGPLKCVKL
DRPRFERVLGPCSDILKRNIQQYNSFVSLSV

I bet there's a way to do this with e-utils, but I can't figure out how. I realize that this can be done using the Ensembl Biomart, but the ID conversion from Entrez to Ensembl gene IDs results in lots of duplicates, i.e. one to many and many to one mappings both ways.

Edit: need to do this for ~17,000 gene IDs.

fasta entrez eutils • 13k views
ADD COMMENTlink modified 5.8 years ago by AGS230 • written 5.8 years ago by Stephen2.6k
7
gravatar for Pierre Lindenbaum
5.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum109k wrote:

use NCBI-ELINK to get the protein id from the gene-id

echo -e "19084\n112407\n18113" | while read G; do curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=protein&id=${G}" | grep -A 1 "<Link>" | grep "<Id>" | cut -d '>' -f 2 | cut -d '<' -f 1 | while read S ; do curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=${S}&retmode=text&rettype=fasta" ; done;  done

>gi|336020222|gb|AEH77257.1| cAMP-dependent protein kinase type I-alpha regulatory subunit transcript variant 4, partial [Mus musculus]
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
TGIRTDSREDEISPPPPNPVVKGRRRRGAISAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSH
LDDNERSDIFDAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSFGELALIYGTP
RAATVKAKTNVKLWGIDRDSYRRILMGST

>gi|336020220|gb|AEH77256.1| cAMP-dependent protein kinase type I-alpha regulatory subunit transcript variant 3, partial [Mus musculus]
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
TGIRTDSREDEISPPPPNPVVKGRRRRGAISAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSH
LDDNERSDIFDAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSFGELALIYGTP
RAATVKAKTNVKLWGIDRDSYRRILMGST

>gi|336020218|gb|AEH77255.1| cAMP-dependent protein kinase type I-alpha regulatory subunit transcript variant 2, partial [Mus musculus]
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
TGIRTDSREDEISPPPPNPVVKGRRRRGAISAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSH
LDDNERSDIFDAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSFGELALIYGTP
RAATVKAKTNVKLWGIDRDSYRRILMGST

>gi|148702421|gb|EDL34368.1| protein kinase, cAMP dependent regulatory, type I, alpha, isoform CRA_a [Mus musculus]
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
(...)

Updated: i've used ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz to get the relation gene-protein:

echo "19084" > geneids.txt
echo "112407" >> geneids.txt
sort -t '    ' -k1,1 geneids.txt | uniq > ordered_geneids.txt
curl -s "ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz" | gunzip -c | cut -d '    ' -f2,7 |\
        grep -v "-" | fgrep -w -f geneids.txt | sort -t '    ' -k1,1 -k2,2 -u  |\
            join -t '    ' -1 1 -2 1 ordered_geneids.txt - | cut -d '    ' -f2 > protein_gi.txt

result:

$ cat protein_gi.txt 
269308213
30794476

then use http://www.ncbi.nlm.nih.gov/sites/batchentrez and "download" to retrieve the sequences.

ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by Pierre Lindenbaum109k

Thanks Pierre. Do you know how this will scale to ~17,000 IDs?

ADD REPLYlink written 5.8 years ago by Stephen2.6k

No. I would download NCBI ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz , extract the protein acn and use NCBI batch entrez with those acns.

ADD REPLYlink written 5.8 years ago by Pierre Lindenbaum109k

Did this.

zcat gene2accession.gz | awk '{if ($2==19084) print}' | wc -l

returns 37 lines. For gene id 19084, the protein accession I'm looking for is 54036156. How do I know this is the right one out of those 37 lines?

ADD REPLYlink written 5.8 years ago by Stephen2.6k

Using this approach I get a lot of duplicates unfortunately. Biomart, with your toy example of 3 ids, gives no duplicates.

ADD REPLYlink written 5.8 years ago by Joachim Jacob0

what is a "duplicate" ? same id or same sequence ? anyway, It's easy to remove the sequence/name duplicate with a simple sort | uniq

ADD REPLYlink written 5.8 years ago by Pierre Lindenbaum109k

Many, many protein IDs for the same gene ID. E.g. searching for entrez gene 19804 ( http://www.ncbi.nlm.nih.gov/gene/?term=19084 ) returns protein accession number Q9DBC7.3 (genepept 54036156 - http://www.ncbi.nlm.nih.gov/protein/54036156 ). But executing the command above:

zcat gene2accession.gz | awk '{if ($2==19084) print}' | wc -l

... returns 37 lines with unique protein accession numbers. I can't see what's different about the line containing 54036156. A simple sort|uniq wouldn't work in this case.

ADD REPLYlink modified 5.8 years ago • written 5.8 years ago by Stephen2.6k

I meant sort| uniq after linearizing the fasta sequence to get the unique sequences. Or just use "sort -u" to only keep one record per gene-id.

ADD REPLYlink written 5.8 years ago by Pierre Lindenbaum109k

Thanks Pierre. Hmm, looks like I'm exhausting my available 32GB memory at the fgrep step.

ADD REPLYlink written 5.8 years ago by Stephen2.6k

I was using it to go faster. You can remove it from the pipeline.

ADD REPLYlink written 5.8 years ago by Pierre Lindenbaum109k

Hello,

I'm looking at your answer two years later and am wondering if I could get some help with your updated code (for finding protein IDs as opposed to protein FASTA sequences). I tried the code as typed, but I get the following output:

sort: multi-character tab `    '

sort: multi-character tab `    '

join: illegal tab character specification

cut: bad delimiter

cut: bad delimiter

with no results (protein_gi.txt is blank, as is ordered_geneids.txt).When I remove -t ' '

after sort, ordered_geneids.txt is generated appropriately, but I am at a loss as to how I could alter the curl line to get the code to function. Any help would be greatly appreciated.

 
ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by hmg20
1
gravatar for Vivek
5.8 years ago by
Vivek2.1k
Denmark
Vivek2.1k wrote:

You could likely use Batch entrez to do a bulk download for the 17k ids and get a sequence file in genbank format.

This can be parsed using Bio perl.

ADD COMMENTlink written 5.8 years ago by Vivek2.1k

It seems that the default genbank download does not contain that information because this is a related item and not the actual protein it encodes for. Perhaps there is a more detailed genbank or other format that does contain that information

ADD REPLYlink modified 5.8 years ago • written 5.8 years ago by Istvan Albert ♦♦ 77k
1
gravatar for AGS
5.8 years ago by
AGS230
Brooklyn, ny
AGS230 wrote:

I would convert the Entrez ID's to RefSeq mRNA ID's using DAVID:

DAVID

Save the text output, then use awk to grab the column of RefSeq ID's

awk '{print $2}' DAVID_conversions.txt > refseq_ids.txt

Paste that list into the TableBrowser @ UCSC to get a fasta file of all your genes of interest:

Genome Browser

Output for your test cases:


>NP_035054.1
MESGFTSKDTYLSHFNPRDYLEKYYSFGSRHCAENEILRHLLKNLFKIFC
LGAVKGELLIDIGSGPTIYQLLSACESFTEIIVSDYTDQNLWELQKWLKK
EPGAFDWSPVVTYVCDLEGNRMKGPEKEEKLRRAIKQVLKCDVTQSQPLG
GVSLPPADCLLSTLCLDAACPDLPAYRTALRNLGSLLKPGGFLVMVDALK
SSYYMIGEQKFSSLPLGWETVRDAVEEAGYTIEQFEVISQNYSSTTSNNE
GLFSLVGRKPGRSE
>NP_068680.1
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFL
REYFERLEKEEARQIQCLQKTGIRTDSREDEISPPPPNPVVKGRRRRGAI
SAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSHLDDNERSDIF
DAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSF
GELALIYGTPRAATVKAKTNVKLWGIDRDSYRRILMGSTLRKRKMYEEFL
SKVSILESLDKWERLTVADALEPVQFEDGQKIVVQGEPGDEFFIILEGTA
AVLQRRSENEEFVEVGRLGPSDYFGEIALLMNRPRAATVVARGPLKCVKL
DRPRFERVLGPCSDILKRNIQQYNSFVSLSV
>NP_082409.2
MPLGHIMRLDLEKIALEYIVPCLHEVGFCYLDNFLGEVVGDCVLERVKQL
HYNGALRDGQLAGPCAGVSKRHLRGDQITWIGGNEEGCEAINFLLSLIDR
LVLYCGSRLGKYYVKERSKAMVACYPGNGTGYVRHVDNPNGDGRCITCIY
YLNKNWDAKLHGGVLRIFPEGKSFVADVEPIFDRLLFFWSDRRNPHEVQP
SYATRYAMTVWYFDAEERAEAKKKFRNLTRKTESALAKD
ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by AGS230
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 938 users visited in the last hour