Question: Download all peptide sequences from NCBI in fasta format?
Tom30 wrote (13 months ago):

I want to download all the peptide sequences in the NCBI protein database in FASTA format (i.e. a > followed by the peptide name, then the peptide sequence). I saw that there is a MeSH term describing what a peptide is here, but I can't work out how to incorporate it.

I wrote this:

import Bio
from Bio import Entrez

Entrez.email = 'test@gmail.com'
handle = Entrez.esearch(db="protein", term="peptide")
record = handle.read()
out_handle = open('myfasta.fasta', 'w')
out_handle.write(record.rstrip('\n'))

But it only writes out 995 IDs to the file, with no sequences. Could someone show me where I'm going wrong?

biopython • 338 views
modified 11 months ago by Biostar ♦♦ 20 • written 13 months ago by Tom30

GenoMax appears to have answered. You may also find a couple of my Python scripts useful for this work: https://github.com/kevinblighe/PythonScripts

written 11 months ago by Kevin Blighe69k

GenoMax94k (United States) wrote (13 months ago):

Using Entrez Direct (EDirect), one can do something like this:

$ esearch -db protein -query "peptide" | efetch -format fasta | grep ">" | head -10
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
>QGT67062.1 peptide antibiotic transporter SbmA [Klebsiella pneumoniae]
>QGT66988.1 pyrroloquinoline quinone precursor peptide PqqA [Klebsiella pneumoniae]
>QGT66961.1 phenylalanyl--tRNA ligase operon leader peptide [Klebsiella pneumoniae]
>QGT66959.1 peptide chain release factor N(5)-glutamine methyltransferase [Klebsiella pneumoniae]
>QGT66942.1 his operon leader peptide [Klebsiella pneumoniae]
>QGT66735.1 ilv operon leader peptide [Klebsiella pneumoniae]

Remove the grep ">" | head -10 to get the actual sequences.

This may just get you sequences that have the word peptide in their <Title> field. Not sure if that is what you ultimately want.

Biopython is probably not the right tool here, since you are going to get hundreds of thousands of sequences.
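For a small test set, though, the Biopython attempt from the question can be adapted. The missing step is that esearch only returns IDs, so a second efetch call is needed to get the sequences. The sketch below assumes Biopython is installed; the function name download_fasta and its parameters are made up for illustration, and the Bio.Entrez module is passed in as an argument so the function can be tried without network access. Treat it as a rough outline rather than a tuned downloader.

```python
def download_fasta(entrez, term, out_path, batch_size=200, max_records=1000):
    """Search the protein database, then fetch the hits as FASTA in batches.

    `entrez` is expected to be the Bio.Entrez module with Entrez.email set.
    Returns the number of records requested from NCBI.
    """
    # Step 1: esearch only finds matching records. usehistory="y" keeps the
    # result set on the NCBI server so efetch can page through it.
    handle = entrez.esearch(db="protein", term=term, usehistory="y")
    search = entrez.read(handle)
    handle.close()

    # Cap the download; an uncapped "peptide" search is very large.
    total = min(int(search["Count"]), max_records)

    # Step 2: efetch turns the stored result set into FASTA text.
    with open(out_path, "w") as out:
        for start in range(0, total, batch_size):
            handle = entrez.efetch(
                db="protein",
                rettype="fasta",
                retmode="text",
                retstart=start,
                retmax=min(batch_size, total - start),
                webenv=search["WebEnv"],
                query_key=search["QueryKey"],
            )
            out.write(handle.read())
            handle.close()
    return total


# Usage (needs Biopython and network access):
# from Bio import Entrez
# Entrez.email = "you@example.org"  # placeholder address
# download_fasta(Entrez, "peptide", "myfasta.fasta", max_records=500)
```

The max_records cap is there on purpose: for anything close to the full result set, the command-line route or a bulk download is likely the better option.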

You could also get the FASTA file for the nr BLAST database from NCBI and parse out the things you need.
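Parsing a downloaded FASTA file can be sketched in plain Python. The example below keeps records whose header mentions a keyword and whose sequence is at most a given length. The function names, the keyword, and the length cutoff are all assumptions for illustration, and the second toy record is made up to show a rejected entry.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines, keyword="peptide", max_length=100):
    """Keep records mentioning `keyword` with at most `max_length` residues."""
    for header, seq in read_fasta(lines):
        if keyword.lower() in header.lower() and len(seq) <= max_length:
            yield header, seq


example = [
    ">QGT66961.1 phenylalanyl--tRNA ligase operon leader peptide [Klebsiella pneumoniae]",
    "MNAAIFRFFFYFST",
    ">toy1 made-up long entry with peptide in its title",
    "M" * 150,
]

# Only the short record passes the 30-residue cutoff.
for header, seq in filter_fasta(example, max_length=30):
    print(">" + header)
    print(seq)
```

Because read_fasta accepts any iterable of lines, an open file handle can be passed in place of the example list, so a large file is streamed rather than loaded into memory.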

modified 13 months ago • written 13 months ago by GenoMax94k

This is fantastic, thank you.

written 13 months ago by Tom30

Following up on your earlier point about the number of sequences: is it possible to add a filter so that only sequences below a maximum length are downloaded in FASTA format? I can see what you mean: some entries just say peptide in the header but are full proteins. I just want to make a test set, so pulling out the shorter sequences based on this criterion is fine. Let me know if you think this is a completely separate question.

Update: I am trying this:

esearch -db protein -query "peptide AND 1:100[SLEN]" | efetch -format fasta >> ncbi_slen.fasta
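A length-restricted query joins the keyword and a sequence-length range with AND. The small helper below is only a sketch of how such a query string could be composed; build_query is a made-up name, and the [Title] and [SLEN] field tags are assumed from the Entrez protein search fields, so they are worth checking against NCBI's field list before relying on them.

```python
def build_query(keyword, field=None, max_len=None, min_len=1):
    """Compose an Entrez query string with an optional sequence-length range."""
    # Restrict the keyword to one search field if requested, e.g. peptide[Title].
    term = f"{keyword}[{field}]" if field else keyword
    # SLEN takes a min:max range of sequence lengths.
    if max_len is not None:
        term = f"{term} AND {min_len}:{max_len}[SLEN]"
    return term


print(build_query("peptide", max_len=100))
print(build_query("peptide", field="Title", max_len=30))
```

The resulting string can be passed to esearch via -query on the command line, or as the term argument in Biopython.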

modified 13 months ago • written 13 months ago by Tom30

Try this to get peptides that are 30 AA or less. Remove head -15 to get more.

$ esearch -db protein -query "peptide" | esummary | xtract -pattern DocumentSummary -element Caption,Slen | head -15 | awk -F ' ' '{if ($2 <= 30) {print $1}}' | xargs -n 1 sh -c 'efetch -db protein -id $0 -format fasta'
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
MNRIGMITTIITTTITTGNGAG
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
MIRTARITSLLLLNACHLRGRLLGDVQR
>QGT66988.1 pyrroloquinoline quinone precursor peptide PqqA [Klebsiella pneumoniae]
MWKKPAFIDLRLGLEVTLYISNR
>QGT66961.1 phenylalanyl--tRNA ligase operon leader peptide [Klebsiella pneumoniae]
MNAAIFRFFFYFST
>QGT66942.1 his operon leader peptide [Klebsiella pneumoniae]
MNRVQFKHHHHHHHPD
written 13 months ago by GenoMax94k