Question: Download all peptide sequences from NCBI in fasta format?
Tom wrote:

I want to download all the peptide sequences in the NCBI protein database in FASTA format (i.e. > and the peptide name, followed by the peptide sequence). I saw there is a MeSH term describing what a peptide is here, but I can't work out how to incorporate it.

I wrote this:

import Bio
from Bio import Entrez

Entrez.email = 'test@gmail.com'
handle = Entrez.esearch(db="protein", term="peptide")
record = handle.read()
out_handle = open('myfasta.fasta', 'w')
out_handle.write(record.rstrip('\n'))

But it only writes 995 IDs to the file, with no sequences. Could someone show me where I'm going wrong?

biopython • 222 views
modified 6 months ago by Biostar • written 8 months ago by Tom

genomax appears to have answered. You may also find a couple of my Python scripts of some use for this work that you are doing: https://github.com/kevinblighe/PythonScripts

written 6 months ago by Kevin Blighe
genomax wrote:

Using EntrezDirect one can do something like this:

$ esearch -db protein -query "peptide" | efetch -format fasta | grep ">" | head -10
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
>QGT67062.1 peptide antibiotic transporter SbmA [Klebsiella pneumoniae]
>QGT66988.1 pyrroloquinoline quinone precursor peptide PqqA [Klebsiella pneumoniae]
>QGT66961.1 phenylalanyl--tRNA ligase operon leader peptide [Klebsiella pneumoniae]
>QGT66959.1 peptide chain release factor N(5)-glutamine methyltransferase [Klebsiella pneumoniae]
>QGT66942.1 his operon leader peptide [Klebsiella pneumoniae]
>QGT66735.1 ilv operon leader peptide [Klebsiella pneumoniae]

Remove the grep ">" | head -10 to get the actual sequences.

This may just get you sequences that have the word peptide in their <Title> field. I am not sure if that is what you ultimately want.

Using biopython is probably not the right choice of tool here since you are going to get hundreds of thousands of sequences.
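
If you do want to stay in Python for a small test set, the missing step in the script from the question is efetch: esearch only returns a list of IDs, which is why no sequences ended up in the file. A minimal sketch, assuming Biopython is installed; fetch_fasta, batch_starts, batch_size and limit are names made up for this example, and the e-mail address you pass in should be your own:

```python
def batch_starts(total, batch_size):
    """Return the retstart offsets needed to page through `total` records."""
    return list(range(0, total, batch_size))


def fetch_fasta(term, out_path, email, batch_size=500, limit=None):
    """Search the NCBI protein database and write the hits as FASTA.

    Illustrative helper, not part of Biopython. Returns the number of
    records requested.
    """
    # Imported here so batch_starts() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address

    # Step 1: esearch returns IDs only. usehistory="y" keeps the full
    # result set on the NCBI server so it can be fetched in batches.
    handle = Entrez.esearch(db="protein", term=term, usehistory="y")
    search = Entrez.read(handle)
    handle.close()

    total = int(search["Count"])
    if limit is not None:
        total = min(total, limit)

    # Step 2: efetch is the call that actually returns sequences.
    with open(out_path, "w") as out:
        for start in batch_starts(total, batch_size):
            handle = Entrez.efetch(
                db="protein",
                rettype="fasta",
                retmode="text",
                retstart=start,
                retmax=min(batch_size, total - start),
                webenv=search["WebEnv"],
                query_key=search["QueryKey"],
            )
            out.write(handle.read())
            handle.close()
    return total
```

Calling fetch_fasta("peptide", "myfasta.fasta", "you@example.org", limit=1000) would write the first 1000 hits. Without a limit it would try to page through every match, which is the scale problem mentioned above.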

You could also get the FASTA file for the nr BLAST database from NCBI and parse out the things you need.
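
Parsing out what you need from a local FASTA file can be sketched in plain Python. This is only an illustration: read_fasta and select_by_keyword are hypothetical helper names, and the second sample record is invented so that the filter has something to reject.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_by_keyword(records, keyword):
    """Keep records whose header contains `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [(h, s) for h, s in records if keyword in h.lower()]


# Sample input: the first record is from the output above (split over two
# lines to show multi-line sequences); the second is made up.
sample = [
    ">QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]",
    "MNRIGMITTI",
    "ITTTITTGNGAG",
    ">EXAMPLE1.1 invented kinase record",
    "MKKLAV",
]
records = list(read_fasta(sample))
hits = select_by_keyword(records, "peptide")
for header, seq in hits:
    print(">" + header)
    print(seq)
```

For a real file you would pass an open file object instead of the sample list, e.g. read_fasta(open("nr.fasta")), where the file name is a placeholder. Because read_fasta is a generator, the whole file never has to fit in memory.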

modified 8 months ago • written 8 months ago by genomax

This is fantastic, thank you.

written 8 months ago by Tom

On your earlier point about the number of sequences: is it possible to add a filter so that only sequences below a maximum length are pulled down in FASTA format? I can see what you mean, some entries just say peptide in the header but are full proteins. I just want to make a test set, so pulling out the shorter sequences based on this criterion is fine. Let me know if you think this is a completely separate question.

Update: I am trying this:

esearch -db protein -query "peptide AND 1:100[SLEN]" | efetch -format fasta >> ncbi_slen.fasta

modified 8 months ago • written 8 months ago by Tom

Try this to get peptides that are shorter than 30 AA (the awk test is $2 < 30). Remove head -15 to get more.

$ esearch -db protein -query "peptide" | esummary | xtract -pattern DocumentSummary -element Caption,Slen | head -15 | awk -F ' ' '{if ($2 < 30) {print $1}}'| xargs -n 1 sh -c 'efetch -db protein -id $0 -format fasta' 
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
MNRIGMITTIITTTITTGNGAG
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
MIRTARITSLLLLNACHLRGRLLGDVQR
>QGT66988.1 pyrroloquinoline quinone precursor peptide PqqA [Klebsiella pneumoniae]
MWKKPAFIDLRLGLEVTLYISNR
>QGT66961.1 phenylalanyl--tRNA ligase operon leader peptide [Klebsiella pneumoniae]
MNAAIFRFFFYFST
>QGT66942.1 his operon leader peptide [Klebsiella pneumoniae]
MNRVQFKHHHHHHHPD
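
The awk step in that pipeline just keeps the first column when the second column (Slen) is below the cutoff. The same logic in Python, as a rough sketch: short_accessions is a made-up name, the input is assumed to be whitespace-separated accession and length pairs as produced by the xtract step, and the third sample line uses an invented length.

```python
def short_accessions(summary_lines, max_len=30):
    """Return accessions whose length is strictly below `max_len`.

    Each input line is expected to look like 'ACCESSION<whitespace>LENGTH'.
    Lines that do not fit that shape are skipped.
    """
    keep = []
    for line in summary_lines:
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        if int(fields[1]) < max_len:  # same test as awk's $2 < 30
            keep.append(fields[0])
    return keep


# Sample input: lengths 25 and 22 match the sequences shown above;
# the 406 on the last line is invented for illustration.
sample = [
    "QGT67293\t25",
    "QGT67085\t22",
    "QGT67062\t406",
]
print(short_accessions(sample))
```

The resulting accessions could then be passed to efetch one at a time, as the xargs step does.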
written 8 months ago by genomax