Question: Identify all proteins smaller than 150 AA
dominiquealain.blanchard wrote:

Hi, I'm looking to extract protein IDs and sequences based on their size. More specifically, I would like to identify, in different genomes, all the proteins with a size in the range of 40-150 AA. Any ideas? Thanks

query protein sequence database • 735 views
modified 2.4 years ago by Elisabeth Gasteiger (1.6k) • written 2.4 years ago by dominiquealain.blanchard (0)

What do you mean by "identify"? Do you want to download them, for example from NCBI, for different organisms, or do you mean something else?

written 2.4 years ago by j_susat (40)

"More specifically, I would like to identify, in different genomes, all the proteins with a size in the range of 40-150 AA."

This question (in its present form) is too broad. Practically every known genome is likely to have proteins that fall in the range of 40-150 AA. You need to specify some additional criteria to narrow the selection.

You may also want to do this search using well-known, annotated proteins from UniProt, specifically Swiss-Prot.

modified 2.4 years ago • written 2.4 years ago by genomax (67k)
Elisabeth Gasteiger (Geneva) wrote:

The corresponding query in the UniProt Knowledgebase is

length:[40 TO 150]

http://www.uniprot.org/uniprot/?query=length%3A[40+TO+150]&sort=score

You can use the Advanced Search to obtain this query (first select "Sequence", then "Length" and specify your ranges).
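
The same query string can be combined with other fields and sent to UniProt from the command line. The following is only a sketch: the length:[40 TO 150] syntax comes from the query above, while the helper name uniprot_query, the organism_id field, the taxon ID 1280 (Staphylococcus aureus) and the rest.uniprot.org endpoint in the commented curl call are assumptions about the current UniProt service rather than part of this answer, so check them against the UniProt help pages before relying on them.

```shell
# Build a UniProt query string for a length range, optionally limited to one taxon.
# usage: uniprot_query MIN MAX [TAXON_ID]
# uniprot_query is a hypothetical helper name; organism_id is an assumed field name.
uniprot_query() {
    q="length:[$1 TO $2]"
    if [ -n "${3:-}" ]; then
        q="$q AND organism_id:$3"
    fi
    printf '%s\n' "$q"
}

# Example download (needs network access, so it is left commented out).
# curl -G URL-encodes the query and appends it to the assumed endpoint:
# curl -G "https://rest.uniprot.org/uniprotkb/stream" \
#      --data-urlencode "query=$(uniprot_query 40 150 1280)" \
#      --data-urlencode "format=fasta" \
#      -o small_proteins.fasta
```

Keeping the query construction in its own function makes it easy to print and inspect the query before any download is started.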

modified 2.4 years ago by genomax (67k) • written 2.4 years ago by Elisabeth Gasteiger (1.6k)
j_susat (Kiel) wrote:

OK,

since you tagged this with "database" and "query", I guess it is about downloading from a database. Here is an idea of how you could do that with Entrez Direct:

esearch -db protein -query "Staphylococcus aureus [ORGN]" | efilter -query "40:150 [SLEN]" | efetch -format fasta > aureus_protein_test

Staph aureus is just an example here. Put your desired organism name in its place and you are good to go. If you have a list of different organisms, you could read the list in a loop and download the desired proteins for every organism with one command.
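
The loop over a list of organisms could look roughly like the sketch below. It reuses the pipeline above unchanged and assumes that Entrez Direct is installed and on the PATH. The file organisms.txt (one organism name per line), the output file naming, and the function names safe_name, fetch_small_proteins and fetch_all are hypothetical and only there for illustration.

```shell
# Turn "Staphylococcus aureus" into "Staphylococcus_aureus" for use in a file name.
safe_name() {
    printf '%s\n' "$1" | tr -s ' ' '_'
}

# Run the pipeline from above for one organism, given as the first argument.
# esearch gets its stdin from /dev/null so that it cannot consume the
# organism list that the calling loop is reading.
fetch_small_proteins() {
    esearch -db protein -query "$1 [ORGN]" < /dev/null |
        efilter -query "40:150 [SLEN]" |
        efetch -format fasta > "$(safe_name "$1")_40-150aa.fasta"
}

# Read the list line by line and fetch one FASTA file per organism.
# Empty lines are skipped.
fetch_all() {
    while IFS= read -r organism; do
        [ -n "$organism" ] || continue
        fetch_small_proteins "$organism"
    done < "$1"
}

# Usage (needs network access):
# fetch_all organisms.txt
```

Writing one file per organism keeps the results separate, so a failed download for one species can be repeated without touching the others.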

Here is some information about Entrez Direct

written 2.4 years ago by j_susat (40)
dominiquealain.blanchard wrote:

Thank you. I'm surprised to see 316536 references. Would it be possible to eliminate duplicates and restrict the search to secreted proteins?

written 2.4 years ago by dominiquealain.blanchard (0)

Please use ADD COMMENT to respond to earlier replies; that way this thread remains logically structured and easy to follow.

written 2.4 years ago by WouterDeCoster (39k)

You could try this:

esearch -db protein -query "Staphylococcus aureus [ORGN] AND refseq[filter]" | efilter -query "40:150 [SLEN] AND secretion [ALL]" | efetch -format fasta  > aureus_protein_test

It is a bit more stringent due to the refseq and secretion restrictions. I guess there is a better way to search every field for secretion, but I have no clue at the moment.
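
If identical sequences still show up under different accessions in the downloaded file, they can also be collapsed locally. This is a minimal sketch using awk, not something Entrez Direct does for you: the function name dedup_fasta is made up, "duplicate" is taken to mean an exactly identical amino acid sequence, and only the first header seen for each sequence is kept.

```shell
# dedup_fasta FILE: print each distinct sequence once, with the first header seen.
# Wrapped sequence lines are joined, so every sequence comes out on a single line.
dedup_fasta() {
    awk '
        # Print the current record unless its sequence was already seen.
        function emit() {
            if (header != "" && !(seq in seen)) {
                seen[seq] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # a new record starts
        { seq = seq $0 }                                # collect sequence lines
        END { emit() }                                  # do not forget the last record
    ' "$1"
}

# Usage:
# dedup_fasta aureus_protein_test > aureus_protein_test.unique
```

Because the comparison is on the full sequence string, near-identical proteins that differ by a single residue are kept as separate entries.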

written 2.4 years ago by j_susat (40)