Hi, I'm looking to extract protein IDs and sequences based on their size. More specifically, I would like to identify, in different genomes, all the proteins with a length in the range of 40-150 aa. Any ideas? Thanks
The corresponding query in the UniProt Knowledgebase is:
length:[40 TO 150]
http://www.uniprot.org/uniprot/?query=length%3A[40+TO+150]&sort=score
You can use the Advanced Search to obtain this query (first select "Sequence", then "Length" and specify your ranges).
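If you would rather run the same length query from the command line, here is a minimal sketch. It assumes the current UniProt REST endpoint at rest.uniprot.org (not the older URL above), the `organism_id` query field, and that `curl` is installed. The function name `uniprot_small_proteins` is made up for illustration, and 1280 is the NCBI taxonomy ID for Staphylococcus aureus, used only as an example.

```shell
# Hypothetical helper: fetch UniProtKB entries of 40-150 aa for one organism as FASTA.
# $1 = NCBI taxonomy ID (e.g. 1280 for Staphylococcus aureus)
# Assumption: the search endpoint is paginated, so this returns only the
# first page (up to 500 entries) of the results.
uniprot_small_proteins() {
    taxid=$1
    # -G sends the url-encoded parameters as a GET query string,
    # which takes care of the spaces and brackets in the query.
    curl -sS -G "https://rest.uniprot.org/uniprotkb/search" \
        --data-urlencode "query=length:[40 TO 150] AND organism_id:${taxid}" \
        --data-urlencode "format=fasta" \
        --data-urlencode "size=500"
}
```

Usage would be something like `uniprot_small_proteins 1280 > aureus_uniprot.fasta`. For result sets larger than one page you would need to follow the pagination, which this sketch does not do.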
Since you tagged this with database and query, I guess it is about downloading from a database. Here is an idea of how you could do that with Entrez Direct:
esearch -db protein -query "Staphylococcus aureus [ORGN]" | efilter -query "40:150 [SLEN]" | efetch -format fasta > aureus_protein_test
In this case Staphylococcus aureus is just an example. Put your desired organism name in its place and you are good to go. If you have a list of different organisms, you could read the list in a loop and download the desired proteins for every organism with one command.
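The loop over a list of organisms could look roughly like this. It is only a sketch: it assumes Entrez Direct is installed and that you have a plain text file with one organism name per line. The function name `fetch_small_proteins` and the output file naming are made up for illustration.

```shell
# Hypothetical helper: download 40-150 aa proteins for every organism
# listed (one name per line) in the file given as $1.
fetch_small_proteins() {
    list_file=$1
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r organism || [ -n "$organism" ]; do
        # Skip blank lines
        [ -z "$organism" ] && continue
        # Output name, e.g. "Staphylococcus_aureus_40-150.fasta"
        out="$(printf '%s' "$organism" | tr ' ' '_')_40-150.fasta"
        # esearch gets /dev/null as stdin as a precaution, so it cannot
        # consume the rest of the organism list inside the loop.
        esearch -db protein -query "$organism [ORGN]" < /dev/null |
            efilter -query "40:150 [SLEN]" |
            efetch -format fasta > "$out"
    done < "$list_file"
}
```

You would then call it as `fetch_small_proteins organisms.txt`, where `organisms.txt` is your own list, and get one FASTA file per organism in the current directory.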