How do I find a short protein sequence in Python using a FASTA file from UniProt?
5.5 years ago
biobio • 0

I'm a Python beginner, and I'd like to find a really short protein sequence in UniProt data.

I read the file like this:

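# UniProt wraps each sequence over several lines, so join them per header
protdict = {}
with open('/Users/desktop/uniprot_sprot.fasta') as fastafile:
    for line in fastafile:
        line = line.strip()
        if line.startswith('>'):
            header = line
            protdict[header] = ''
        elif line:
            protdict[header] += line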

I want to find out whether there is a matching sequence in the data and, if there is, the name of that sequence. Please help! I would really appreciate it.
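
The lookup described here could be sketched roughly as follows, assuming the records end up in a dict that maps each header line to its full sequence. The function name `find_matches` and the three-entry `protdict` are illustrative only (the headers are shortened and the third entry is made up); a real run would use the dict built from the FASTA file.

```python
def find_matches(protdict, query):
    """Return (exact, containing) lists of headers for a query peptide.

    exact: headers whose whole sequence equals the query
    containing: headers whose sequence merely includes the query
    """
    query = query.strip().upper()
    exact = [h for h, seq in protdict.items() if seq == query]
    containing = [h for h, seq in protdict.items()
                  if query in seq and seq != query]
    return exact, containing


# Hypothetical miniature data set standing in for the parsed UniProt file.
protdict = {
    ">sp|P62968|TRH_PIG Thyrotropin-releasing hormone": "QHP",
    ">sp|P01858|TUFT_HUMAN Phagocytosis-stimulating peptide": "TKPR",
    ">demo|X00000|DEMO Made-up entry for illustration": "MATKPRQ",
}

exact, containing = find_matches(protdict, "TKPR")
print("exact matches:", exact)
print("contained in:", containing)
```

Separating exact matches from substring hits matters for very short peptides, since a three or four residue query will occur inside many longer proteins.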

sequence python protein sequence fasta • 1.4k views

Without Python: linearize the FASTA (one record per line), sort on sequence length, and keep the ten shortest entries:

wget -q -O - "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz" |
  gunzip -c |
  awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |
  awk -F '\t' '{L=length($2);printf("%d\t%s\n",L,$0);}' |
  sort -t $'\t' -k1,1n |
  head |
  cut -f 2- |
  tr "\t" "\n"

>sp|P83570|GWA_SEPOF Neuropeptide GWa OS=Sepia officinalis OX=6610 PE=1 SV=1
GW
>sp|P62968|TRH_PIG Thyrotropin-releasing hormone OS=Sus scrofa OX=9823 GN=TRH PE=1 SV=1
QHP
>sp|P62969|TRH_SHEEP Thyrotropin-releasing hormone OS=Ovis aries OX=9940 GN=TRH PE=1 SV=1
QHP
>sp|P62970|TRH_BOMOR Thyrotropin-releasing hormone OS=Bombina orientalis OX=8346 PE=1 SV=1
QHP
>sp|P62971|TRH_NOTVI Thyrotropin-releasing hormone OS=Notophthalmus viridescens OX=8316 PE=1 SV=1
QHP
>sp|P84761|ACI_MACGN Angiotensin-1-converting enzyme inhibitory peptide OS=Macrocybe gigantea OX=1491104 PE=1 SV=1
GEP
>sp|P01858|TUFT_HUMAN Phagocytosis-stimulating peptide OS=Homo sapiens OX=9606 PE=1 SV=1
TKPR
>sp|P0DPI4|TDB01_HUMAN T cell receptor beta diversity 1 OS=Homo sapiens OX=9606 GN=TRBD1 PE=4 SV=1
GTGG
>sp|P19916|DCML_PSECH Carbon monoxide dehydrogenase large chain (Fragment) OS=Pseudomonas carboxydohydrogena OX=290 GN=cutL PE=1 SV=1
MGHP
>sp|P19918|DCMS_PSECH Carbon monoxide dehydrogenase small chain (Fragment) OS=Pseudomonas carboxydohydrogena OX=290 GN=cutS PE=1 SV=1
MAKA
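
Since the question asked about Python, the same linearize-then-sort idea might look roughly like this. It is only a sketch: `read_fasta` and `shortest_records` are hypothetical helper names, and the in-memory `demo` text stands in for the real file (its middle record is made up to show a wrapped sequence).

```python
import heapq
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shortest_records(handle, n=10):
    """Return the n shortest (header, sequence) pairs, shortest first."""
    return heapq.nsmallest(n, read_fasta(handle), key=lambda rec: len(rec[1]))


# Small stand-in for uniprot_sprot.fasta; the wrapped record is invented.
demo = io.StringIO(
    ">sp|P01858|TUFT_HUMAN Phagocytosis-stimulating peptide\n"
    "TKPR\n"
    ">demo|wrapped made-up record\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
    ">sp|P83570|GWA_SEPOF Neuropeptide GWa\n"
    "GW\n"
)

top = shortest_records(demo, n=2)
for header, seq in top:
    print(header)
    print(seq)
```

For the real data, the handle would come from `open(path)` on the unpacked file, or `gzip.open(path, "rt")` on the `.gz` download. Using `heapq.nsmallest` avoids holding every record in memory at once, which is the role `sort | head` plays in the shell version.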