Question

How do I find a short protein sequence in Python using a FASTA file from UniProt?

5.5 years ago

biobio • 0

I'm a Python beginner, and I'd like to find a really short protein sequence in the UniProt data.

I open the file like this:

fastafile = open('/Users/desktop/uniprot_sprot.fasta','r')
read1 = fastafile.readlines()
protdict = {}
for i in range(0,len(read1),2):
    protdict[read1[i]]=read1[i+1]

I want to find out whether there is a matching sequence in the data and, if there is, the name of that sequence. Please help! I would really appreciate it.

sequence python protein sequence fasta • 1.4k views

updated 5.5 years ago by Pierre Lindenbaum • 161k • written 5.5 years ago by biobio • 0

Don't load everything into memory with readlines().
Instead, iterate over the FASTA records one at a time; see "Correct Way To Parse A Fasta File In Python".
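
A minimal sketch of what record-by-record parsing could look like, assuming a plain (uncompressed) FASTA file in which a sequence may wrap over several lines. The names read_fasta and find_matches are made up for this illustration and are not from any library; the small in-memory sample only stands in for the real file so the snippet runs anywhere.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            # Sequences may wrap over several lines, so collect the pieces.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_matches(handle, query, exact=True):
    """Return the headers of records whose sequence equals (or contains) query."""
    query = query.strip().upper()
    return [
        header
        for header, seq in read_fasta(handle)
        if (seq == query if exact else query in seq)
    ]


# Tiny in-memory stand-in for the real file (hypothetical sample data).
sample = io.StringIO(
    ">sp|P62968|TRH_PIG Thyrotropin-releasing hormone OS=Sus scrofa OX=9823 GN=TRH PE=1 SV=1\n"
    "QHP\n"
    ">sp|P01858|TUFT_HUMAN Phagocytosis-stimulating peptide OS=Homo sapiens OX=9606 PE=1 SV=1\n"
    "TKPR\n"
)
for name in find_matches(sample, "QHP"):
    print(name)
```

For the real data, pass the open file instead of the sample, e.g. with open('/Users/desktop/uniprot_sprot.fasta') as fastafile: find_matches(fastafile, 'QHP'). Only one record is held in memory at a time, and exact=False switches to a substring search.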

5.5 years ago by Pierre Lindenbaum • 161k

Without Python: linearize the FASTA file (one record per line), then sort by sequence length.

wget -q -O - "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz" |
  gunzip -c |
  awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |
  awk -F '\t' '{L=length($2);printf("%d\t%s\n",L,$0);}' |
  sort -t $'\t' -k1,1n |
  head |
  cut -f 2- |
  tr "\t" "\n"

>sp|P83570|GWA_SEPOF Neuropeptide GWa OS=Sepia officinalis OX=6610 PE=1 SV=1
GW
>sp|P62968|TRH_PIG Thyrotropin-releasing hormone OS=Sus scrofa OX=9823 GN=TRH PE=1 SV=1
QHP
>sp|P62969|TRH_SHEEP Thyrotropin-releasing hormone OS=Ovis aries OX=9940 GN=TRH PE=1 SV=1
QHP
>sp|P62970|TRH_BOMOR Thyrotropin-releasing hormone OS=Bombina orientalis OX=8346 PE=1 SV=1
QHP
>sp|P62971|TRH_NOTVI Thyrotropin-releasing hormone OS=Notophthalmus viridescens OX=8316 PE=1 SV=1
QHP
>sp|P84761|ACI_MACGN Angiotensin-1-converting enzyme inhibitory peptide OS=Macrocybe gigantea OX=1491104 PE=1 SV=1
GEP
>sp|P01858|TUFT_HUMAN Phagocytosis-stimulating peptide OS=Homo sapiens OX=9606 PE=1 SV=1
TKPR
>sp|P0DPI4|TDB01_HUMAN T cell receptor beta diversity 1 OS=Homo sapiens OX=9606 GN=TRBD1 PE=4 SV=1
GTGG
>sp|P19916|DCML_PSECH Carbon monoxide dehydrogenase large chain (Fragment) OS=Pseudomonas carboxydohydrogena OX=290 GN=cutL PE=1 SV=1
MGHP
>sp|P19918|DCMS_PSECH Carbon monoxide dehydrogenase small chain (Fragment) OS=Pseudomonas carboxydohydrogena OX=290 GN=cutS PE=1 SV=1
MAKA
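
For anyone who would rather stay in Python, here is a rough equivalent of the same idea: stream the records and keep only the shortest ones. This is a sketch under assumptions, not a port of the pipeline above; read_fasta and shortest_records are hypothetical helper names, and the order of records with equal length may differ from what sort produces.

```python
import heapq
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shortest_records(handle, n=10):
    """Return the n records with the shortest sequences, shortest first."""
    # nsmallest keeps at most n candidates while streaming through the file.
    return heapq.nsmallest(n, read_fasta(handle), key=lambda rec: len(rec[1]))


# Tiny in-memory stand-in for the real file (hypothetical sample data).
sample = io.StringIO(">x\nTKPR\n>y\nGW\n>z\nQHP\n")
for header, seq in shortest_records(sample, n=2):
    print(">" + header)
    print(seq)
```

To run this on the compressed download directly, the handle could come from gzip.open(path, "rt") in the standard library, which yields text lines just like a regular open file.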

5.5 years ago by Pierre Lindenbaum • 161k