Protein domain sequence extraction with Python
2
0
Entering edit mode
5.9 years ago

Hi all,

I would like to extract specific domains from protein sequence. To do that I have two files + python script (from http://www.bioinformatics-made-simple.com/2013/10/actually-i-have-hundreds-of-protein.html). For example (I don't know if it is possible to attach files...):

File 1 (.fasta; tab delimited):

>sp|P17672.2|E75BA_DROMERecName:Full=Ecdysone-inducedprotein75B,isoformA;Short=E75-B;AltName:Full=Nuclearreceptorsubfamily1groupDmember3,isoformA
MVCAMQEVAAVQHQQQQQQLQLPQQQQQQQQTTQQQHATTIVLLTGNGGGNLHIVATPQQHQPMHQLHHQ
HQHQHQHQQQAKSQQLKQQHSALVKLLESAPIKQQQQTPKQIVYLQQQQQQPQRKRLKNEAAIVQQQQQT
PATLVKTTTTSNSNSNNTQTTNSISQQQQQHQIVLQHQQPAAAATPKPCADLSAKNDSESGIDEDSPNSD
EDCPNANPAGTSLEDSSYEQYQCPWKKIRYARELKQRELEQQQTTGGSNAQQQVEAKPAAIPTSNIKQLH
CDSPFSAQTHKEIANLLRQQSQQQQVVATQQQQQQQQQHQHQQQRRDSSDSNCSLMSNSSNSSAGNCCTC
NAGDDQQLEEMDEAHDSGCDDELCEQHHQRLDSSQLNYLCQKFDEKLDTALSNSSANTGRNTPAVTANED
ADGFFRRSIQQKIQYRPCTKNQQCSILRINRNRCQYCRLKKCIAVGMSRDAVRFGRVPKREKARILAAMQ
QSTQNRGQQRALATELDDQPRLLAAVLRAHLETCEFTKEKVSAMRQRARDCPSYSMPTLLACPLNPAPEL
QSEQEFSQRFAHVIRGVIDFAGMIPGFQLLTQDDKFTLLKAGLFDALFVRLICMFDSSINSIICLNGQVM
RRDAIQNGANARFLVDSTFNFAERMNSMNLTDAEIGLFCAIVLITPDRPGLRNLELIEKMYSRLKGCLQY
IVAQNRPDQPEFLAKLLETMPDLRTLSTLHTEKLVVFRTEHKELLRQQMWSMEDGNNSDGQQNKSPSGSW
ADAMDVEAAKSPLGSVSSTESADLDYGSPSSSQPQGVSLPSPPQQQPSALASSAPLLAATLSGGCPLRNR
ANSGSSGDSGAAEMDIVGSHAHLTQNGLTITPIVRHQQQQQQQQQIGILNNAHSRNLNGGHAMCQQQQQH
PQLHHHLTAGAARYRKLDSPTDSGIESGNEKNECKAVSSGGSSSCSSPRSSVDDALDCSDAAANHNQVVQ
HPQLSVVSVSPVRSPQPSTSSHLKRQIVEDMPVLKRVLQAPPLYDTNSLMDEAYKPHKKFRALRHREFET
AEADASSSTSGSNSLSAGSPRQSPVPNSVATPPPSAASAAAGNPAQSQLHMHLTRSSPKASMASSHSVLA
KSLMAEPRMTPEQMKRSDIIQNYLKRENSTAASSTTNGVGNRSPSSSSTPPPSAVQNQQRWGSSSVITTT
CQQRQQSVSPHSNGSSSSSSSSSSSSSSSSSTSSNCSSSSASSCQYFQSPHSTSNGTSAPASSSSGSNSA
TPLLELQVDIADSAQPLNLSKKSPTPPPSKLHALVAAANAVQRYPTLSADVTVTASNGGPPSAAASPAPS
SSPPASVGSPNPGLSAAVHKVMLEA

>sp|P45447.4|E78C_DROMERecName:Full=Ecdysone-inducedprotein78C;Short=DR-78;AltName:Full=Nuclearreceptorsubfamily1groupEmember1
MDVYQIELEEQAQIRSKLLVETCVKHSSSEQQQLQVKQEDLIKDFTRDEEEQPSEEEAEEEDNEEDEEEE
GEEEEEDEDEEALLPVVNFNANSDFNLHFFDTPEDSSTQGAYSEANSLESEQEEEKQTQQHQQQKQHHRD
LEDCLSAIEADPLQLLHCDDFYRTSALAESVAASLSPQQQQQRQHTHQQQQQQQQQQQHPGQQQHQLNCT
LSNGGGALYTISSVHQFGPASNHNTSSSSPSSSAAHSSPDSGCSSASSSGSSRSCGSSSASSSSSAVSST
ISSGRSSNNSVVNPAATSSSVAHLNKEQQQQPLPTTQLQQQQQHQQQLQHPQQQQSFGLADSSSSNGSSN
NNNGVSSKSFVPCKVCGDKASGYHYGVTSCEGCKGFFRRSIQKQIEYRCLRDGKCLVIRLNRNRCQYCRF
KKCLSAGMSRDSVRYGRVPKRSRELNGAAASSAAAGAPASLNVDDSTSSTLHPSHLQQQQQQHLLQQQQQ
QQHQPQLQQHHQLQQQPHVSGVRVKTPSTPQTPQMCSIASSPSELGGCNSANNNNNNNNNSSSGNASGGS
GVSVGVVVVGGHQQLVGGSMVGMAGMGTDAHQVGMCHDGLAGTANELTVYDVIMCVSQAHRLNCSYTEEL
TRELMRRPVTVPQNGIASTVAESLEFQKIWLWQQFSARVTPGVQRIVEFAKRVPGFCDFTQDDQLILIKL
GFFEVWLTHVARLINEATLTLDDGAYLTRQQLEILYDSDFVNALLNFANTLNAYGLSDTEIGLFSAMVLL
ASDRAGLSEPKVIGRARELVAEALRVQILRSRAGSPQALQLMPALEAKIPELRSLGAKHFSHLDWLRMNW
TKLRLPPLFAEIFDIPKADDEL

File 2 (.txt; tab delimited):

sp|P17672.2|E75BA_DROMERecName:Full=Ecdysone-inducedprotein75B,isoformA;Short=E75-B;AltName:Full=Nuclearreceptorsubfamily1groupDmember3,isoformA    415 484

sp|P45447.4|E78C_DROMERecName:Full=Ecdysone-inducedprotein78C;Short=DR-78;AltName:Full=Nuclearreceptorsubfamily1groupEmember1   363 443

Script file domainseq.py):

import sys
import re

FASTA= sys.argv[1]
BED= sys.argv[2]

fasta= open(FASTA, 'U')
fasta_dict= {}
for line in fasta:
    line= line.strip()


 if line == '':
        continue
    if line.startswith('>'):
        seqname= line.lstrip('>')
        seqname= re.sub('\..*', '', seqname)
        fasta_dict[seqname]= ''
    else:
        fasta_dict[seqname] += line
fasta.close()

bed= open(BED, 'U')
for line in bed:
    line= line.strip().split('\t')
    outname= line[0] + ':' + line[1] + '-' + line[2]
    print('>' + outname)
    s= int(line[1])
    e= int(line[2])
    print(fasta_dict[line[0]][s:e])
bed.close()
sys.exit()

And finally the error I get:

File "domainseq.py", line 27, in <module>
outname= line[0] + ':' + line[1] + '-' + line[2]
IndexError: list index out of range

Do you have any suggestions? Thank you very much, Julien

sequence gene python • 1.8k views
ADD COMMENT
2
Entering edit mode
5.9 years ago
Gungor Budak ▴ 270

File 1 has some formatting issues. It should have a header and one or multiple sequence lines. Like this:

>sp|P17672.2|E75BA_DROMERecName:Full=Ecdysone-inducedprotein75B,isoformA;Short=E75-B;AltName:Full=Nuclearreceptorsubfamily1groupDmember3,isoformA
MVCAMQEVAAVQHQQQQQQLQLPQQQQQQQQTTQQQHATTIVLLTGNGGGNLHIVATPQQHQPMHQLHHQHQHQHQHQQQAKSQQLKQQHSALVKLLESAPIKQQQQTPKQIVYLQQQQQQPQRKRLKNEAAIVQQQQQTPATLVKTTTTSNSNSNNTQTTNSISQQQQQHQIVLQHQQPAAAATPKPCADLSAKNDSESGIDEDSPNSDEDCPNANPAGTSLEDSSYEQYQCPWKKIRYARELKQRELEQQQTTGGSNAQQQVEAKPAAIPTSNIKQLHCDSPFSAQTHKEIANLLRQQSQQQQVVATQQQQQQQQQHQHQQQRRDSSDSNCSLMSNSSNSSAGNCCTCNAGDDQQLEEMDEAHDSGCDDELCEQHHQRLDSSQLNYLCQKFDEKLDTALSNSSANTGRNTPAVTANEDADGFFRRSIQQKIQYRPCTKNQQCSILRINRNRCQYCRLKKCIAVGMSRDAVRFGRVPKREKARILAAMQQSTQNRGQQRALATELDDQPRLLAAVLRAHLETCEFTKEKVSAMRQRARDCPSYSMPTLLACPLNPAPELQSEQEFSQRFAHVIRGVIDFAGMIPGFQLLTQDDKFTLLKAGLFDALFVRLICMFDSSINSIICLNGQVMRRDAIQNGANARFLVDSTFNFAERMNSMNLTDAEIGLFCAIVLITPDRPGLRNLELIEKMYSRLKGCLQYIVAQNRPDQPEFLAKLLETMPDLRTLSTLHTEKLVVFRTEHKELLRQQMWSMEDGNNSDGQQNKSPSGSWADAMDVEAAKSPLGSVSSTESADLDYGSPSSSQPQGVSLPSPPQQQPSALASSAPLLAATLSGGCPLRNRANSGSSGDSGAAEMDIVGSHAHLTQNGLTITPIVRHQQQQQQQQQIGILNNAHSRNLNGGHAMCQQQQQHPQLHHHLTAGAARYRKLDSPTDSGIESGNEKNECKAVSSGGSSSCSSPRSSVDDALDCSDAAANHNQVVQHPQLSVVSVSPVRSPQPSTSSHLKRQIVEDMPVLKRVLQAPPLYDTNSLMDEAYKPHKKFRALRHREFETAEADASSSTSGSNSLSAGSPRQSPVPNSVATPPPSAASAAAGNPAQSQLHMHLTRSSPKASMASSHSVLAKSLMAEPRMTPEQMKRSDIIQNYLKRENSTAASSTTNGVGNRSPSSSSTPPPSAVQNQQRWGSSSVITTTCQQRQQSVSPHSNGSSSSSSSSSSSSSSSSSTSSNCSSSSASSCQYFQSPHSTSNGTSAPASSSSGSNSATPLLELQVDIADSAQPLNLSKKSPTPPPSKLHALVAAANAVQRYPTLSADVTVTASNGGPPSAAASPAPSSSPPASVGSPNPGLSAAVHKVMLEA

>sp|P45447.4|E78C_DROMERecName:Full=Ecdysone-inducedprotein78C;Short=DR-78;AltName:Full=Nuclearreceptorsubfamily1groupEmember1
MDVYQIELEEQAQIRSKLLVETCVKHSSSEQQQLQVKQEDLIKDFTRDEEEQPSEEEAEEEDNEEDEEEEGEEEEEDEDEEALLPVVNFNANSDFNLHFFDTPEDSSTQGAYSEANSLESEQEEEKQTQQHQQQKQHHRDLEDCLSAIEADPLQLLHCDDFYRTSALAESVAASLSPQQQQQRQHTHQQQQQQQQQQQHPGQQQHQLNCTLSNGGGALYTISSVHQFGPASNHNTSSSSPSSSAAHSSPDSGCSSASSSGSSRSCGSSSASSSSSAVSSTISSGRSSNNSVVNPAATSSSVAHLNKEQQQQPLPTTQLQQQQQHQQQLQHPQQQQSFGLADSSSSNGSSNNNNGVSSKSFVPCKVCGDKASGYHYGVTSCEGCKGFFRRSIQKQIEYRCLRDGKCLVIRLNRNRCQYCRFKKCLSAGMSRDSVRYGRVPKRSRELNGAAASSAAAGAPASLNVDDSTSSTLHPSHLQQQQQQHLLQQQQQQQHQPQLQQHHQLQQQPHVSGVRVKTPSTPQTPQMCSIASSPSELGGCNSANNNNNNNNNSSSGNASGGSGVSVGVVVVGGHQQLVGGSMVGMAGMGTDAHQVGMCHDGLAGTANELTVYDVIMCVSQAHRLNCSYTEELTRELMRRPVTVPQNGIASTVAESLEFQKIWLWQQFSARVTPGVQRIVEFAKRVPGFCDFTQDDQLILIKLGFFEVWLTHVARLINEATLTLDDGAYLTRQQLEILYDSDFVNALLNFANTLNAYGLSDTEIGLFSAMVLLASDRAGLSEPKVIGRARELVAEALRVQILRSRAGSPQALQLMPALEAKIPELRSLGAKHFSHLDWLRMNWTKLRLPPLFAEIFDIPKADDEL

File 2 is not Tab separated, see below:

sp|P17672.2|E75BA_DROMERecName:Full=Ecdysone-inducedprotein75B,isoformA;Short=E75-B;AltName:Full=Nuclearreceptorsubfamily1groupDmember3,isoformA    415 484

sp|P45447.4|E78C_DROMERecName:Full=Ecdysone-inducedprotein78C;Short=DR-78;AltName:Full=Nuclearreceptorsubfamily1groupEmember1   363 443

The script wasn't working due to formatting as the second file wasn't Tab separated and it couldn't really split using \t. It also had some minor issues I fixed:

import sys
import re

FASTA = sys.argv[1]
BED = sys.argv[2]

fasta= open(FASTA, 'U')
fasta_dict= {}
for line in fasta:
    line= line.strip()
    if line == '':
        continue
    elif line.startswith('>'):
        seqname= line.lstrip('>')
        seqname= re.sub('\..*', '', seqname)
        fasta_dict[seqname]= ''
    else:
        fasta_dict[seqname] += line
fasta.close()

bed= open(BED, 'U')
for line in bed:
    line= line.strip().split('\t')
    if line[0] == '':
        continue
    outname= line[0] + ':' + line[1] + '-' + line[2]
    seqname= line[0].lstrip('>')
    seqname= re.sub('\..*', '', seqname)
    print('>' + outname)
    s= int(line[1])
    e= int(line[2])
    print(fasta_dict[seqname][s:e])
bed.close()
sys.exit()

Now, when executed:

Gungors-MacBook-Pro:Desktop gungor$ python script.py file1.fa file2.tsv
>sp|P17672.2|E75BA_DROMERecName:Full=Ecdysone-inducedprotein75B,isoformA;Short=E75-B;AltName:Full=Nuclearreceptorsubfamily1groupDmember3,isoformA:415-484
TANEDADGFFRRSIQQKIQYRPCTKNQQCSILRINRNRCQYCRLKKCIAVGMSRDAVRFGRVPKREKAR
>sp|P45447.4|E78C_DROMERecName:Full=Ecdysone-inducedprotein78C;Short=DR-78;AltName:Full=Nuclearreceptorsubfamily1groupEmember1:363-443
KVCGDKASGYHYGVTSCEGCKGFFRRSIQKQIEYRCLRDGKCLVIRLNRNRCQYCRFKKCLSAGMSRDSVRYGRVPKRSR
ADD COMMENT
0
Entering edit mode

Thank you very much!!! :)

ADD REPLY
1
Entering edit mode
5.9 years ago

Code:

> import os
> from pyfaidx import Fasta
> test=Fasta("test.fa")
> with open ("ids.txt",) as f:
        ids=f.readlines()
> sids=[s.split() for s in ids]
> final_results=[test[i[0]][int(i[1]):int(i[2])] for i in sids]
> with open ("output.txt","w") as f:
        for i in final_results:
            print (">"+i.name,"\n",i.seq,file=f)

output:

>sp|P17672.2|E75BA_DROMERecName:Full=Ecdysone-inducedprotein75B,isoformA;Short=E75-B;AltName:Full=Nuclearreceptorsubfamily1groupDmember3,isoformA 
 TANEDADGFFRRSIQQKIQYRPCTKNQQCSILRINRNRCQYCRLKKCIAVGMSRDAVRFGRVPKREKAR
>sp|P45447.4|E78C_DROMERecName:Full=Ecdysone-inducedprotein78C;Short=DR-78;AltName:Full=Nuclearreceptorsubfamily1groupEmember1 
 KVCGDKASGYHYGVTSCEGCKGFFRRSIQKQIEYRCLRDGKCLVIRLNRNRCQYCRFKKCLSAGMSRDSVRYGRVPKRSR

input filters (input fasta is not pasted as it is large):

$ cat ids.txt 
sp|P17672.2|E75BA_DROMERecName:Full=Ecdysone-inducedprotein75B,isoformA;Short=E75-B;AltName:Full=Nuclearreceptorsubfamily1groupDmember3,isoformA    415 484
sp|P45447.4|E78C_DROMERecName:Full=Ecdysone-inducedprotein78C;Short=DR-78;AltName:Full=Nuclearreceptorsubfamily1groupEmember1   363 443
ADD COMMENT

Login before adding your answer.

Traffic: 2650 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6