Hi all, I have about 70 protein domains found with HMMER3 that are 3 or 4 residues shorter than the corresponding domains in the database. I want to write a program in Biopython to retrieve the missing residues.
HMMER sequence:
tr|E7EWP2 KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE
Database record sequence:
tr|E7EWP2 ARRKEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVEEIP
First, I would like to split the original database sequence around my part of the sequence, but I am having problems with this.
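from Bio import SeqIO

db1 = "sample_db.fasta"  # contains db_records
db2 = "sample.fasta"     # contains my result

# SeqIO.read() returns a single record, so build ID -> record dictionaries instead.
dbase_dict = SeqIO.to_dict(SeqIO.parse(db1, "fasta"))
my_record_dict = SeqIO.to_dict(SeqIO.parse(db2, "fasta"))

for record in my_record_dict:  # iterates over the record IDs
    if record in dbase_dict:
        rec_dbase = dbase_dict[record]
        rec_mine = my_record_dict[record]
        # Split on the sequence itself, not on the literal string "rec_mine.seq".
        # split() drops the separator, so only the flanking pieces are returned.
        list_seq = str(rec_dbase.seq).split(str(rec_mine.seq))
        print(list_seq)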
I would like to get list_seq = ['ARR', 'KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE', 'EIP'] and then use strip to retrieve the first and the last 3 residues. But split does not work.
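One possible way to get the three-element list described above is Python's built-in str.partition, which keeps the matched part as the middle element. This is only a minimal sketch on plain strings: split_around is a hypothetical helper name, and with Biopython records you would pass str(record.seq) for both arguments.

```python
def split_around(full_seq, domain_seq):
    """Return [before, domain, after], or None if the domain is not found."""
    # partition() splits at the first occurrence and keeps the separator.
    before, match, after = full_seq.partition(domain_seq)
    if not match:
        return None  # domain_seq does not occur in full_seq
    return [before, match, after]


# Sequences taken from the example in the question.
full = "ARRKEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVEEIP"
hit = "KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE"

list_seq = split_around(full, hit)
print(list_seq)
```

The first and last elements of list_seq are then the flanking residues, so no strip step should be needed. Note that only the first occurrence is used if the domain sequence appears more than once.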
Thanks in advance
UniProt is a repository of proteins, not domains (though SMART/PFAM/... entries may be linked), so please clarify this part. Can't you use the full domains from the source databases once you have identified them?
Yes, but we just need the domains, not the whole sequences, and extracting them by ID is painful. I'm trying to write some code that can do this.
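If the goal is the extended domain rather than the separate pieces, one option is to locate the hit in the full sequence and widen the slice by a few residues on each side. The sketch below works on plain strings; extend_hit and the flank parameter are hypothetical names, and the flank size of 3 is an assumption based on the 3 or 4 missing residues mentioned in the question.

```python
def extend_hit(full_seq, hit_seq, flank=3):
    """Return hit_seq plus up to `flank` residues on each side, or None."""
    start = full_seq.find(hit_seq)
    if start == -1:
        return None  # the hit is not a substring of the full sequence
    end = start + len(hit_seq)
    # max() stops the left edge going negative; slicing clips the right edge.
    return full_seq[max(0, start - flank):end + flank]


# Sequences taken from the example in the question.
full_seq = "ARRKEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVEEIP"
hit_seq = "KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE"

extended = extend_hit(full_seq, hit_seq, flank=3)
print(extended)
```

With Biopython, the two strings could come from str(record.seq) of records looked up by the same ID in both FASTA files.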