I have some 70 protein domains that have been found using HMMER3 that have shorter residues,the domains are short of 3 or 4 residues than that of the domains in the database.I want to write a program in biopython to retrieve the missing residues
First i would like to split my part of the sequence from the original database sequence but i have problems with this.
from Bio import SeqIO
db1 = "sample_db.fasta" # contains db_records
db2 = "sample.fasta" # contains my result
dbase_dict = SeqIO.read(db1, "fasta")
my_record_dict = SeqIO.read(db2, "fasta")
for record in my_record_dict :
if record in dbase_dict:
rec_dbase = dbase_dict[record]
rec_mine = my_record_dict[record]
list_seq = rec_dbase.seq.split("rec_mine.seq")
i would like to get list_seq = [ 'ARR' 'KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE' 'EIP' ] and then i would use strip command to retrieve the first and the last 3 residues .But split does not work.
Problem: extract just the 255-aa protein kinase domains from a set of 70 sequences.
You don't really need to use the split() method; slicing the original sequence should do it once you've calculated the sequence coordinates.
If you're already parsing some of HMMer's plain-text output, you can look at the fields "hmmfrom" and "hmm to" to see how many positions are missing from the start and end of the sequence, then look at the "alifrom" and "ali to" fields to determine the positions of the original sequence that matched the HMM profile. Then, simple arithmetic to get the coordinates in the original sequence:
The sequence coordinates of the aligned region also appear in the Stockholm output, in the sequence identifier after the slash character -- so you can also count the number of terminal dashes, minus any leading or trailing insert (lowercase) characters.
Keep in mind that for some purposes, the missing sequence positions might not be meaningful -- e.g. many biologically functional kinases are missing one or more subdomains, so the sequence portions that were left out by the HMM profile could be phylogenetically or structurally unrelated to the typical kinase subdomain. Though if the missing portions are only 3-4 aa I think you're fine, it's just HMMer being stubborn.