Question

How To Use Split Command In Biopython

1

Entering edit mode

13.5 years ago

Hari ▴ 280

Hi all, I have some 70 protein domains that have been found using HMMER3 that have shorter residues,the domains are short of 3 or 4 residues than that of the domains in the database.I want to write a program in biopython to retrieve the missing residues

Hmmer sequence:

tr|E7EWP2 KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE

database record sequence.

tr|E7EWP2 ARRKEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVEEIP

First i would like to split my part of the sequence from the original database sequence but i have problems with this.

from Bio import SeqIO

db1 = "sample_db.fasta"  # contains db_records
db2 = "sample.fasta"     # contains my result

dbase_dict = SeqIO.read(db1, "fasta")
my_record_dict = SeqIO.read(db2, "fasta")

for record in my_record_dict :
    if record in dbase_dict:
       print dbase_dict.seq.split(my_record_dict.seq)
       rec_dbase = dbase_dict[record]
         rec_mine = my_record_dict[record]  
         list_seq = rec_dbase.seq.split("rec_mine.seq")

i would like to get list_seq = [ 'ARR' 'KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE' 'EIP' ] and then i would use strip command to retrieve the first and the last 3 residues .But split does not work.

Thanks in advance

domain retrieval hmmer biopython split • 5.1k views

ADD COMMENT • link updated 13.5 years ago by Eric T. ★ 2.8k • written 13.5 years ago by Hari ▴ 280

0

Entering edit mode

UniProt is a repository of proteins, not domains (though SMART/PFAM/...) may be linked. So please clarify this part. Can't you use the full domains from the source dbs once you have identified them?

ADD REPLY • link 13.5 years ago by Michael Kuhn 5.0k

0

Entering edit mode

yes but we just need the domains not the whole sequences extracting it with the ids is painful,i m trying to write some code that can do this.

ADD REPLY • link 13.5 years ago by Hari ▴ 280

score 2 · Answer 1 · 2012-01-03

I guess what you mean is this: When using hmmsearch for the Pkinase domain from PFAM against SRC, the output is:

>> SRC
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !  169.7   0.0   4.1e-54   4.1e-54       3     254 ..     272     515 ..     270     522 .. 0.88
  Alignments for each domain:  
  == domain 1    score: 169.7 bits;  conditional E-value: 4.1e-54
              EEEEEE----EEEEEEEETTCCEEEEEEEEEGGGHHHHHHHHHHHHHHHHHHHH-TTB--EEEEEEESTEEEEEEE--SC-EHHHHHH..HHCCB-HHHHHH CS
  pkinase   3 lleklGsGsfGkVykakkkktgkkvAvKilkkeeekskkektavrElkilkklsHpnivkllevfeekdelylvleyveggdlfdllk..kegklseeeikk 102
              l+ klG+G fG+V+ +++++t + vA+K+lk  + + ++  +   E++++kkl+H+ +v+l++v+ +++ +y+v+ey+ +g+l d+lk  + + l+  ++ +
      SRC 272 LEVKLGQGCFGEVWMGTWNGTTR-VAIKTLKPGTMSPEAFLQ---EAQVMKKLRHEKLVQLYAVV-SEEPIYIVTEYMSKGSLLDFLKgeTGKYLRLPQLVD 368
              6779*****************96.************999755...*******************9.6889******************9888899******* PP

              HHHHHHHHHHHHHHTCEE-S--SGGGEEEETT--EEE------EEE.CEESS-CSS----GGGS-HHHHHTSHHHHHHHHHHHHHHHHHHHHC.---TTTST CS
  pkinase 103 ialqilegleylHsngiiHrDLKpeNiLldkkgelkiaDFGlakkl.eksseklttlvgtreYmAPEvllkakeytkkvDvWslGvilyellt.gklpfsge 202
              +a qi++g++y+ +++ +HrDL  +NiL+ ++ ++k+aDFGla+ +  ++++  +      ++ APE  l     t k+DvWs+G++l el+t g++p++g
      SRC 369 MAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIeDNEYTARQGAKFPIKWTAPEAAL-YGRFTIKSDVWSFGILLTELTTkGRVPYPGM 469
              ********************************************764444445555566679********.999*****************99678999999 PP

              HHHHHHHHHCCCHHTTS--GGGGCSHHHHHHHHHHHHHT-SSGGGS--HHHH CS
  pkinase 203 seedqleliekilkkkleedepkssskseelkdlikkllekdpakRltaeei 254
               +++ l  +e+  +     ++p +++  e+l dl++++++k+p++R+t+e +
      SRC 470 VNREVLDQVERGYRM----PCPPECP--ESLHDLMCQCWRKEPEERPTFEYL 515
              555555555555554....4454544..89******************9865 PP

That is, it starts with residue 3. When you use hmmalign --alllcol, you see that hmmer doesn't want to include the leucine and arginine at the start:

# STOCKHOLM 1.0

SRC          mgsnkskpkdasqrrrslepaenvhgagggafpasqtpskpasadghrgp
#=GR SRC PP  **************************************************
#=GC PP_cons ..................................................
#=GC RF      ..................................................

SRC          saafapaaaepklfggfnssdtvtspqragplaggvttfvalydyesrte
#=GR SRC PP  **************************************************
#=GC PP_cons ..................................................
#=GC RF      ..................................................

SRC          tdlsfkkgerlqivnntegdwwlahslstgqtgyipsnyvapsdsiqaee
#=GR SRC PP  **************************************************
#=GC PP_cons ..................................................
#=GC RF      ..................................................

SRC          wyfgkitrreserlllnaenprgtflvresettkgayclsvsdfdnakgl
#=GR SRC PP  **************************************************
#=GC PP_cons ..................................................
#=GC RF      ..................................................

SRC          nvkhykirkldsggfyitsrtqfnslqqlvayyskhadglchrlttvcpt
#=GR SRC PP  **************************************************
#=GC PP_cons ..................................................
#=GC RF      ..................................................

SRC          skpqtqglakdaweipreslr--LEVKLGQGCFGEVWMGTWNGTTR-VAI
#=GR SRC PP  *******************75..6779*****************96.***
#=GC PP_cons .......................6779*****************96.***
#=GC RF      .....................xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

SRC          KTLKPGTMSPEAFLQ---EAQVMKKLRHEKLVQLYAVV-SEEPIYIVTEY
#=GR SRC PP  *********999755...*******************9.6889*******
#=GC PP_cons *********999755...*******************9.6889*******
#=GC RF      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

SRC          MSKGSLLDFLKgeTGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAA
#=GR SRC PP  ***********9888899********************************
#=GC PP_cons ***********..88899********************************
#=GC RF      xxxxxxxxxxx..xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

SRC          NILVGENLVCKVADFGLARLIeDNEYTARQGAKFPIKWTAPEAAL-YGRF
#=GR SRC PP  *******************764444445555566679********.999*
#=GC PP_cons *******************76.444445555566679********.999*
#=GC RF      xxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxx

SRC          TIKSDVWSFGILLTELTTkGRVPYPGMVNREVLDQVERGYRM----PCPP
#=GR SRC PP  ****************99678999999555555555555554....4454
#=GC PP_cons ****************99.78999999555555555555554....4454
#=GC RF      xxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

SRC          ECP--ESLHDLMCQCWRKEPEERPTFEYL------qafledyftstepqy
#=GR SRC PP  544..89******************8765......56889999*******
#=GC PP_cons 544..89******************8765.....................
#=GC RF      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...............

SRC          qpgenl
#=GR SRC PP  ******
#=GC PP_cons ......
#=GC RF      ......
//

So one thing you could do to force your protein into the domain model is to count the "-" at the beginning and the end and include the correct number of residues.

score 2 · Answer 2 · 2012-01-06

Problem: extract just the 255-aa protein kinase domains from a set of 70 sequences.

You don't really need to use the split() method; slicing the original sequence should do it once you've calculated the sequence coordinates.

If you're already parsing some of HMMer's plain-text output, you can look at the fields "hmmfrom" and "hmm to" to see how many positions are missing from the start and end of the sequence, then look at the "alifrom" and "ali to" fields to determine the positions of the original sequence that matched the HMM profile. Then, simple arithmetic to get the coordinates in the original sequence:

realstart = alifrom - hmmstart
realend = ali_to + (255 - hmm_to)
realdomain = rec_mine[realstart:realend]

The sequence coordinates of the aligned region also appear in the Stockholm output, in the sequence identifier after the slash character -- so you can also count the number of terminal dashes, minus any leading or trailing insert (lowercase) characters.

Keep in mind that for some purposes, the missing sequence positions might not be meaningful -- e.g. many biologically functional kinases are missing one or more subdomains, so the sequence portions that were left out by the HMM profile could be phylogenetically or structurally unrelated to the typical kinase subdomain. Though if the missing portions are only 3-4 aa I think you're fine, it's just HMMer being stubborn.

score 1 · Answer 3 · 2012-01-06

1

Entering edit mode

13.5 years ago

Dpsguy ▴ 140

I dunno much about HMMER but you could attempt a biopython solution along these lines (since your hmmer result sequence is "included" in the database sequence):

from Bio import SeqIO
import re

hmmer_seq_string = str(hmmer_result.seq)
dbase_seq_string = str(dbase_record.seq)
match = re.search(hmmer_seq_string, dbase_seq_string)
s = match.start()
e = match.end()
first_residues = dbase_seq_string[:s]
last_residues = dbase_seq_string[e:]
print first_residues
print last_residues