Biopython - Calculation Used In Motif Building
1
0
Entering edit mode
11.7 years ago
Olivier ▴ 440

I was trying to build a motif using Biopython 1.58 and I tried to cross-check the amino acid frequency from one column out of curiosity. Why do the values not correspond to the 'exact' value? e.g. given for 1st column there are amino acids: N,S,P,S,V,P,S,N,I,I,I,H,V,V,L. The output of the motif object is: {... 'N': 0.1279761904761905, 'P': 0.1279761904761905, 'S': 0.19047619047619052, ...}. Shoudn't 'N' be 2/15 = 0.06667, 'P' be 2/15 = 0.1333, and 'S' be 3/15 = 0.2? May be I misunderstood, or is there some other transformation done to the amino acid frequency used to build the sequence motif? Grateful if someone could clarify.

Thanks

biopython motif • 3.8k views
ADD COMMENT
0
Entering edit mode

According to your example, N should also be 2/15 and indeed has the same value as P. We also need more details such as: which Biopython module was used?

ADD REPLY
0
Entering edit mode

You're right. My mistake. Thanks. I used the Bio.Motif module.

ADD REPLY
0
Entering edit mode

Further to Neil's helpful comment, could you actually show us the code you are using with Biopython to get these numbers?

ADD REPLY
1
Entering edit mode

Here's the script. Still learning..

from Bio.Motif import Motif
from Bio.Alphabet import IUPAC, Gapped
from Bio.Align.Applications import MuscleCommandline
from Bio import SeqIO, AlignIO, Entrez
import os

#preparing script for communicating with NCBI's Entrez 
os.environ['http_proxy'] = 'http://130.14.29.110:80'
Entrez.email = 'myemailaddress'

#searching for wrky sequence
search_handle = Entrez.esearch(term='wrky AND (tomato[orgn] OR potato[orgn] OR pepper[orgn] OR arabidopsis[orgn] OR)',
                        db='protein')

id_list = Entrez.read(search_handle)['IdList'][:15]

#Retrieving sequences
gb_handle = Entrez.efetch(db='protein', id=id_list, rettype='gb')

fasta_records = SeqIO.convert(gb_handle, 'gb', '15_wrky.fas', 'fasta')

#Aligning sequences
muscle_cline = MuscleCommandline(input='/home/olivier/15_wrky.fas',
                             out='/home/olivier/15_wrky.aln',
                             clwstrict=True)
stdout, stderr = muscle_cline()

#Loading the msa and building a sequence motif
alphabet = Gapped(IUPAC.protein)
alignment_file = AlignIO.read('/home/olivier/15_wrky.aln','clustal',
                              alphabet=alphabet)
m = Motif(alphabet=alphabet)

for sequence in alignment_file:
    m.add_instance(sequence.seq[367:430])

print(m.pwm())
ADD REPLY
0
Entering edit mode

I edited this by indenting lines of code with 4 spaces (to display code properly).

ADD REPLY
2
Entering edit mode
11.7 years ago
Neilfws 49k

is there some other transformation done to the amino acid frequency used to build the sequence motif

I believe that the answer is "yes" and a clue as to the transformation is found in the source code for the pwm method:

if laplace=True (default), pseudocounts equal to self.background multiplied by self.beta are added to all positions.

ADD COMMENT
1
Entering edit mode

That text is in the docstring, so doing help(m.pwm) or help(m) would have told you this (where m is a Motif object as in the example).

ADD REPLY
0
Entering edit mode

Thanks Peter. I had not thought of it.

ADD REPLY
0
Entering edit mode

If you find docstrings missing important information please let us know or suggest improvements. Thanks!

ADD REPLY
0
Entering edit mode

i.e. If you want raw counts, use:

m.pwm(laplace=False)
ADD REPLY
0
Entering edit mode

Thanks a lot Peter for clarifying. I'm still going through the Biopython tutorial as a preparation for my research project. Hope that my questions are not too basic, as I'm from a molecular biology background.

ADD REPLY

Login before adding your answer.

Traffic: 1412 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6