Hi guys,
I have 241 nucleotide sequences (~1500bp) that I would like to calculate all pairwise sequence identities for.
I wrote a tool for this, but it keeps crashing for some reason.
Does anyone know of a (online) tool that will allow me to get this information?
thanks
from itertools import combinations
from Bio import SeqIO
from Bio import pairwise2
seqs = SeqIO.to_dict(SeqIO.parse(open('file.fasta'),'fasta'))
for sr1, sr2 in combinations(seqs, 2):
aln = pairwise2.align.globalxx(str(seqs[sr1].seq), str(seqs[sr2].seq))[0]
print sr1, sr2, aln[2]/float(aln[4])*100
thanks, it seems to be working well for the example you provided. however, Running two 1500bp alignments causes my entire machine to freeze. i did not expect pairwise2 do be this memory intensive...
For two sequences of length 1500bp, typical implementation requires 1500x1500=2.25MB memory. Even a careless implementation will not use more than 3x4x2.25MB~30MB. This is the memory if you use the programs I recommended. The core of pairwise2 is implemented in C. I would not expect RAM to be a problem. Nonetheless, I do not use pairwise2. What I said above may not be applicable to it.
thanks. I took a.zielinski's framework and edited it to use muscle alignments. It works pretty well. I will post my code when I get home tonight! I agree, I am surprised at the memory use of pairwise2. However, both on my PC at work and my mac at home it eats up all memory and takes forever...
Why not use ssearch from FASTA3 or swat from phrap?