Counting mismatched characters from sequences from a fasta file in comparison to a reference sequence
1
0
Entering edit mode
3.9 years ago
nameuser ▴ 30

I am attempting to use Biopython to compare all the sequences (~400,000) to a reference sequence. I parsed the fasta file using: for record in SeqIO.parse("filename.fasta", "fasta").

Can someone help with the code to loop over the sequencing in the fasta file and compare it with a string, the reference sequences.

I hope to count +1 each time a character is mismatched.

Thank you!

biopython python fasta • 1.9k views
ADD COMMENT
0
Entering edit mode

This sounds like genome assembly. Why are you doing this in BioPython and not using an existing tool?

ADD REPLY
0
Entering edit mode

I think Ram is right and you'll want to look for another tool for this. It won't be difficult in python, but it probably will be slow with that much data.

Do you need the sequences to be aligned, or are they already the same length/aligned?

Sounds like you just need multiple pairwise Hamming or Levenshtein values.

ADD REPLY
1
Entering edit mode
3.9 years ago
Joe 21k

This could be as simple as the below, but it has the following assumptions until OP clarifies:

  • Sequences are aligned or already the same length.
  • If they are aligned, they are provided as an aligned FASTA, rather than another alignment format.

If not, the Hamming functionality can be supplemented with an alignment step, but this will probably start to get slow with the OP's 400,000 sequences.


from Bio import SeqIO
import sys

def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


for i, record in enumerate(SeqIO.parse(sys.argv[1], 'fasta')):
    if i == 0:
        reference = record
    print("\t".join([reference.id, record.id, str(hamming_distance(reference.seq, record.seq))]))

Example input:

>Zebrafish
ESLLRFGLRSDLDFRLSLNGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLENSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPV
>Fugu
ETVLSVGLSAETEISLSLNGSEPLEDTGQTLASCGIVSGDLIRVALIRAADAPDRDDGGG
HSEQVSQEAKLPDASGASTDSDQAPGPAASCWEPMLCSETDEGQAPWSLELLYHSAQVSG
PGDALVVAANLLMIETGFSPQDSQLKPAEMPAGWRCGGVYKLQYSHRLCGDSVVVMVAVS
MGSALIINGLLEVNQSADSVCKLCVDPSSYVTEWPGDSAAAAFKELNKLSRVFKDQVAYP
LITAARHAMALPVAFGLTALPPELLLRVFRLLDVRSVVMLSAVCRHFGAITRDTALWRHL
YCRDFRDSHAGSRDTDWKEVYRRSYKSRSAVRRSHECFLPPLYPNPRGVFTPPPPVPGII
GEYDQRPILPRPRYDPMSPFPDLDRQP
>Chicken
RALLAWGYSSDTEFSITLNGKDALTEDEKTLASYGIVPGDLICLLLEETDLPPPSSSPPS
LQNGKNGSSLEFPSGLVPEDVDLEEGTGSYPSEPMLCSEAADGEIPHSLEVLYLSAECTS
ATDALIVLVHLLMMETGYVPQGTEAKAVSMPEKWRGNGVYKLQYTHPLCEEGSAGLTCVP
LGDLVAINATLKINREIKGVKRIQLLPASFVCFQEPEKVAGVYKDLQKLSRLFKDQLVYS
LLAAARQALNLPDVFGLVVLPLELKLRIFRLLDVRSLISLSAVCRDLYAASNDQLLWRFM
YLRDFRDPIARPRDTDWKELYKKKLKQKEALRWRHMFLPPPFHPNPFYPSPFPIYPPMVI
GEYGERPSLIPPHFDPIGSLPGANPTL
>Zebra
SMTENRTAGSDTAFSVTLNRKDALTEDQKTLASYGIVSGDLICLLLEEPDLPPPPATPAP
LQNGNNGSSLEFPSGLVPEDADLEEGTGSYPSEPMLCSEAADGETPHSLEMLYLSAECTS
ATDALIVLVHLLMMETGYVPQGIEAKAVFMPEKWRGNGVYKLQYTHPLCGEGCAGLTCVP
LGDLIAINATLKINEEIRSVKRIQLLPSSFVCFQDPEKVAGVYKDLQKLSRLFKDQLVYS
LLAAARQALNLPDVFGLLVLPLELKLRIFRLLDVRSLISLSAVCRDLYTASNDQLLWRFM
YLRDFRDPIARPRDTDWKELYKKKLKQKEALRWRHMMLLPPFHPNPFYPNPFPIYPPMII
GEYDERPSLIPPHFDPIGSLPGANPML
>Anole
QALLSWGYSSETKFEITLNNKDSLVGDQDTLASFGIVSGDLICLILEDDASSPSSSLPSS
QSNHHSGPSQEFTSEGGPDDLDLQEATGSFPSEPMLCCEATDGQVPHSLQTLYHSAECTN
ANDALIVSIHLIMMETGYVPQGTEAKASSMPENWRNKGVYKLLYTHPLCENGFAVLTCVP
LGNLIVVNAMLKITSDIKSVKRLQLLPTSFICFQDSANVVGVYKDLQKLSRLFKDRLVYP
LLAAARQALNLPDVFGLVVLPLELKLRIFRLLDFRSLLSLSAVCHDLYAASNDQLLWRFI
YLRDFRDPVARSRDTDWKELYKKKMKQKDALRWRHMMFLPPLHPNPLYPNPFPLYPPMII
GEMDERPSLFPSHLDPFGSFQNPNPTL
>Human
QSLLTWGYSSNTRFTITLNYKDPLTGDEETLASYGIVSGDLICLILQDDIIPSSTSEHSS
LQNNSNGPSQNFEAESIQDNAHMAEGTGFYPSEPMLCSESVEGQVPHSLETLYQSADCSD
ANDALIVLIHLLMLESGYIPQGTEAKALSMPEKWKLSGVYKLQYMHPLCEGSSATLTCVP
LGNLIVVNATLKINNEIRSVKRLQLLPESFICKKLGENVANIYKDLQKLSRLFKDQLVYP
LLAFTRQALNLPDVFGLVVLPLELKLRIFRLLDVRSVLSLSAVCRDLFTASNDPLLWRFL
YLRDFRDNTVRVQDTDWKELYRKRHIQRESPKGRFVMLLPPFYPNPLHPRPFPRLPPGII
GEYDQRPSLIPPRFDPVGPLPGPNPIL

Example output:

Zebrafish   Zebrafish   0
Zebrafish   Fugu    214
Zebrafish   Chicken 223
Zebrafish   Zebra   226
Zebrafish   Anole   233
Zebrafish   Human   223
ADD COMMENT
0
Entering edit mode

I tried to add another if statement to remove the sequences with a hamming distance >50 but I was unable to write that in the file, any suggestions for outputing a new fasta file removing those with a count over 50?

ADD REPLY
0
Entering edit mode

What did you try and in what way did it not work?

ADD REPLY
0
Entering edit mode
from Bio import SeqIO
 import sys
 file = SeqIO.parse("file.fasta", "fasta")
 def hamming_distance(s1, s2):
      if len(s1) != len(s2):
         raise ValueError("ALIGN!")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

result = []
for i, record in enumerate(SeqIO.parse("file.fasta", 'fasta')):
     if i == 0:
          reference = record
          for group in file:
                hyper_mutated = group.split(",")
                 remove = []
                for seqs in remove:
                       if all(hamming_distance(seqs, base) > 50 for base in remove):
                              remove.append(seqs)
                     result.append(",".join(remove))
print(result)

I tried this, forgetting "SeqRecord has no attribute split." I don't believe SeqRecord has an attribute that does the same thing as split. Should I define a function split before 'results = [ ]' ?

ADD REPLY
0
Entering edit mode

I'm not sure what you're trying to achieve with opening the fasta file twice here? Why are you splitting it on a comma - there should be no comma delimiters in a fasta file?

An easier way would just be to add the records of interest to a new list for use later. I think you're overcomplicating the process.

I haven't tested this code, but it could/should be as simple as:

from Bio import SeqIO
import sys

def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

similar_seqs = []
for i, record in enumerate(SeqIO.parse(sys.argv[1], 'fasta')):
    if i == 0:
        reference = record
    if hamming_distance(reference.seq, record.seq) < 50:
        similar_seqs.append(record)

# Add some logic using SeqIO.write() to write a new file, or print out the sequences etc.

#    print("\t".join([reference.id, record.id, str(hamming_distance(reference.seq, record.seq))]))
ADD REPLY

Login before adding your answer.

Traffic: 1623 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6