Question: Counting mismatched characters from sequences from a fasta file in comparison to a reference sequence
0
gravatar for nameuser
4 weeks ago by
nameuser10
nameuser10 wrote:

I am attempting to use Biopython to compare all the sequences (~400,000) to a reference sequence. I parsed the fasta file using: for record in SeqIO.parse("filename.fasta", "fasta").

Can someone help with the code to loop over the sequencing in the fasta file and compare it with a string, the reference sequences.

I hope to count +1 each time a character is mismatched.

Thank you!

biopython python fasta • 162 views
ADD COMMENTlink modified 4 weeks ago by Joe17k • written 4 weeks ago by nameuser10

This sounds like genome assembly. Why are you doing this in BioPython and not using an existing tool?

ADD REPLYlink written 4 weeks ago by RamRS28k

I think Ram is right and you'll want to look for another tool for this. It won't be difficult in python, but it probably will be slow with that much data.

Do you need the sequences to be aligned, or are they already the same length/aligned?

Sounds like you just need multiple pairwise Hamming or Levenshtein values.

ADD REPLYlink written 4 weeks ago by Joe17k
1
gravatar for Joe
4 weeks ago by
Joe17k
United Kingdom
Joe17k wrote:

This could be as simple as the below, but it has the following assumptions until OP clarifies:

  • Sequences are aligned or already the same length.
  • If they are aligned, they are provided as an aligned FASTA, rather than another alignment format.

If not, the Hamming functionality can be supplemented with an alignment step, but this will probably start to get slow with the OP's 400,000 sequences.


from Bio import SeqIO
import sys

def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


for i, record in enumerate(SeqIO.parse(sys.argv[1], 'fasta')):
    if i == 0:
        reference = record
    print("\t".join([reference.id, record.id, str(hamming_distance(reference.seq, record.seq))]))

Example input:

>Zebrafish
ESLLRFGLRSDLDFRLSLNGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLENSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPV
>Fugu
ETVLSVGLSAETEISLSLNGSEPLEDTGQTLASCGIVSGDLIRVALIRAADAPDRDDGGG
HSEQVSQEAKLPDASGASTDSDQAPGPAASCWEPMLCSETDEGQAPWSLELLYHSAQVSG
PGDALVVAANLLMIETGFSPQDSQLKPAEMPAGWRCGGVYKLQYSHRLCGDSVVVMVAVS
MGSALIINGLLEVNQSADSVCKLCVDPSSYVTEWPGDSAAAAFKELNKLSRVFKDQVAYP
LITAARHAMALPVAFGLTALPPELLLRVFRLLDVRSVVMLSAVCRHFGAITRDTALWRHL
YCRDFRDSHAGSRDTDWKEVYRRSYKSRSAVRRSHECFLPPLYPNPRGVFTPPPPVPGII
GEYDQRPILPRPRYDPMSPFPDLDRQP
>Chicken
RALLAWGYSSDTEFSITLNGKDALTEDEKTLASYGIVPGDLICLLLEETDLPPPSSSPPS
LQNGKNGSSLEFPSGLVPEDVDLEEGTGSYPSEPMLCSEAADGEIPHSLEVLYLSAECTS
ATDALIVLVHLLMMETGYVPQGTEAKAVSMPEKWRGNGVYKLQYTHPLCEEGSAGLTCVP
LGDLVAINATLKINREIKGVKRIQLLPASFVCFQEPEKVAGVYKDLQKLSRLFKDQLVYS
LLAAARQALNLPDVFGLVVLPLELKLRIFRLLDVRSLISLSAVCRDLYAASNDQLLWRFM
YLRDFRDPIARPRDTDWKELYKKKLKQKEALRWRHMFLPPPFHPNPFYPSPFPIYPPMVI
GEYGERPSLIPPHFDPIGSLPGANPTL
>Zebra
SMTENRTAGSDTAFSVTLNRKDALTEDQKTLASYGIVSGDLICLLLEEPDLPPPPATPAP
LQNGNNGSSLEFPSGLVPEDADLEEGTGSYPSEPMLCSEAADGETPHSLEMLYLSAECTS
ATDALIVLVHLLMMETGYVPQGIEAKAVFMPEKWRGNGVYKLQYTHPLCGEGCAGLTCVP
LGDLIAINATLKINEEIRSVKRIQLLPSSFVCFQDPEKVAGVYKDLQKLSRLFKDQLVYS
LLAAARQALNLPDVFGLLVLPLELKLRIFRLLDVRSLISLSAVCRDLYTASNDQLLWRFM
YLRDFRDPIARPRDTDWKELYKKKLKQKEALRWRHMMLLPPFHPNPFYPNPFPIYPPMII
GEYDERPSLIPPHFDPIGSLPGANPML
>Anole
QALLSWGYSSETKFEITLNNKDSLVGDQDTLASFGIVSGDLICLILEDDASSPSSSLPSS
QSNHHSGPSQEFTSEGGPDDLDLQEATGSFPSEPMLCCEATDGQVPHSLQTLYHSAECTN
ANDALIVSIHLIMMETGYVPQGTEAKASSMPENWRNKGVYKLLYTHPLCENGFAVLTCVP
LGNLIVVNAMLKITSDIKSVKRLQLLPTSFICFQDSANVVGVYKDLQKLSRLFKDRLVYP
LLAAARQALNLPDVFGLVVLPLELKLRIFRLLDFRSLLSLSAVCHDLYAASNDQLLWRFI
YLRDFRDPVARSRDTDWKELYKKKMKQKDALRWRHMMFLPPLHPNPLYPNPFPLYPPMII
GEMDERPSLFPSHLDPFGSFQNPNPTL
>Human
QSLLTWGYSSNTRFTITLNYKDPLTGDEETLASYGIVSGDLICLILQDDIIPSSTSEHSS
LQNNSNGPSQNFEAESIQDNAHMAEGTGFYPSEPMLCSESVEGQVPHSLETLYQSADCSD
ANDALIVLIHLLMLESGYIPQGTEAKALSMPEKWKLSGVYKLQYMHPLCEGSSATLTCVP
LGNLIVVNATLKINNEIRSVKRLQLLPESFICKKLGENVANIYKDLQKLSRLFKDQLVYP
LLAFTRQALNLPDVFGLVVLPLELKLRIFRLLDVRSVLSLSAVCRDLFTASNDPLLWRFL
YLRDFRDNTVRVQDTDWKELYRKRHIQRESPKGRFVMLLPPFYPNPLHPRPFPRLPPGII
GEYDQRPSLIPPRFDPVGPLPGPNPIL

Example output:

Zebrafish   Zebrafish   0
Zebrafish   Fugu    214
Zebrafish   Chicken 223
Zebrafish   Zebra   226
Zebrafish   Anole   233
Zebrafish   Human   223
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Joe17k

I tried to add another if statement to remove the sequences with a hamming distance >50 but I was unable to write that in the file, any suggestions for outputing a new fasta file removing those with a count over 50?

ADD REPLYlink written 25 days ago by nameuser10

What did you try and in what way did it not work?

ADD REPLYlink written 25 days ago by Joe17k
from Bio import SeqIO
 import sys
 file = SeqIO.parse("file.fasta", "fasta")
 def hamming_distance(s1, s2):
      if len(s1) != len(s2):
         raise ValueError("ALIGN!")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

result = []
for i, record in enumerate(SeqIO.parse("file.fasta", 'fasta')):
     if i == 0:
          reference = record
          for group in file:
                hyper_mutated = group.split(",")
                 remove = []
                for seqs in remove:
                       if all(hamming_distance(seqs, base) > 50 for base in remove):
                              remove.append(seqs)
                     result.append(",".join(remove))
print(result)

I tried this, forgetting "SeqRecord has no attribute split." I don't believe SeqRecord has an attribute that does the same thing as split. Should I define a function split before 'results = [ ]' ?

ADD REPLYlink modified 24 days ago • written 24 days ago by nameuser10

I'm not sure what you're trying to achieve with opening the fasta file twice here? Why are you splitting it on a comma - there should be no comma delimiters in a fasta file?

An easier way would just be to add the records of interest to a new list for use later. I think you're overcomplicating the process.

I haven't tested this code, but it could/should be as simple as:

from Bio import SeqIO
import sys

def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

similar_seqs = []
for i, record in enumerate(SeqIO.parse(sys.argv[1], 'fasta')):
    if i == 0:
        reference = record
    if hamming_distance(reference.seq, record.seq) < 50:
        similar_seqs.append(record)

# Add some logic using SeqIO.write() to write a new file, or print out the sequences etc.

#    print("\t".join([reference.id, record.id, str(hamming_distance(reference.seq, record.seq))]))
ADD REPLYlink written 24 days ago by Joe17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 735 users visited in the last hour