Question: Changing output format for hamming distance of FASTA sequences (using Biopython)
0
gravatar for nameuser
11 days ago by
nameuser10
nameuser10 wrote:

Hi there, I currently have the code stated below. The output I currently have it set to write out is not what I'm looking for. However, this has been the only way I have successfully run the script. I would like to remove the sequences from a fasta file that have a hamming distance of 50 or more. I tried to use an additional for loop and write the output file with SeqIO but the way 'seqs' is currently written, it was not defined. I created an empty list and tried to append the results.

from Bio import SeqIO
REFERENCE_SEQ = 'ATGC.....'

def hamming_distance(seq):
    result = sum(b1 != b2 for b1, b2 in zip(REFERENCE_SEQ, seq))
    return result


 if __name__ == '__main__':
     filename = open('test.fasta', 'r')
    seqs = [record for record in SeqIO.parse(filename, 'fasta')]

     counted = list(map(lambda x: (x, hamming_distance(x)), seqs))
     results = list(filter(lambda x: x[1] < 50, counted))
     print(*results, sep='\n', file=open("output.fasta", "a"))

The current output is as follows:

(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGCAATATACT...CGG',SingleLetterAlphabet()), id='rev_comp_consensus', name='rev_comp_consensus', description='rev_comp_consensus', dbxrefs=[]), 7)
(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGTAATATACT...CCG',   SingleLetterAlphabet()), id='1', name='1', description='1', dbxrefs=[]), 7)
(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGCAATATACT...GGT',   SingleLetterAlphabet()), id='2_also7', name='2_also7', description='2_also7', dbxrefs=[]), 7)

Etc.

Please let me know if you have any ideas to

1. remove the sequences with a specified hamming distance

2. write the remaining sequences to a fasta file

Thank you so much.

ADD COMMENTlink modified 8 days ago by Joe17k • written 11 days ago by nameuser10

Are the three final lines of code indented properly?

ADD REPLYlink written 11 days ago by cschu1812.4k

oops, must've missed that when I wrote the question. I just fixed their indentation! Thanks.

ADD REPLYlink written 11 days ago by nameuser10

Does this work?

with open("out.fasta", "w") as f:
    for record in SeqIO.parse(open("test.fasta"):
        dh = hamming(str(record.seq))
        if dh < 50:
            print(">{id}\n{seq}".format(id=record.id, seq=record.seq), file=f)
ADD REPLYlink modified 11 days ago • written 11 days ago by cschu1812.4k

The logic definitely makes sense to me, I adjusted it a little for errors but the print line was extremely helpful! I'm getting a blank output right now though, when I did a matrix for the sequences, there were plenty (500,000 sequences per file) with less than 50. Additionally, dh is being defined.

Thanks again!

ADD REPLYlink modified 11 days ago • written 11 days ago by nameuser10

Could you post the corrections you've made, please?

ADD REPLYlink written 11 days ago by cschu1812.4k
 if __name__ == '__main__':
    with open(filename + "out.fasta", "w") as f:
        for record in (SeqIO.parse(filename, 'fasta')):
          if hamming_distance(str(record.seq)) < 50 :
                print(">{id}\n{seq}".format(id=record.id, seq=record.seq), file=f)
ADD REPLYlink modified 10 days ago • written 10 days ago by nameuser10
3
gravatar for Joe
8 days ago by
Joe17k
United Kingdom
Joe17k wrote:

I assume this is a follow-on from this thread? C: Counting mismatched characters from sequences from a fasta file in comparison to

If so, I gave you workable code, you just need to modify the output format and/or use SeqIO.write() as I suggested at the time.

from Bio import SeqIO
import sys

def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

similar_seqs = []
for i, record in enumerate(SeqIO.parse(sys.argv[1], 'fasta')):
    if i == 0:
        reference = record
    if hamming_distance(reference.seq, record.seq) < 50:
        similar_seqs.append(record)

with open("similar_seqs.fasta", "w") as output_handle:
    SeqIO.write(similar_seqs, output_handle, "fasta")

Input:

Note, I've added a modified entry at the end with 3 residues altered

>Zebrafish
ESLLRFGLRSDLDFRLSLNGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLENSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPV
..... Tuncated for character count....
>Anole
QALLSWGYSSETKFEITLNNKDSLVGDQDTLASFGIVSGDLICLILEDDASSPSSSLPSS
QSNHHSGPSQEFTSEGGPDDLDLQEATGSFPSEPMLCCEATDGQVPHSLQTLYHSAECTN
ANDALIVSIHLIMMETGYVPQGTEAKASSMPENWRNKGVYKLLYTHPLCENGFAVLTCVP
LGNLIVVNAMLKITSDIKSVKRLQLLPTSFICFQDSANVVGVYKDLQKLSRLFKDRLVYP
LLAAARQALNLPDVFGLVVLPLELKLRIFRLLDFRSLLSLSAVCHDLYAASNDQLLWRFI
YLRDFRDPVARSRDTDWKELYKKKMKQKDALRWRHMMFLPPLHPNPLYPNPFPLYPPMII
GEMDERPSLFPSHLDPFGSFQNPNPTL
>Human
QSLLTWGYSSNTRFTITLNYKDPLTGDEETLASYGIVSGDLICLILQDDIIPSSTSEHSS
LQNNSNGPSQNFEAESIQDNAHMAEGTGFYPSEPMLCSESVEGQVPHSLETLYQSADCSD
ANDALIVLIHLLMLESGYIPQGTEAKALSMPEKWKLSGVYKLQYMHPLCEGSSATLTCVP
LGNLIVVNATLKINNEIRSVKRLQLLPESFICKKLGENVANIYKDLQKLSRLFKDQLVYP
LLAFTRQALNLPDVFGLVVLPLELKLRIFRLLDVRSVLSLSAVCRDLFTASNDPLLWRFL
YLRDFRDNTVRVQDTDWKELYRKRHIQRESPKGRFVMLLPPFYPNPLHPRPFPRLPPGII
GEYDQRPSLIPPRFDPVGPLPGPNPIL
>Zebrafish_modified
ESLLRFGLRSDLDFRLSLSGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLANSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPL

Output:

$ cat similar_seqs.fa

>Zebrafish
ESLLRFGLRSDLDFRLSLNGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLENSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPV
>Zebrafish_modified
ESLLRFGLRSDLDFRLSLSGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLANSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPL

The reference sequence will always be included at present, but its easy to modify this to remove this if necessary.

ADD COMMENTlink modified 8 days ago • written 8 days ago by Joe17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1658 users visited in the last hour