Question: Changing output format for hamming distance of FASTA sequences (using Biopython)
0
gravatar for nameuser
5 months ago by
nameuser30
nameuser30 wrote:

Hi there, I currently have the code stated below. The output I currently have it set to write out is not what I'm looking for. However, this has been the only way I have successfully run the script. I would like to remove the sequences from a fasta file that have a hamming distance of 50 or more. I tried to use an additional for loop and write the output file with SeqIO but the way 'seqs' is currently written, it was not defined. I created an empty list and tried to append the results.

from Bio import SeqIO
REFERENCE_SEQ = 'ATGC.....'

def hamming_distance(seq):
    result = sum(b1 != b2 for b1, b2 in zip(REFERENCE_SEQ, seq))
    return result


 if __name__ == '__main__':
     filename = open('test.fasta', 'r')
    seqs = [record for record in SeqIO.parse(filename, 'fasta')]

     counted = list(map(lambda x: (x, hamming_distance(x)), seqs))
     results = list(filter(lambda x: x[1] < 50, counted))
     print(*results, sep='\n', file=open("output.fasta", "a"))

The current output is as follows:

(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGCAATATACT...CGG',SingleLetterAlphabet()), id='rev_comp_consensus', name='rev_comp_consensus', description='rev_comp_consensus', dbxrefs=[]), 7)
(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGTAATATACT...CCG',   SingleLetterAlphabet()), id='1', name='1', description='1', dbxrefs=[]), 7)
(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGCAATATACT...GGT',   SingleLetterAlphabet()), id='2_also7', name='2_also7', description='2_also7', dbxrefs=[]), 7)

Etc.

Please let me know if you have any ideas to

1. remove the sequences with a specified hamming distance

2. write the remaining sequences to a fasta file

Thank you so much.

ADD COMMENTlink modified 5 months ago by Joe18k • written 5 months ago by nameuser30

Are the three final lines of code indented properly?

ADD REPLYlink written 5 months ago by cschu1812.5k

oops, must've missed that when I wrote the question. I just fixed their indentation! Thanks.

ADD REPLYlink written 5 months ago by nameuser30

Does this work?

with open("out.fasta", "w") as f:
    for record in SeqIO.parse(open("test.fasta"):
        dh = hamming(str(record.seq))
        if dh < 50:
            print(">{id}\n{seq}".format(id=record.id, seq=record.seq), file=f)
ADD REPLYlink modified 5 months ago • written 5 months ago by cschu1812.5k

The logic definitely makes sense to me, I adjusted it a little for errors but the print line was extremely helpful! I'm getting a blank output right now though, when I did a matrix for the sequences, there were plenty (500,000 sequences per file) with less than 50. Additionally, dh is being defined.

Thanks again!

ADD REPLYlink modified 5 months ago • written 5 months ago by nameuser30

Could you post the corrections you've made, please?

ADD REPLYlink written 5 months ago by cschu1812.5k
 if __name__ == '__main__':
    with open(filename + "out.fasta", "w") as f:
        for record in (SeqIO.parse(filename, 'fasta')):
          if hamming_distance(str(record.seq)) < 50 :
                print(">{id}\n{seq}".format(id=record.id, seq=record.seq), file=f)
ADD REPLYlink modified 5 months ago • written 5 months ago by nameuser30
3
gravatar for Joe
5 months ago by
Joe18k
United Kingdom
Joe18k wrote:

I assume this is a follow-on from this thread? C: Counting mismatched characters from sequences from a fasta file in comparison to

If so, I gave you workable code, you just need to modify the output format and/or use SeqIO.write() as I suggested at the time.

from Bio import SeqIO
import sys

def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

similar_seqs = []
for i, record in enumerate(SeqIO.parse(sys.argv[1], 'fasta')):
    if i == 0:
        reference = record
    if hamming_distance(reference.seq, record.seq) < 50:
        similar_seqs.append(record)

with open("similar_seqs.fasta", "w") as output_handle:
    SeqIO.write(similar_seqs, output_handle, "fasta")

Input:

Note, I've added a modified entry at the end with 3 residues altered

>Zebrafish
ESLLRFGLRSDLDFRLSLNGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLENSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPV
..... Tuncated for character count....
>Anole
QALLSWGYSSETKFEITLNNKDSLVGDQDTLASFGIVSGDLICLILEDDASSPSSSLPSS
QSNHHSGPSQEFTSEGGPDDLDLQEATGSFPSEPMLCCEATDGQVPHSLQTLYHSAECTN
ANDALIVSIHLIMMETGYVPQGTEAKASSMPENWRNKGVYKLLYTHPLCENGFAVLTCVP
LGNLIVVNAMLKITSDIKSVKRLQLLPTSFICFQDSANVVGVYKDLQKLSRLFKDRLVYP
LLAAARQALNLPDVFGLVVLPLELKLRIFRLLDFRSLLSLSAVCHDLYAASNDQLLWRFI
YLRDFRDPVARSRDTDWKELYKKKMKQKDALRWRHMMFLPPLHPNPLYPNPFPLYPPMII
GEMDERPSLFPSHLDPFGSFQNPNPTL
>Human
QSLLTWGYSSNTRFTITLNYKDPLTGDEETLASYGIVSGDLICLILQDDIIPSSTSEHSS
LQNNSNGPSQNFEAESIQDNAHMAEGTGFYPSEPMLCSESVEGQVPHSLETLYQSADCSD
ANDALIVLIHLLMLESGYIPQGTEAKALSMPEKWKLSGVYKLQYMHPLCEGSSATLTCVP
LGNLIVVNATLKINNEIRSVKRLQLLPESFICKKLGENVANIYKDLQKLSRLFKDQLVYP
LLAFTRQALNLPDVFGLVVLPLELKLRIFRLLDVRSVLSLSAVCRDLFTASNDPLLWRFL
YLRDFRDNTVRVQDTDWKELYRKRHIQRESPKGRFVMLLPPFYPNPLHPRPFPRLPPGII
GEYDQRPSLIPPRFDPVGPLPGPNPIL
>Zebrafish_modified
ESLLRFGLRSDLDFRLSLSGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLANSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPL

Output:

$ cat similar_seqs.fa

>Zebrafish
ESLLRFGLRSDLDFRLSLNGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLENSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPV
>Zebrafish_modified
ESLLRFGLRSDLDFRLSLSGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLANSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPL

The reference sequence will always be included at present, but its easy to modify this to remove this if necessary.

ADD COMMENTlink modified 5 months ago • written 5 months ago by Joe18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1525 users visited in the last hour
_