string search in fastq file
1
0
Entering edit mode
7.3 years ago

I have a list of 20 bp sequences stored in a list called known. I have a fastq file called A.fastq, from which I cut out the relevant 20 bp region and compare against the known list. I would like to use Hamming distance as a metric for choosing the correct match i.e for every fastq read, I want to find the string in the known list which has is the least Hamming distance apart and update a counter good_count only if the least Hamming distance is <= 3. This is my code, but this does not account for Hamming distance criteria. Can you suggest what I should do?

import Bio
from Bio import SeqIO

known = set()
for s in Bio.SeqIO.parse("lib.fa","fasta"):
        known.add(str(s.seq[10:20]))

good_count = 0
for r in Bio.SeqIO.parse("read1.fastq","fastq"):
        if str(r.seq[10:20]) in known:
                good_count += 1
print good_count

Specifically, I want help at the "if" statement to compare string by string from the list "known"

sequencing • 2.1k views
ADD COMMENT
0
Entering edit mode

Look into the python distance package.

ADD REPLY
1
Entering edit mode
7.3 years ago

The itertools.imap() method runs the operator.ne function on each pair of characters in the strings a and b. The sum of mismatches (True -> 1) and matches (False -> 0) gives the distance score.

import itertools
import operator

def hamming(a, b):
    return sum( itertools.imap( operator.ne, a, b ) )
ADD COMMENT

Login before adding your answer.

Traffic: 1945 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6