How to search the human genome for sequences that differ from a given sequence by a set number of mismatches?
Part of a program I'm writing takes an input query sequence of 18-21 bp and has to find all, or as many as possible, of the genomic sequences that differ from it by 1, 2, 3, or 4 bp. In other words, any alignments with 1, 2, 3, or 4 mismatches. I've been trying to do this with blastn, but I only get 1 hit: the place where the query sequence aligns perfectly. It seems that BLAST doesn't let you specify how many mismatches to allow. Is there a way to lower the score cutoff for what counts as a hit? I'm also open to suggestions that don't use BLAST, since I've experimented with it a lot and had little success. Thanks in advance for any advice!

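BLAST filters hits by E-value rather than by mismatch count. Any hit that satisfies the E-value threshold, with or without mismatches, gets reported. I think the default threshold is E=10. For an 18-21 bp query against the whole human genome even the perfect match has a fairly high E-value, so depending on what that E-value is you may want to raise the threshold to 100 or so (the -evalue option). It is also possible that BLAST cannot find any hits with 1-4 mismatches using its default parameters, so you may have to play with the word size as well (-word_size, or -task blastn-short, which is meant for short queries and uses a small word size). If I remember correctly, smaller word sizes make the search slower but more sensitive to short matches. Keep in mind that BLAST only extends alignments from exact seed matches of the word size, so a hit whose mismatches are spread evenly along the query may contain no seed at all and still be missed.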

A simple, brute-force solution in C. Not really tested.

compilation:

gcc -Wall -o biostar9528296 biostar9528296.c


usage:

./biostar9528296 ATATCATCGTACTAGCGATGTTTAGGGGA 2 < in.fa
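
The source file itself is not reproduced in this post, so what follows is only a minimal sketch of what a brute-force scan of this kind might look like, not the original program. The names count_mismatches and scan_fasta are hypothetical. The sketch assumes forward strand only, treats N (or any other character) as an ordinary mismatch, and slides a query-length window over each FASTA record one base at a time.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count mismatches between two n-long strings; stop early once the
 * count exceeds max_mm (the return value is then max_mm + 1). */
static int count_mismatches(const char *a, const char *b, size_t n, int max_mm)
{
    int mm = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i] && ++mm > max_mm)
            break;
    }
    return mm;
}

/* Hypothetical helper: scan FASTA text from `in` and write one line per hit
 * to `out` as: record name, 1-based start, matched window, mismatch count.
 * Returns the number of hits, or -1 on bad arguments / allocation failure.
 * Assumption: forward strand only, comparison is case-insensitive. */
long scan_fasta(FILE *in, const char *query, int max_mm, FILE *out)
{
    size_t qlen = strlen(query);
    if (qlen == 0 || max_mm < 0)
        return -1;

    char *q = malloc(qlen + 1);
    char *win = malloc(qlen + 1);
    if (q == NULL || win == NULL) {
        free(q);
        free(win);
        return -1;
    }
    for (size_t i = 0; i < qlen; i++)
        q[i] = (char)toupper((unsigned char)query[i]);
    q[qlen] = '\0';
    win[qlen] = '\0';

    char name[256] = "";
    size_t filled = 0;      /* bases currently held in the window */
    long pos = 0;           /* bases read so far in the current record */
    long hits = 0;
    int at_line_start = 1;
    int c;

    while ((c = fgetc(in)) != EOF) {
        if (c == '>' && at_line_start) {
            /* Header line: keep the name up to the first whitespace,
             * skip the rest of the line, and reset the window. */
            size_t n = 0;
            int in_name = 1;
            while ((c = fgetc(in)) != EOF && c != '\n') {
                if (isspace((unsigned char)c))
                    in_name = 0;
                if (in_name && n + 1 < sizeof name)
                    name[n++] = (char)c;
            }
            name[n] = '\0';
            filled = 0;
            pos = 0;
            at_line_start = 1;
            continue;
        }
        at_line_start = (c == '\n');
        if (isspace((unsigned char)c))
            continue;

        /* Slide the window by one base (cheap for 18-21 bp queries). */
        if (filled == qlen)
            memmove(win, win + 1, qlen - 1);
        else
            filled++;
        win[filled - 1] = (char)toupper((unsigned char)c);
        pos++;

        if (filled == qlen) {
            int mm = count_mismatches(win, q, qlen, max_mm);
            if (mm <= max_mm) {
                fprintf(out, "%s\t%ld\t%s\t%d\n",
                        name, pos - (long)qlen + 1, win, mm);
                hits++;
            }
        }
    }
    free(q);
    free(win);
    return hits;
}
```

To get a command-line tool that matches the usage line above, a main would only need to call scan_fasta(stdin, argv[1], atoi(argv[2]), stdout) after checking argc. Unlike BLAST, this kind of scan is exhaustive: it reports every window within the mismatch limit, at the cost of reading the whole genome for each query. It would have to be run a second time on the reverse complement of the query to cover the other strand.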