Question: How To Find The Locations Of A Short Specific Sequence In A Genome With 1 Or 2 Mismatches Allowed?
7.6 years ago, William (4.8k) wrote:

We have a 23-nucleotide CRISPR target sequence, and I would like to find out whether it is also present at other locations in the genome.

The sequence directs a CRISPR RNA construct to introduce an indel mutation in the genome, and we would like to make sure that there is only one target locus. There is also one N in the nucleotide sequence.

Let's say the 23 nucleotide sequence is :


How do I find all the loci in a genome where this sequence matches, either exactly (well, with 1 mismatch at the N) or with, say, an edit distance of 2 or 3?

I tried bwa aln with a short 23 bp sequence taken from the human genome, using the parameters -l 23 -k 2, but it did not recover the location of that sequence. Does bwa work with sequences of this length?

I tried BLAST, but I get back a lot of results and I can't control the maximum edit distance.

Tags: sequence, bwa, blast

PatMatch lets you control the number of mismatches and whether that count includes insertions, deletions, and/or substitutions. A stand-alone version of the software is available, as posted about here in response to a related question. (In fact, at the referenced resource you can run it in your browser right now via a Jupyter environment.) As far as I can tell, you cannot break that number down further, for example to specify at most 2 substitutions and 1 deletion.

written 2.8 years ago by Wayne (420)

But it looks like PatMatch only works for Arabidopsis.

written 2.5 years ago by c_u (290)

@chahat_u PatMatch definitely isn't limited to Arabidopsis. Look at the other post I pointed to here. Several web sites offer PatMatch as a web tool for quite a few organisms beyond Arabidopsis; I list the ones I could find here. Additionally, as long as you have the sequence, you can launch a Binder session, follow along with the example I set up, and use another genome.

written 2.5 years ago by Wayne (420)
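
To make the distinction in the PatMatch comments concrete (a mismatch budget that includes insertions and deletions, rather than substitutions only), here is a minimal Python sketch of approximate matching by edit distance. It is the textbook dynamic-programming approach, not PatMatch's implementation, and the function name approx_find is made up for this illustration. It is only practical for short test sequences, not a whole genome.

```python
def approx_find(text, pattern, max_edits):
    """Return (end, edits) for every position in text where some substring
    ending there is within max_edits edits of pattern.

    end is exclusive, i.e. the match is a suffix of text[:end].
    Edits are substitutions, insertions and deletions, all with cost 1.
    """
    m = len(pattern)
    # Column for the empty text prefix: aligning pattern[:i] costs i edits.
    prev = list(range(m + 1))
    hits = []
    for end, base in enumerate(text, start=1):
        # cur[0] = 0 because a match is allowed to start at any text position.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == base else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # text base with no partner in the pattern
                cur[i - 1] + 1,      # pattern base with no partner in the text
            )
        if cur[m] <= max_edits:
            hits.append((end, cur[m]))
        prev = cur
    return hits


if __name__ == "__main__":
    # "ACT" is "ACGT" with the G deleted, so it is reported with 1 edit.
    print(approx_find("TTACTTT", "ACGT", 1))
```

Because one budget covers all three edit types, neighbouring end positions are often reported for the same locus; a real tool would merge those into a single hit.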
6.1 years ago, Maximilian Haeussler (1.4k) wrote:

Yes, bwa will find it, but you need to change the parameters. Do not use the seeded mode; use the slower -N mode, which disables the iterative search and reports all hits within the allowed number of differences:

bwa aln -n 4 -o 0 -k 4 -N

The Sanger CRISPR site uses more or less these parameters.


Hi, I tried your method to find the genomic location of a DNA sequence in the hg19 genome, and I ran the following command:

bwa aln -n 4 -o 0 -k 4 -N hg19.fasta testmotif.fq > out.sai

But the out.sai file seemed to contain only illegible content:

SAI  ÄÑÄø ˇˇˇ

Do you have any idea what could be going wrong?

written 2.5 years ago by c_u (290)
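
Regarding the unreadable out.sai above: as far as I know, the .sai file is bwa's binary intermediate format, so it is not meant to be read directly; a second step, bwa samse, converts it to SAM. Below is a small Python sketch that spells out both commands. It assumes the reference has already been indexed with bwa index, and the helper names and the out.sam file name are made up for this illustration.

```python
import shlex


def bwa_aln_pipeline(reference, reads, sai="out.sai", sam="out.sam"):
    """Return (argv, stdout_file) pairs for the two-step bwa aln workflow.

    Nothing is executed here; the caller decides how to run the steps.
    """
    # Step 1: same parameters as in the answer above; writes the binary .sai.
    aln = ["bwa", "aln", "-n", "4", "-o", "0", "-k", "4", "-N", reference, reads]
    # Step 2: convert the .sai coordinates into human-readable SAM records.
    samse = ["bwa", "samse", reference, sai, reads]
    return [(aln, sai), (samse, sam)]


def as_shell(steps):
    """Render each step as a shell command line with stdout redirected."""
    return [f"{shlex.join(argv)} > {shlex.quote(out)}" for argv, out in steps]


if __name__ == "__main__":
    for line in as_shell(bwa_aln_pipeline("hg19.fasta", "testmotif.fq")):
        print(line)
```

To actually run the steps from Python, each argv could be passed to subprocess.run with stdout pointed at the corresponding file opened for writing. The hits then appear in out.sam, one record per query sequence.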
2.1 years ago, Johan Zicola (60) wrote:

Using Bowtie (for example v1.2.2 here) to find off-targets for defined CRISPR-Cas9 target sequences:

Make the Bowtie index for your genome (fasta file format)

bowtie-build -f genome.fa  genome_prefix

Search for your target sequence, allowing 1 mismatch (for your N) with the flag -n 1:

bowtie genome_prefix -n 1 -c GGAGCGAGCGGAGCGGTACANGG

This should recover your original sequence even with 1 mismatch (your N in this case). To allow 2 mismatches, use -n 2. Although the -n argument nominally accepts up to 3 mismatches, only 2 are tolerated in practice (I opened an issue about this in their GitHub repository). The seed length is 28 by default, which is longer than a CRISPR-Cas9 target sequence (typically 20 bp), so you do not need to change it. See the Bowtie documentation for more details.

Note: I use Bowtie because Bowtie2 allows at most 1 mismatch in its seed alignment, which is a drawback in this case. Note also that while you can search for a sequence containing Ns, Bowtie does not align to Ns in your reference genome (Bowtie2 does). It would be nice to have both Bowtie's flexibility in the number of mismatches allowed and Bowtie2's ability to align to reference sequences containing Ns. Despite this limitation, Bowtie is used to identify off-targets in the most common web tools for sgRNA design, such as CHOPCHOP and CCTop.

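
As a way to sanity-check aligner output on a small test sequence, here is a brute-force Python sketch that scans both strands and counts mismatches directly. It is not how Bowtie works internally, and it is far too slow for a whole mammalian genome. The function names are made up, and treating an N in the query as a wildcard is an assumption of this sketch (in the Bowtie command above, the N costs one mismatch instead).

```python
# Translation table for complementing DNA, keeping N as N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(query, window, limit):
    """Count mismatching positions, stopping early once limit is exceeded.

    Assumption: an N in the query matches any base and costs nothing.
    An N in the window (the genome) still counts as a mismatch.
    """
    mismatches = 0
    for q, w in zip(query, window):
        if q != "N" and q != w:
            mismatches += 1
            if mismatches > limit:
                break
    return mismatches


def find_hits(genome, query, max_mismatches=2):
    """Return sorted (start, strand, mismatches) tuples, 0-based starts.

    Minus-strand hits are reported by their start on the forward strand.
    """
    genome = genome.upper()
    query = query.upper()
    hits = []
    for strand, pattern in (("+", query), ("-", reverse_complement(query))):
        for start in range(len(genome) - len(pattern) + 1):
            window = genome[start:start + len(pattern)]
            n = count_mismatches(pattern, window, max_mismatches)
            if n <= max_mismatches:
                hits.append((start, strand, n))
    return sorted(hits)


if __name__ == "__main__":
    target = "GGAGCGAGCGGAGCGGTACANGG"
    toy_genome = "TTTTT" + "GGAGCGAGCGGAGCGGTACATGG" + "TTTTT"
    print(find_hits(toy_genome, target, max_mismatches=0))
```

Any locus reported by the aligner within the same mismatch budget should also show up here; a disagreement usually points to a difference in how Ns or strands are handled.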
7.6 years ago, Jeremy Leipzig (19k), Philadelphia, PA, wrote:

vmatch is an excellent general aligner

The Vmatch large scale sequence analysis software

modified 7.6 years ago by Istvan Albert (86k)