Question: How To Use Bwa To Find Restriction Enzyme Coordinates In A Genome
2
gravatar for Fidel
8.4 years ago by
Fidel1.9k
Germany
Fidel1.9k wrote:

I want to find the coordinates of all occurrences of the sequence recognized by a restriction enzyme. I know that using EMBOSS I may do this, but this task seems perfectly fitted for short-read sequence alignment software. However, I didn't find any reference for this.

I used bwa for the task and quickly obtained some results. However, to be on the safe side I will like to ask is someone has done something similar or has some advice, perhaps I am stretching the use of bwa.

I tried the following:

echo -e ">DpnII\bGATC" > DnpII.fa
bwa aln -N -n 0 -o 0 -e 0 -l 4 -k 0 dm3 DpnII.fa > DpnII.sai
bwa samse -n 100000000 dm3 DpnII.sai DpnII.fa > DpnII.sam

The results seem to match the right sites, also the number of sites (489570 for Drosophila) that I obtain are close to what I expect.

Thanks.

enzyme bwa • 3.3k views
ADD COMMENTlink modified 3.5 years ago • written 8.4 years ago by Fidel1.9k
2
gravatar for Fidel
3.5 years ago by
Fidel1.9k
Germany
Fidel1.9k wrote:

Here is an update for people landing on this question: The bwa method that I proposed does not work properly. I don't remember exactly why it was wrong but I stop using it. I think it was returning regions with NNNs.

My solution was to use Biopython to search for a pattern. This is quite fast and accurate. Here is the code that finds and sorts the results:

https://github.com/maxplanck-ie/HiCExplorer/blob/master/hicexplorer/findRestSite.py

[update] Based on the comments by Michael Dondrup I fixed an error in the code.

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Fidel1.9k

Why don't you simply use an existing solution like remap? The prime advantage being: remap knows all the common restriction enzyme sequences already. And it is well tested. The Gbrowse plugin on the other hand is very comfortable to use when looking at a certain region of the genome.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Michael Dondrup47k

At the time I needed something that I could add to our set of tools to process and analyze Hi-C data. Having Emboss or GBrowse2 as dependencies just to find restriction sites was not worth (BWA was already a dependency). The solution using bio-python is so simple that I didn't mind re-implementing it.

ADD REPLYlink written 3.5 years ago by Fidel1.9k

Did you check the result then, because I think there is an error in the code with the reverse complement?

ADD REPLYlink written 3.5 years ago by Michael Dondrup47k

I hope there is no bug. I tested the results and so far the only problem I found was with upper/lower case DNA strings that I initially overlooked. What would be your concern with the reverse complement?

ADD REPLYlink written 3.5 years ago by Fidel1.9k
rev_compl = str(Seq(pattern, generic_dna).complement())

don't you need to reverse the complement also? I think you should use the reverse_complement function over complement.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Michael Dondrup47k

That line returns the reverse complement of the motif (referred as pattern)

pattern = "ACGT" 
str(Seq(pattern, generic_dna).complement())
'TGCA'

Then, the motif and reverse complement of the motif are searched on the DNA sequence on the fasta file. Finally, the resulting bed file is sorted.

The motif can be a regexp (e.g. AC.GT) and the reverse complement function of the motif only changes the instances of the letters.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Fidel1.9k

I don't think so, why else is there a function reverse_complement in BioSeq. Your test pattern is a bit misleading, because the reverse complement of ACGT is ACGT (not TGCA) !

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Michael Dondrup47k
1

You are right! Should be the reverse_complement() function. Because all restriction enzymes that I know are palindromic I always get the correct answer in my tests. This is so embarrassing :(

ADD REPLYlink written 3.5 years ago by Fidel1.9k
1

Well, I checked on the palindromic feature, I found it mentioned that most but not all restriction enzymes cut at palindromic sites. If your sites were palindromes, then you are probably safe with the data generated. I found this list of non-palindromic enzymes: https://www.neb.com/tools-and-resources/selection-charts/enzymes-with-nonpalindromic-sequences

I think that a lot of such 'semi-errors' harbor in random code written from scratch by anyone. This is my argument against writing any new code for these standard problems if avoidable.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Michael Dondrup47k
1

I just updated the github page to use the reverse_complement. I tested the new code with one instance of a non-palindromic enzyme and compared the results with Emboss remap. Thanks for pointing out the error.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Fidel1.9k
1
gravatar for Michael Dondrup
8.4 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

Finding these sites is an exact matching problem, so anything that can do exact matching should work. However there are some established tools, that you could use as well. I would at least compare the results to what remap delivers.

GBrowse2 contains a plug-in for computing and visualizing restriction enzyme sites. It is activated in the standard install but unfortunately I didn't see it in flybase. You would therefore have to download the genome sequence and put it in a local instance.

EMBOSS has some tools too like remap. There is also a web interface to remap here.

ADD COMMENTlink modified 8.4 years ago • written 8.4 years ago by Michael Dondrup47k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1183 users visited in the last hour