Question: How To Represent Restriction Sites In GFF3
9.8 years ago by
Davis, California, USA
Daniel Standage3.9k wrote:

I've been looking for examples of how to store restriction sites in GFF3 format, but I have been unable to find any. Assuming I use the correct SO term in the third ('type') column, it shouldn't be too hard. My questions are mostly details.

  • Should the start and stop positions correspond to the recognized palindromic sequence, or should they correspond to the exact cleavage site?
  • When a sequence is digested by a restriction enzyme, do the lengths of the resulting fragments include the sticky, overhanging, single-stranded DNA, or do the lengths only extend as far as the DNA is double-stranded?


gff • 2.5k views
modified 9.4 years ago by Casbon3.2k • written 9.8 years ago by Daniel Standage3.9k

Just to be clear, when you say "store restriction sites", do you mean that you want to store a feature corresponding to the recognition sequence? So for example, you want to know whether EcoRI (GAATTC) should have start = 1 and end = 6?

written 9.8 years ago by Neilfws48k

Yeah, I guess that's not very clear from my question. I guess my question deals both with the recognition site and the resulting restriction fragments (which may or may not be stored as features in the same file). Basically, how do I store recognition sites and how is that interpreted in relation to the restriction fragments? If I ignore the sticky ends of the fragments, the combined lengths of the fragments will be less than the original sequence length, but if I include them then the combined length will be greater. Does this make sense?

written 9.8 years ago by Daniel Standage3.9k
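
The length bookkeeping described in the reply above can be sketched in a few lines of Python. This is only an illustration: the function name `fragment_length_totals` and the example numbers (a hypothetical 14 bp sequence with a single EcoRI site, top-strand cut after base 5 and bottom-strand cut after base 9) are assumptions, not something taken from the thread.

```python
def fragment_length_totals(seq_len, top_cuts, bottom_cuts):
    """Total fragment length under three counting conventions.

    top_cuts and bottom_cuts hold, per site, the number of bases to the
    left of the cut on the top and bottom strand (forward coordinates).
    Assumes the overhangs of different sites do not overlap.
    """
    # Bases that are single-stranded after digestion, summed over all sites.
    overhang = sum(abs(bottom - top) for top, bottom in zip(top_cuts, bottom_cuts))
    top_strand_only = seq_len            # top-strand pieces tile the sequence exactly
    duplex_only = seq_len - overhang     # overhang bases counted in no fragment
    with_overhangs = seq_len + overhang  # overhang bases counted in both neighbours
    return top_strand_only, duplex_only, with_overhangs


totals = fragment_length_totals(14, top_cuts=[5], bottom_cuts=[9])
print(totals)
```

Counting only one strand is the one convention under which the fragment lengths add up to the original sequence length.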

Yes, that helps. This is one of those questions that seems simple, until you think about it :-)

written 9.8 years ago by Neilfws48k
9.8 years ago by
Sydney, Australia
Neilfws48k wrote:

I have not seen examples of restriction recognition sequences stored as GFF3. However, I'll take an educated guess!

First, I think that it makes sense for start/end to refer to the palindromic sequence. In other words as we discussed above for EcoRI, start = 1, end = 6. The cleavage site could be stored in the last column (as a free text attribute or note).

Second, in terms of storing the fragments that result from digestion: I would not. But if I did, I would think of them in terms of only one strand. So for example, again using EcoRI, which cuts G^AATTC:


The fragments would be represented by 2 features: one of length 1 (G), the other of length 5 (AATTC). I guess they would best be described as children of the parent feature (the uncleaved site)? I would be interested in other opinions on that.

It's probably worth searching the SO using the term "restriction" to get a good sense of how sites, fragments etc. are defined.

modified 10 months ago by RamRS27k • written 9.8 years ago by Neilfws48k

So let's say for the bogus sequence ACGTGAATTCACGT, EcoRI would give us ACGTG and AATTCACGT, and our GFF feature would have start=5 and end=10?

modified 10 months ago by RamRS27k • written 9.8 years ago by Daniel Standage3.9k

That would be correct. OK, so you want to annotate sites within larger sequences, rather than storing just sites.

written 9.8 years ago by Neilfws48k
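
The coordinates in this exchange can be checked with a short Python sketch. The helper name `find_sites` is made up for illustration; it reports 1-based inclusive positions, as GFF3 uses, and the fragments are taken on the top strand only.

```python
def find_sites(sequence, site):
    """Yield (start, end) for each occurrence of site, 1-based inclusive."""
    pos = sequence.find(site)
    while pos != -1:
        yield pos + 1, pos + len(site)
        pos = sequence.find(site, pos + 1)


seq = "ACGTGAATTCACGT"
sites = list(find_sites(seq, "GAATTC"))
start, end = sites[0]

# EcoRI cuts G^AATTC, i.e. one base into the site on the top strand.
bases_left_of_cut = start + 1 - 1
left, right = seq[:bases_left_of_cut], seq[bases_left_of_cut:]
print(start, end, left, right)
```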
9.8 years ago by
iw9oel_ad6.1k wrote:

I'm not aware of a standard for this. It's a bit complex, given that restriction enzymes can have multiple cleavage sites, some of which may lie outside the recognition site, e.g.

HaeIV       (7/13)GAYNNNNNRTC(14/9)

in REBASE notation.

I would use multiple, linked GFF features; one for the recognition site, others for the cleavage site(s). Putting the REBASE notation into the feature tags wouldn't hurt.

Most annotation systems, including GFF3, generally ignore the issue of DNA strandedness for the main sequence. That said, there is the SO:0000984 (single strand) sequence attribute that can be applied to located sequence features. There isn't a specific SO term for sticky ends. For your digested fragments, I would use such features to indicate the sticky ends.

If you have a lot of annotation that will become public, I would contact the SO curators to discuss some specific new terms, because there don't seem to be any in SO at the moment, e.g. RE recognition site, sticky end etc.

modified 10 months ago by RamRS27k • written 9.8 years ago by iw9oel_ad6.1k
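
If the REBASE string is stored in the feature attributes, it can be split back into its parts when needed. The following Python sketch assumes the notation has the shape shown above (an optional bracketed pair, the recognition sequence, an optional bracketed pair); the names `parse_rebase` and `site_regex` are hypothetical, the caret form (G^AATTC) is not handled, and the numeric offsets are returned as written without interpreting them.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

REBASE_RE = re.compile(
    r"^(?:\((-?\d+)/(-?\d+)\))?([A-Z]+)(?:\((-?\d+)/(-?\d+)\))?$"
)


def parse_rebase(notation):
    """Split e.g. '(7/13)GAYNNNNNRTC(14/9)' into site and offset pairs."""
    match = REBASE_RE.match(notation)
    if match is None:
        raise ValueError(f"unrecognised notation: {notation!r}")
    up_a, up_b, site, down_a, down_b = match.groups()
    upstream = (int(up_a), int(up_b)) if up_a is not None else None
    downstream = (int(down_a), int(down_b)) if down_a is not None else None
    return {"site": site, "upstream": upstream, "downstream": downstream}


def site_regex(site):
    """Compile a degenerate recognition sequence into a regex (forward strand only)."""
    return re.compile("".join(IUPAC[base] for base in site))


parsed = parse_rebase("(7/13)GAYNNNNNRTC(14/9)")
hit = site_regex(parsed["site"]).search("TT" + "GACAAAAAGTC" + "TT")
print(parsed, hit.start() + 1, hit.end())  # 1-based inclusive start and end
```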

I forgot that some enzymes cut outside the recognition site. Based on neilfws' suggestion, I took a closer look at the SO terms, and I found 'restriction_enzyme_binding_site' and 'restriction_enzyme_cut_site'. An enzyme's binding site and recognition site are not necessarily synonymous, are they?

written 9.8 years ago by Daniel Standage3.9k

Sometimes the terms are used synonymously, e.g. in my copy of Genes (Lewin), but the precise definitions are not given (this is natural language). I'd expect an ontology to be more precise. Ah, I didn't spot restriction_enzyme_binding_site. According to the MISO browser, restriction_enzyme_cut_site is obsolete, by the way.

written 9.8 years ago by iw9oel_ad6.1k
9.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

If you put the REBASE nomenclature in your file to define the restriction site (for example, EcoRI is defined as G^AATTC), you should be able to get the position of the sticky ends on your sequence on both strands.

For example:

cut = {
    site: "G^AATTC",  ## cut positions are 1 and length()-1 = 5
    position: 1
}

then on the left side of the cutting site, the sequence ends at 1+1 on the "+" strand and ends at 1+5 on the "-" strand.

For the second part of your question, you should only consider the "+" strand. The sticky end won't be extended unless you use a DNA polymerase (e.g. T7 DNA polymerase, which has a 3' -> 5' exonuclease activity).

modified 10 months ago by RamRS27k • written 9.8 years ago by Pierre Lindenbaum129k

BioPerl uses the G^AATTC notation as well.

written 9.8 years ago by Lee Katz3.0k
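
As a rough Python sketch of how the G^AATTC notation yields a cut position on each strand: the function names are hypothetical, the bottom-strand calculation is only valid for palindromic sites, and the coordinate convention (1-based, reporting the last base to the left of each cut in top-strand coordinates) is an assumption chosen for this example.

```python
def cut_offsets(caret_site):
    """Return (site, top, bottom): bases left of the cut on each strand.

    Assumes a palindromic site, so the bottom-strand cut mirrors the
    top-strand cut: bottom = len(site) - top.
    """
    top = caret_site.index("^")
    site = caret_site.replace("^", "")
    return site, top, len(site) - top


def last_bases_left_of_cuts(site_start, caret_site):
    """Map the offsets onto a sequence, given the 1-based start of the site."""
    _, top, bottom = cut_offsets(caret_site)
    return site_start + top - 1, site_start + bottom - 1


print(cut_offsets("G^AATTC"))
print(last_bases_left_of_cuts(5, "G^AATTC"))
```

The bases between the two returned positions are the single-stranded overhang.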
9.8 years ago by
Casbon3.2k wrote:

Here's what I do:

16      xxx     restriction_enzyme_binding_site 69742532        69742536        .       .       .       enzyme=FatI;ID=4ca20b5cf9d96d3ddb000000;site=CATG
16      xxx     restriction_enzyme_binding_site 69743737        69743741        .       .       .       enzyme=FatI;ID=4ca20b5cf9d96d3ddb000001;site=CATG
16      xxx     restriction_fragment    69742532        69743741        .       .       .       enzymes=FatI;left_site=4ca20b5cf9d96d3ddb000000;ID=4ca20b5cf9d96d3ddb000059;right_site=4ca20b5cf9d96d3ddb000001

Site positions match the entire palindromic sequence; fragments extend over the single-stranded DNA.

But then FatI is not that complicated.

modified 10 months ago by RamRS27k • written 9.8 years ago by Casbon3.2k
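
Records in this style could be generated along the following lines. This is a minimal Python sketch, not the code behind the answer above: the function names, the `example` source, the `site1`/`frag1` IDs and the toy sequence are all made up, and coordinates are 1-based inclusive. The feature types and attribute keys are copied from the records shown above.

```python
def gff3_line(seqid, source, ftype, start, end, attributes):
    """Format one 9-column GFF3 line (score, strand and phase left as '.')."""
    attrs = ";".join(f"{key}={value}" for key, value in attributes.items())
    return "\t".join([seqid, source, ftype, str(start), str(end), ".", ".", ".", attrs])


def digest_to_gff3(seqid, sequence, enzyme, site, source="example"):
    """Emit one feature per site, then one fragment per pair of adjacent sites."""
    sites = []  # (feature ID, start, end), 1-based inclusive
    pos = sequence.find(site)
    while pos != -1:
        sites.append((f"site{len(sites) + 1}", pos + 1, pos + len(site)))
        pos = sequence.find(site, pos + 1)

    lines = [
        gff3_line(seqid, source, "restriction_enzyme_binding_site", start, end,
                  {"ID": site_id, "enzyme": enzyme, "site": site})
        for site_id, start, end in sites
    ]
    # A fragment runs from the start of its left site to the end of its right site.
    for number, (left, right) in enumerate(zip(sites, sites[1:]), start=1):
        lines.append(gff3_line(
            seqid, source, "restriction_fragment", left[1], right[2],
            {"ID": f"frag{number}", "enzymes": enzyme,
             "left_site": left[0], "right_site": right[0]}))
    return lines


lines = digest_to_gff3("chr_demo", "AACATGTTTTCATGAA", "FatI", "CATG")
print("\n".join(lines))
```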
9.8 years ago by
Atlanta, GA
Lee Katz3.0k wrote:

I believe in GFF3 you can use a hierarchical approach. Therefore, list the target site in one line.

Here's my "pseudogff": the first line lists the restriction site, and the second line gives the cut site, with start and stop both at position 2.

genomeId source ..... name=G^AATTC;seqid=cutSite1
cutSite1 source .. 2  2 ......
modified 10 months ago by RamRS27k • written 9.8 years ago by Lee Katz3.0k
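
One way to express such a hierarchy within a single GFF3 file is the standard ID/Parent attribute pair. The Python sketch below is illustrative only: the function name, the `example` source and the IDs are made up, and the child type defaults to the generic SO term `sequence_feature` as a placeholder, since which term best describes a cut site is left open here. The cut is placed on the last top-strand base to the left of the cut, which is an arbitrary convention chosen for the example.

```python
def site_and_cut_features(seqid, start, caret_site, site_id, cut_type="sequence_feature"):
    """Return two GFF3 lines: the recognition site and a child cut-site feature."""
    offset = caret_site.index("^")   # bases to the left of the top-strand cut
    site = caret_site.replace("^", "")
    end = start + len(site) - 1      # 1-based inclusive end of the site
    cut_base = start + offset - 1    # last top-strand base left of the cut
    parent = [seqid, "example", "restriction_enzyme_binding_site",
              str(start), str(end), ".", ".", ".",
              f"ID={site_id};Name={caret_site}"]
    child = [seqid, "example", cut_type,
             str(cut_base), str(cut_base), ".", "+", ".",
             f"ID={site_id}.cut;Parent={site_id}"]
    return ["\t".join(parent), "\t".join(child)]


parent_line, child_line = site_and_cut_features("seq1", 5, "G^AATTC", "site1")
print(parent_line)
print(child_line)
```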