How to Represent Restriction Sites in GFF3
5
6
12.5 years ago

I've been looking for examples of how to store restriction sites in GFF3 format, but I have been unable to find any. Assuming I use the correct SO term in the third ('type') column, it shouldn't be too hard. My questions are mostly details.

• Should the start and stop positions correspond to the recognized palindromic sequence, or should they correspond to the exact cleavage site?
• When a sequence is digested by a restriction enzyme, do the lengths of the resulting fragments include the sticky, overhanging, single-stranded DNA, or do the lengths only extend as far as the DNA is double-stranded?

Thanks!

0

Just to be clear, when you say "store restriction sites" - you mean that you want to store a feature corresponding to the recognition sequence? So for example, you want to know if EcoRI (GAATTC) should have start = 1 and end = 6 ?

0

Yeah, I guess that's not very clear from my question. I guess my question deals both with the recognition site and the resulting restriction fragments (which may or may not be stored as features in the same file). Basically, how do I store recognition sites and how is that interpreted in relation to the restriction fragments? If I ignore the sticky ends of the fragments, the combined lengths of the fragments will be less than the original sequence length, but if I include them then the combined length will be greater. Does this make sense?

0

Yes, that helps. This is one of those questions that seems simple, until you think about it :-)

4
12.5 years ago
Neilfws 49k

I have not seen examples of restriction recognition sequences stored as GFF3. However, I'll take an educated guess!

First, I think it makes sense for start/end to refer to the palindromic recognition sequence. In other words, as we discussed above for EcoRI, start = 1 and end = 6. The cleavage site could be stored in the last column (as a free-text attribute or note).

Second, in terms of storing the fragments that result from digestion - I would not. But if I did, I would think of them in terms of only one strand. So for example, again using EcoRI:

G^AATTC


The fragments would be represented by 2 features: one of length 1 (G), the other of length 5 (AATTC). I guess they would best be described as children of the parent feature (the uncleaved site) ? Would be interested in other opinions on that.

It's probably worth searching the SO using the term "restriction" to get a good sense of how sites, fragments etc. are defined.

0

So let's say for the bogus sequence ACGTGAATTCACGT, EcoRI would give us ACGTG and AATTCACGT, and our GFF feature would have start=5 and end=10?

0

That would be correct. OK, so you want to annotate sites within larger sequences, rather than storing just sites.
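The coordinate convention agreed on above can be checked with a short Python sketch. It assumes a linear sequence, 1-based inclusive coordinates as GFF3 uses, and the SO term `restriction_enzyme_binding_site` mentioned elsewhere in this thread. The function names, the `seq1` identifier, the `example` source, and the choice of `Name`/`Note` attributes are illustrative only, not an established convention.

```python
def find_sites(seq, site):
    """Return 1-based inclusive (start, end) pairs for every match of site in seq."""
    seq, site = seq.upper(), site.upper()
    hits = []
    i = seq.find(site)
    while i != -1:
        # Python indices are 0-based; GFF3 coordinates are 1-based and inclusive.
        hits.append((i + 1, i + len(site)))
        i = seq.find(site, i + 1)
    return hits


def gff3_line(seqid, start, end, enzyme, notation):
    """Format one GFF3 row (nine tab-separated columns).

    The cut notation goes into a free-text Note attribute, as suggested above.
    """
    attributes = f"Name={enzyme};Note={notation}"
    cols = [seqid, "example", "restriction_enzyme_binding_site",
            str(start), str(end), ".", ".", ".", attributes]
    return "\t".join(cols)


seq = "ACGTGAATTCACGT"
for start, end in find_sites(seq, "GAATTC"):
    # For this sequence the single EcoRI site is reported as start=5, end=10.
    print(gff3_line("seq1", start, end, "EcoRI", "G^AATTC"))
```

This only handles an exact, unambiguous recognition sequence; degenerate sites would need pattern matching instead of `str.find`.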

3
12.5 years ago

I'm not aware of a standard for this. It's a bit complex given that restriction enzymes can have multiple cleavage sites, some of which may be outside of the recognition site. e.g.

HaeIV       (7/13)GAYNNNNNRTC(14/9)


in REBASE notation.

I would use multiple, linked GFF features; one for the recognition site, others for the cleavage site(s). Putting the REBASE notation into the feature tags wouldn't hurt.
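If the REBASE notation is kept in a feature tag, it helps to be able to take it apart again. Below is a minimal Python sketch that splits the two shapes of notation quoted in this thread into the recognition sequence and the raw cut information. The function name and the returned dictionary keys are hypothetical, and the sketch deliberately does not interpret what the bracketed numbers mean in sequence coordinates; that should be checked against the REBASE documentation.

```python
import re

# Optional "(a/b)" prefix, the site itself (possibly with a caret), optional "(a/b)" suffix.
_REBASE_PATTERN = re.compile(
    r"(?:\((-?\d+)/(-?\d+)\))?([A-Z^]+)(?:\((-?\d+)/(-?\d+)\))?"
)


def parse_rebase(notation):
    """Split a REBASE-style pattern into recognition sequence and raw cut info.

    Handles the two shapes seen in this thread:
      'G^AATTC'                  caret marks the cut inside the site
      '(7/13)GAYNNNNNRTC(14/9)'  bracketed number pairs flank the site
    """
    m = _REBASE_PATTERN.fullmatch(notation)
    if m is None:
        raise ValueError(f"unrecognised notation: {notation!r}")
    up_a, up_b, body, down_a, down_b = m.groups()
    caret = body.find("^")
    return {
        "site": body.replace("^", ""),
        # Number of bases before the caret, or None when no caret is present.
        "caret_offset": caret if caret != -1 else None,
        "upstream": (int(up_a), int(up_b)) if up_a is not None else None,
        "downstream": (int(down_a), int(down_b)) if down_a is not None else None,
    }


print(parse_rebase("G^AATTC"))
print(parse_rebase("(7/13)GAYNNNNNRTC(14/9)"))
```

Each non-empty field could then become its own linked feature, with the original string preserved as a tag so nothing is lost in translation.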

Most annotation systems, including GFF3, generally ignore the issue of DNA strandedness for the main sequence. That said, there is the SO:0000984 (single strand) sequence attribute that can be applied to located sequence features. There isn't a specific SO term for sticky ends. For your digested fragments, I would use such features to indicate the sticky ends.

If you have a lot of annotation that will become public, I would contact the SO curators to discuss some specific new terms, because there don't seem to be any in SO at the moment, e.g. RE recognition site, sticky end, etc.

0

I forgot that some enzymes cut outside the recognition site. Based on neilfws' suggestion, I took a closer look at the SO terms, and I found 'restriction_enzyme_binding_site' and 'restriction_enzyme_cut_site'. An enzyme's binding site and recognition site are not necessarily synonymous, are they?

0

Sometimes the terms are used synonymously, e.g. in my copy of Genes (Lewin), but the precise definitions are not given (this is natural language). I'd expect an ontology to be more precise. Ah, I didn't spot restriction_enzyme_binding_site. According to the MISO browser, restriction_enzyme_cut_site is obsolete, by the way.

2
12.5 years ago

If you put the REBASE nomenclature in your file to define the restriction site (for example, EcoRI is defined as:

G^AATTC


) you should be able to get the position of the sticky ends on your sequence on both strands.

For example:

cut = {
    enzyme: "EcoRI",
    site: "G^AATTC",  ## cut positions are 1 and length()-1=5
    position: 1
}


Then, on the left side of the cut site, the sequence ends at 1+1 on the "+" strand and at 1+5 on the "-" strand.

For the second part of your question, you should only consider the "+" strand. The sticky end won't be extended unless you use a DNA polymerase (e.g. T7 DNA polymerase has a 3' -> 5' exonuclease activity).
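To make the strand arithmetic concrete, here is a hedged Python sketch. It assumes a palindromic site written in caret notation, a linear sequence, and that `site_start` is the 1-based GFF3 start of the recognition sequence (so the numbers differ from a 0-based offset). The function names are made up for illustration. It shows that counting fragments on the "+" strand only makes their lengths add up to the original sequence length.

```python
def cut_coordinates(site_start, notation):
    """Return the last base left of the cut on each strand, in + strand coordinates.

    Assumes a palindromic site, so the - strand cut mirrors the + strand cut.
    """
    offset = notation.index("^")      # bases before the cut on the + strand
    length = len(notation) - 1        # site length without the caret
    plus_left_end = site_start + offset - 1
    minus_left_end = site_start + (length - offset) - 1
    return plus_left_end, minus_left_end


def plus_strand_fragments(seq_len, plus_left_ends):
    """Split 1..seq_len into fragments at the given + strand cut positions."""
    fragments = []
    start = 1
    for end in sorted(plus_left_ends):
        fragments.append((start, end))
        start = end + 1
    fragments.append((start, seq_len))
    return fragments


seq = "ACGTGAATTCACGT"                    # EcoRI site occupies positions 5..10
plus_end, minus_end = cut_coordinates(5, "G^AATTC")
overhang = abs(minus_end - plus_end)      # single-stranded bases at the sticky end
fragments = plus_strand_fragments(len(seq), [plus_end])
total = sum(end - start + 1 for start, end in fragments)
print(plus_end, minus_end, overhang, fragments, total)
```

For the example sequence the "+" strand fragments are 1..5 and 6..14, which sum to 14, the full length; the four-base overhang is counted once rather than twice or not at all.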

0

BioPerl uses the G^AATTC notation as well.

2
12.5 years ago
Casbon ★ 3.2k

Here's what I do:

16      xxx     restriction_enzyme_binding_site 69742532        69742536        .       .       .       enzyme=FatI;ID=4ca20b5cf9d96d3ddb000000;site=CATG
16      xxx     restriction_enzyme_binding_site 69743737        69743741        .       .       .       enzyme=FatI;ID=4ca20b5cf9d96d3ddb000001;site=CATG
16      xxx     restriction_fragment    69742532        69743741        .       .       .       enzymes=FatI;left_site=4ca20b5cf9d96d3ddb000000;ID=4ca20b5cf9d96d3ddb000059;right_site=4ca20b5cf9d96d3ddb000001


Site positions match the entire palindromic sequence; fragments extend over the single-stranded DNA.

But then FatI is not that complicated.
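The convention in those three rows (a fragment runs from the start of its left site to the end of its right site) can be sketched in Python as follows. The dictionary layout and function name are hypothetical; the coordinates and IDs are copied from the rows above, and all sites are assumed to lie on the same sequence.

```python
def fragments_from_sites(sites):
    """Build one fragment per pair of adjacent sites on the same sequence.

    Each site is a dict with 'id', 'start' and 'end' (1-based, inclusive).
    A fragment spans from its left site's start to its right site's end,
    so it covers the single-stranded overhangs at both ends.
    """
    ordered = sorted(sites, key=lambda site: site["start"])
    fragments = []
    for left, right in zip(ordered, ordered[1:]):
        fragments.append({
            "start": left["start"],
            "end": right["end"],
            "left_site": left["id"],
            "right_site": right["id"],
        })
    return fragments


sites = [
    {"id": "4ca20b5cf9d96d3ddb000001", "start": 69743737, "end": 69743741},
    {"id": "4ca20b5cf9d96d3ddb000000", "start": 69742532, "end": 69742536},
]
for fragment in fragments_from_sites(sites):
    print(fragment)
```

Note that with this convention neighbouring fragments overlap by one recognition site, so their lengths sum to more than the sequence length; that is the trade-off raised in the original question.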

1
12.5 years ago
Lee Katz ★ 3.1k

I believe in GFF3 you can use a hierarchical approach. Therefore, list the target site in one line.

Here's my "pseudogff" which lists the restriction site. Then, the start/stop as position 2 for the cut site.

genomeId source ..... name=G^AATTC;seqid=cutSite1
cutSite1 source .. 2  2 ......