Question: How to annotate repeats?
Zee_S wrote (12 months ago):

Hello Biostars community!

I have a database of consensus transposon sequences for a model organism. These sequences come from diverse families and range from very short SINE elements to longer LINEs and LTRs. I want to annotate these repeats on the reference genome and later use those coordinates to correlate repeat density with ChIP-seq signal.

I would be happy to get your suggestions on tools that I can use to annotate the repeats. Also, what kind of normalization must one do to account for the large sequence length variation between different repeat families? For example, a SINE element could give many hits on the genome just because it is relatively short, whereas an LTR may not. And how do you validate your annotations in the end?

Thank you very much for your guidance!


Hi! Have you tried RepeatMasker? http://www.repeatmasker.org/

written 12 months ago by alessandrotestori

Beuss (France) wrote (12 months ago):

Hi,

As usual, it all depends on which question you want to answer. If you want an idea of the global quantity of repeats in your genome, a quick annotation with RepeatMasker/Repbase can be enough. Or is there a public annotation layer already available?

But in your case, because you are searching for a link between binding sites and Transposable Element (TE) presence/absence, you should go for a deeper annotation of repeats. Maybe a TE database dedicated to your species exists and could be used with TEannot (REPET pipeline) or RepeatMasker to obtain a better annotation. If the available databases are too distant from the species you are analysing, or if no data are available, you should go for a de novo detection/annotation of repeats. This is a big task, but if you want to be exhaustive, you have no choice.
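As a rough illustration of running RepeatMasker with a species-specific consensus library, here is a minimal sketch that only assembles the command line. The file names (genome.fa, te_consensus.fa), the output directory and the thread count are assumptions chosen for the example; the options used are -lib (custom library), -dir (output directory), -pa (parallel jobs) and -gff (also write GFF output).

```python
import shlex


def build_repeatmasker_command(genome_fasta, library_fasta, out_dir, threads=4):
    """Assemble a RepeatMasker call that uses a custom consensus library.

    All paths are placeholders supplied by the caller; nothing is executed here.
    """
    return [
        "RepeatMasker",
        "-lib", library_fasta,   # custom TE consensus library (FASTA)
        "-dir", out_dir,         # where RepeatMasker writes its output files
        "-pa", str(threads),     # number of parallel jobs
        "-gff",                  # additionally write annotations as GFF
        genome_fasta,
    ]


# Hypothetical file names, for illustration only.
cmd = build_repeatmasker_command("genome.fa", "te_consensus.fa", "rm_out")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once RepeatMasker is installed and the paths point to real files.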


"what kind of normalization one must do to account for the large sequence length variation between different repeat families. for example, a sine element could give many hits on the genome just because it is relatively short, whereas an LTR may not"

You are wrong on this one point. Unless your SINE copies are shorter than log4(N) + 1 base pairs (where N is your genome size in base pairs), these copies are real and not the result of chance. So you should not underestimate their importance in your analysis. Moreover, if that were the case, it would mean the annotation was of very poor quality.
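To get a feel for how small the log4(N) + 1 threshold is, here is a minimal sketch that evaluates it. The function name and the two genome sizes are assumptions chosen only for illustration.

```python
import math


def min_nonrandom_length(genome_size_bp):
    """Return log4(N) + 1, the length threshold mentioned above,
    for a genome of N base pairs."""
    return math.log(genome_size_bp, 4) + 1


# Illustrative genome sizes, not taken from the question.
for label, size in [("100 Mb genome", 100_000_000), ("3 Gb genome", 3_000_000_000)]:
    print(f"{label}: {min_nonrandom_length(size):.1f} bp")
```

Even for a genome of several gigabases the threshold stays below 20 bp, so copies of a SINE consensus that are around a hundred bases or longer are well above it.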


"so how do you validate your annotations in the end?"

You could validate your annotation by validating the consensus sequences: check that each consensus you used has at least 3 complete copies in the genome. But you also have to be aware that TEs can diverge very fast, so a lot of degraded copies of the original TE are also present in the genome. That is why I prefer to use several consensus sequences (one for each main group of degraded copies), i.e. TE models, to describe and annotate the whole diversity of TEs.
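The "at least 3 complete copies" check can be sketched as follows. This is only an illustration under stated assumptions: hits are given as (consensus name, start on consensus, end on consensus) with 1-based inclusive coordinates, and a copy counts as "complete" when it covers at least 95% of the consensus. That 95% cutoff, the function names and the example data are all hypothetical.

```python
def count_complete_copies(hits, consensus_lengths, min_fraction=0.95):
    """Count, per consensus, the genomic copies covering at least
    min_fraction of the consensus length (assumed definition of 'complete')."""
    counts = {name: 0 for name in consensus_lengths}
    for name, cons_start, cons_end in hits:
        covered = cons_end - cons_start + 1  # 1-based inclusive coordinates
        if covered / consensus_lengths[name] >= min_fraction:
            counts[name] += 1
    return counts


def unsupported_consensus(hits, consensus_lengths, min_copies=3, min_fraction=0.95):
    """Return the consensus names with fewer than min_copies complete copies."""
    counts = count_complete_copies(hits, consensus_lengths, min_fraction)
    return sorted(name for name, n in counts.items() if n < min_copies)


# Hypothetical consensus lengths and hits, for illustration only.
consensus_lengths = {"SINE_A": 150, "LTR_B": 5000}
hits = [
    ("SINE_A", 1, 150), ("SINE_A", 2, 150), ("SINE_A", 1, 148), ("SINE_A", 40, 150),
    ("LTR_B", 1, 5000), ("LTR_B", 1, 2000), ("LTR_B", 3000, 5000),
]
print(unsupported_consensus(hits, consensus_lengths))
```

In this toy data SINE_A has three near-full-length copies and passes, while LTR_B has only one and is flagged. In practice the hit coordinates would come from the annotation tool's output.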

Anyway this is a very large subject with a lot of debate.

Here is a sample of publications for discovering the beautiful world of TEs and their annotation:


Thank you so much for your input! This is very helpful!

written 12 months ago by Zee_S

You're welcome. Do not hesitate if you have any other questions.

written 12 months ago by Beuss