Question: How to annotate repeats?
Zee_S wrote (21 months ago):

Hello Biostars community!

I have a database of consensus transposon sequences for a model organism. These sequences come from diverse families and range from very short SINE elements to longer LINEs and LTRs. I want to annotate these repeats on the reference genome and later use those coordinates to correlate repeat density with ChIP-seq signal.

I would be happy to get your suggestions on tools that I can use to annotate the repeats. Also, what kind of normalization must one do to account for the large sequence length variation between different repeat families? For example, a SINE element could give many hits on the genome just because it is relatively short, whereas an LTR may not. And how do you validate your annotations in the end?

Thank you very much for your guidance!


Hi! Have you tried RepeatMasker? http://www.repeatmasker.org/

written 21 months ago by alessandrotestori
Beuss (France) wrote (21 months ago):

Hi,

As usual, it all depends on which question you want to answer. If you want an idea of the global quantity of repeats in your genome, a quick annotation with RepeatMasker/Repbase can be enough. Or is there a public annotation layer already available?

But in your case, because you are looking for a link between binding sites and Transposable Element (TE) presence/absence, you should go for a deeper annotation of repeats. Maybe a TE database dedicated to your species exists and could be used with TEannot (REPET pipeline) or RepeatMasker to obtain a better annotation. If the available databases are too distant from the species you are analysing, or if no data are available, you should go for de novo detection/annotation of repeats. This is a big task, but if you want to be exhaustive, you have no choice.


"what kind of normalization one must do to account for the large sequence length variation between different repeat families. for example, a sine element could give many hits on the genome just because it is relatively short, whereas an LTR may not"

You are wrong on this one point. Unless your SINE copies are shorter than log4(N) + 1 base pairs (where N is your genome size in base pairs), these copies are real and not the product of chance. So you should not underestimate their importance in your analysis. Moreover, if that were the case, it would mean the annotation was of very poor quality.
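
To put a number on that threshold, here is a minimal sketch. It assumes a simple model in which the four bases are equally likely and independent (real genomes are biased, and strand is ignored), so a given sequence of length k is expected about N / 4^k times by chance. The function names and the example genome sizes are illustrative only, and the integer result is the rounded-down form of log4(N) + 1.

```python
def min_nonrandom_length(genome_size_bp: int) -> int:
    """Smallest length k (in bp) such that 4**k > N, i.e. floor(log4(N)) + 1.

    Under the equiprobable-base assumption, a specific sequence of this
    length is expected less than once by chance in a genome of N bp.
    """
    if genome_size_bp < 1:
        raise ValueError("genome size must be a positive number of base pairs")
    k = 0
    # Integer arithmetic avoids floating-point error near exact powers of 4.
    while 4 ** k <= genome_size_bp:
        k += 1
    return k


def expected_random_hits(genome_size_bp: int, length_bp: int) -> float:
    """Expected number of chance occurrences of one specific sequence
    of the given length (same simplifying assumptions as above)."""
    return genome_size_bp / 4 ** length_bp


if __name__ == "__main__":
    # Example genome sizes are illustrative, not taken from the question.
    for n in (120_000_000, 3_100_000_000):
        k = min_nonrandom_length(n)
        print(f"N = {n:>13,} bp -> threshold {k} bp, "
              f"expected chance hits at that length: {expected_random_hits(n, k):.3f}")
```

Under this model the threshold is 14 bp for a 120 Mb genome and 16 bp for a 3.1 Gb genome, so any hit of even a few dozen base pairs is far beyond what chance alone would produce.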


so how do you validate your annotations in the end?

You could validate your annotation by validating the consensus sequences: check that each consensus you used has at least 3 complete copies in the genome. But you also have to be aware that TEs can diverge very fast, so a lot of degraded copies of the original TE are also present in the genome. That is why I prefer to use several consensus sequences (one for each main group of degraded copies), i.e. TE models, to describe and annotate the whole diversity of TEs.
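
The "at least 3 complete copies" check can be scripted once the annotation hits are in tabular form. The sketch below is only illustrative: the `Hit` record, its field names, and the 95% coverage cutoff used to call a copy "complete" are assumptions, not something defined in this thread or by any particular tool. You would fill the records from whatever your annotation tool reports.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One genomic copy matched to a consensus (hypothetical record layout)."""
    consensus: str     # name of the consensus sequence that matched
    cons_start: int    # 1-based start of the match on the consensus
    cons_end: int      # 1-based inclusive end of the match on the consensus
    cons_length: int   # full length of the consensus in bp


def count_complete_copies(hits: Iterable[Hit], min_coverage: float = 0.95) -> Counter:
    """Count, per consensus, the copies covering at least `min_coverage`
    of the consensus length (the 0.95 default is an arbitrary assumption)."""
    counts: Counter = Counter()
    for hit in hits:
        covered = hit.cons_end - hit.cons_start + 1
        if covered / hit.cons_length >= min_coverage:
            counts[hit.consensus] += 1
    return counts


def unsupported_consensus(hits: Iterable[Hit], all_names: Iterable[str],
                          min_copies: int = 3,
                          min_coverage: float = 0.95) -> List[str]:
    """Return the consensus names with fewer than `min_copies` complete copies."""
    counts = count_complete_copies(hits, min_coverage)
    return sorted(name for name in all_names if counts[name] < min_copies)


if __name__ == "__main__":
    # Toy data for illustration only.
    example_hits = [
        Hit("SINE1", 1, 100, 100), Hit("SINE1", 3, 100, 100),
        Hit("SINE1", 1, 98, 100), Hit("SINE1", 40, 100, 100),
        Hit("LTR1", 1, 5000, 5000), Hit("LTR1", 1, 1200, 5000),
    ]
    print(unsupported_consensus(example_hits, ["SINE1", "LTR1", "LINE1"]))
```

A consensus flagged by this check is a candidate for closer inspection rather than automatic removal, since fast-diverging families may survive only as degraded copies.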

Anyway this is a very large subject with a lot of debate.

Here is a sample of publications for discovering the beautiful world of TEs and their annotation:


Thank you so much for your input! This is very helpful!

written 21 months ago by Zee_S

You're welcome. Do not hesitate if you have any other questions.

written 21 months ago by Beuss