I have a library of transposable elements identified de novo using REPET. I would like to find out what percentage of the genome each of these repeats masks individually, using RepeatMasker.
I'm worried that running RepeatMasker with a single TE consensus sequence in a library will mask instances in the genome which are similar, but in fact belong to a different family (they have less than 80% similarity to the consensus). So, I masked the genome using the entire library of repeats.
The individual breakdown of hits/HSPs (I'm using rmblast rather than cross_match) is in the *.out file. I have tallied all hits for a given repeat in this file and take that as the estimate of that repeat's share of the genome. When I sum these percentages, however, I get a number quite a bit larger than the percentage of the genome masked according to the .tbl file (64% in the .tbl file, 79% by summing the .out lines).
Where does this discrepancy arise? Is there a way to correct for it or get around it? Am I going about this all wrong?
Thanks in advance!
When using RepeatMasker, you can set the maximum allowed divergence from the consensus sequence with the -div option, e.g. -div 6 masks only repeats less than 6% diverged from the consensus.
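On the tallying itself: one plausible reason for the gap (an assumption on my part, not something verified against your files) is that annotations in the .out file can overlap, so summing the length of every line counts some bases more than once, whereas the masked total counts each base once. Below is a minimal Python sketch that merges overlapping intervals before counting. It assumes the usual whitespace-delimited .out layout (query name, begin and end in columns 5 to 7, repeat name in column 10); the function names and the two-line toy input are made up for illustration.

```python
from collections import defaultdict

def parse_out(lines):
    """Yield (query, begin, end, repeat_name) from RepeatMasker .out lines."""
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        yield fields[4], int(fields[5]), int(fields[6]), fields[9]

def merged_length(intervals):
    """Bases covered by 1-based inclusive intervals, overlaps counted once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

def coverage(lines):
    """Return ({repeat: covered bp}, total covered bp across all repeats)."""
    per_repeat = defaultdict(lambda: defaultdict(list))  # repeat -> query -> intervals
    all_hits = defaultdict(list)                         # query -> intervals
    for query, begin, end, name in parse_out(lines):
        per_repeat[name][query].append((begin, end))
        all_hits[query].append((begin, end))
    by_repeat = {
        name: sum(merged_length(iv) for iv in queries.values())
        for name, queries in per_repeat.items()
    }
    total = sum(merged_length(iv) for iv in all_hits.values())
    return by_repeat, total

# Toy input (invented): two hits from different families overlapping by 50 bp.
TOY_OUT = """\
   SW   perc perc perc  query     position in query    matching  repeat        position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family  begin  end (left)  ID

  500   10.0  0.0  0.0  chr1      1     100   (900)  + TE_A  Unknown  1  100  (0)  1
  400   12.0  0.0  0.0  chr1      51    150   (850)  + TE_B  Unknown  1  100  (0)  2 *
"""

by_repeat, total = coverage(TOY_OUT.splitlines())
print(by_repeat, sum(by_repeat.values()), total)
```

In the toy input each family covers 100 bp on its own, but together they cover only 150 bp, so the per-family sum overshoots the merged total. Dividing each value by the genome length gives percentages; if overlaps are the cause, the merged total should land much closer to the .tbl figure than the per-family sum does.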