How To Find The Enriched Repeat Elements Between Two Sequences
Hi, I want to know which repeat elements are statistically enriched in one sequence compared to a background sequence. How should I perform such a statistical calculation?
For the repeat data, I have the RepeatMasker track in BED format from UCSC.
For example, what should I do if I want to know the enrichment of the tandem repeat “(CAG)n”?
Thanks.

Which repeat elements are you looking for? Microsatellites or transposable elements?

I just want to learn the statistical method for sequence enrichment analysis. So, to keep it simple, what if I want to know the enrichment of the tandem repeat “(CAG)n”, for example?

#genoName    genoStart    genoEnd    strand    repName    repClass    repFamily
chr1    16777160    16777470    +    AluSp    SINE    Alu
chr1    25165800    25166089    -    AluY    SINE    Alu
chr1    33553606    33554646    +    L2b    LINE    L2
chr1    50330063    50332153    +    L1PA10    LINE    L1
chr1    58720067    58720973    -    L1PA2    LINE    L1
chr1    75496180    75498100    +    L1MB7    LINE    L1


If your RepeatMasker table looks like the one above and you are interested in the repeat elements by family (such as Alu, L1, L2), you can view the problem as sampling repeat elements (with your sequence) from all elements in the genome. The following steps should give you a measure of enrichment along with a p-value.

First, use BEDTools to retrieve all repeat elements in your sequence from the UCSC BED file.
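
As a rough illustration of what this retrieval step does, here is a plain-Python stand-in (not BEDTools itself) that keeps every repeat record overlapping a query interval. The function name `repeats_in_region`, the query region, and the assumption that columns are whitespace-separated in the order genoName, genoStart, genoEnd, strand, repName, repClass, repFamily are illustrative choices, not something from the UCSC file specification.

```python
def repeats_in_region(bed_lines, chrom, start, end):
    """Return repeat records overlapping the half-open interval [start, end) on chrom.

    Assumed column order (hypothetical, matching the example table):
    genoName, genoStart, genoEnd, strand, repName, repClass, repFamily.
    """
    hits = []
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split()
        rec_chrom, rec_start, rec_end = fields[0], int(fields[1]), int(fields[2])
        # Two half-open intervals overlap when each starts before the other ends.
        if rec_chrom == chrom and rec_start < end and rec_end > start:
            hits.append(fields)
    return hits


example = """#genoName genoStart genoEnd strand repName repClass repFamily
chr1 16777160 16777470 + AluSp SINE Alu
chr1 25165800 25166089 - AluY SINE Alu
chr1 33553606 33554646 + L2b LINE L2
chr1 50330063 50332153 + L1PA10 LINE L1
""".splitlines()

# Hypothetical query region: chr1:20,000,000-40,000,000
in_seq = repeats_in_region(example, "chr1", 20_000_000, 40_000_000)
print([rec[4] for rec in in_seq])  # ['AluY', 'L2b']
```

For a whole-genome file, BEDTools will be much faster than a loop like this; the sketch is only meant to show which records count as "in your sequence".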

Then, for each rep element family you found in your seq, count

• how often it appears in your seq = s

• how often it appears in the genome = g

Then count

• how many rep elements are in your seq in total = S

• how many rep elements are in the genome in total = G
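
The four counts above can be sketched in Python roughly as follows. The records are toy data in the column order of the example table, and treating the first three records as "your sequence" is purely an assumption for illustration.

```python
from collections import Counter

# Toy records: genoName, genoStart, genoEnd, strand, repName, repClass, repFamily
genome_records = [
    ("chr1", 16777160, 16777470, "+", "AluSp", "SINE", "Alu"),
    ("chr1", 25165800, 25166089, "-", "AluY", "SINE", "Alu"),
    ("chr1", 33553606, 33554646, "+", "L2b", "LINE", "L2"),
    ("chr1", 50330063, 50332153, "+", "L1PA10", "LINE", "L1"),
    ("chr1", 58720067, 58720973, "-", "L1PA2", "LINE", "L1"),
    ("chr1", 75496180, 75498100, "+", "L1MB7", "LINE", "L1"),
]
# Assumption for illustration only: the first three records lie in your sequence.
seq_records = genome_records[:3]


def family_counts(records):
    """Count how many records belong to each repFamily (7th column)."""
    return Counter(rec[6] for rec in records)


seq_counts = family_counts(seq_records)        # s for each family
genome_counts = family_counts(genome_records)  # g for each family
S = sum(seq_counts.values())                   # all repeat elements in your sequence
G = sum(genome_counts.values())                # all repeat elements in the genome

for family in seq_counts:
    print(family, "s =", seq_counts[family], "g =", genome_counts[family],
          "S =", S, "G =", G)
```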

Then,

• f = s/S is the fraction of the element in your seq

• F = g/G is the fraction of the element in the genome, and

• f/F is the enrichment.

To get a p-value for the enrichment, do a Fisher's exact test with s, g, S, and G.
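
A minimal sketch of the enrichment and the test, using only the Python standard library. It rests on one reading of the answer that is an assumption here: your sequence is part of the genome, so the s elements are a subset of the g elements, and the one-sided Fisher's exact test reduces to a hypergeometric tail probability. If your background is a separate sequence, the 2x2 table would be set up differently. The function names and the example counts are hypothetical.

```python
from math import comb


def enrichment(s, S, g, G):
    """Fold enrichment f/F, with f = s/S and F = g/G."""
    return (s / S) / (g / G)


def fisher_enrichment_pvalue(s, S, g, G):
    """One-sided Fisher's exact test for enrichment.

    Assumes the sequence is a subset of the genome: drawing S elements out of
    G, of which g belong to the family, what is the probability of seeing s
    or more family members? This is the upper tail of a hypergeometric
    distribution.
    """
    total = comb(G, S)
    upper = min(g, S)
    tail = sum(comb(g, k) * comb(G - g, S - k) for k in range(s, upper + 1))
    return tail / total


# Hypothetical counts: 30 of 200 elements in the sequence belong to the family,
# versus 1000 of 20000 elements genome-wide.
print("enrichment:", enrichment(30, 200, 1000, 20000))
print("p-value:   ", fisher_enrichment_pvalue(30, 200, 1000, 20000))
```

If you test many families at once, the resulting p-values would normally need a multiple-testing correction.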

Thanks a lot. Is this a generally accepted method of calculation?
I agree with your s and g, but I think S and G should be theoretical frequencies rather than counts of all repeat elements: assuming the length of the repeat is x, and the lengths of my sequence and the genome are m and n respectively, then S = m/x and G = n/x.
What do you think?

Yes, you could as well use sequence lengths instead of simple counts. Not sure if there is a generally accepted method.
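
For what it's worth, the length-based variant can be sketched as below. One thing worth noting, simple algebra rather than anything stated in the thread: with S = m/x and G = n/x, the ratio f/F simplifies to (s/m)/(g/n), so the repeat length x drops out of the enrichment. The function name and all numbers are hypothetical.

```python
def length_based_enrichment(s, m, g, n, x):
    """Enrichment using theoretical totals S = m/x and G = n/x.

    s, g: occurrences of the repeat in your sequence and in the genome
    m, n: lengths of your sequence and of the genome (bp)
    x:    length of one repeat element (bp)
    """
    S = m / x  # how many repeat-sized units fit in your sequence
    G = n / x  # how many repeat-sized units fit in the genome
    return (s / S) / (g / G)


# Hypothetical numbers: 5 hits in a 1 Mb sequence, 1000 hits in a 3 Gb genome.
e_short = length_based_enrichment(5, 1_000_000, 1000, 3_000_000_000, x=30)
e_long = length_based_enrichment(5, 1_000_000, 1000, 3_000_000_000, x=300)
print(e_short, e_long)  # both are approximately 15, independent of x
```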