Question

How To Find The Enriched Repeat Elements Between Two Sequences

12.1 years ago

Free Man ▴ 180

Hi, I want to know which repeat elements are statistically enriched in one sequence compared to a background sequence. How should I perform such a statistical test?
For the repeat data, I have the RepeatMasker track from UCSC in BED format.
For example, what should I do if I want to know the enrichment of the tandem repeat “(CAG)n”?
Thanks.

repeats sequence enrichment • 4.7k views

ADD COMMENT • link updated 6 months ago by guliar • 0 • written 12.1 years ago by Free Man ▴ 180

For which repeat elements are you looking? Microsatellites or transposable elements?

ADD REPLY • link 12.1 years ago by Biomonika (Noolean) 3.2k

I just want to learn the statistical method for sequence enrichment analysis. To keep it simple, what if I want to test the tandem repeat “(CAG)n”, for example?

ADD REPLY • link 12.1 years ago by Free Man ▴ 180

score 7 · Answer 1 · 2012-03-02

Assuming that your UCSC repeatmasker BED file looks like this:

#genoName    genoStart    genoEnd    strand    repName    repClass    repFamily
chr1    16777160    16777470    +    AluSp    SINE    Alu
chr1    25165800    25166089    -    AluY    SINE    Alu
chr1    33553606    33554646    +    L2b    LINE    L2
chr1    50330063    50332153    +    L1PA10    LINE    L1
chr1    58720067    58720973    -    L1PA2    LINE    L1
chr1    75496180    75498100    +    L1MB7    LINE    L1

and you are interested in the repeat elements by family (such as Alu, L1, L2), you can view the problem as sampling repeat elements (with your sequence) from all elements in the genome. The following steps should give you a measure of enrichment along with a p-value.

First use BEDTools to retrieve all rep elements in your sequence from the UCSC BED file.

Then, for each rep element family you found in your seq, count

how often it appears in your seq = s
how often it appears in the genome = g

Then count

how many rep elements are in your seq in total = S
how many rep elements are in the genome in total = G
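
The counting steps above can be sketched in Python. This is only an illustration under stated assumptions: the toy table reuses the six example rows shown earlier, the region coordinates are made up, and the simple overlap check stands in for the BEDTools step rather than reproducing it. The function names `parse_bed` and `count_families` are hypothetical.

```python
from collections import Counter

# Toy stand-in for the UCSC RepeatMasker BED file (tab-separated),
# using the six example rows shown above.
BED_TEXT = """\
#genoName\tgenoStart\tgenoEnd\tstrand\trepName\trepClass\trepFamily
chr1\t16777160\t16777470\t+\tAluSp\tSINE\tAlu
chr1\t25165800\t25166089\t-\tAluY\tSINE\tAlu
chr1\t33553606\t33554646\t+\tL2b\tLINE\tL2
chr1\t50330063\t50332153\t+\tL1PA10\tLINE\tL1
chr1\t58720067\t58720973\t-\tL1PA2\tLINE\tL1
chr1\t75496180\t75498100\t+\tL1MB7\tLINE\tL1
"""

def parse_bed(text):
    """Yield (chrom, start, end, family) for each non-header line."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end, _strand, _name, _cls, family = line.split("\t")[:7]
        yield chrom, int(start), int(end), family

def count_families(records, region=None):
    """Count elements per family, optionally only those overlapping region."""
    counts = Counter()
    for chrom, start, end, family in records:
        if region is not None:
            r_chrom, r_start, r_end = region
            # Half-open interval overlap test, as used for BED coordinates.
            if chrom != r_chrom or end <= r_start or start >= r_end:
                continue
        counts[family] += 1
    return counts

# Hypothetical coordinates of "your sequence" (an assumption for the demo).
region = ("chr1", 16000000, 34000000)

genome_counts = count_families(parse_bed(BED_TEXT))       # g per family
seq_counts = count_families(parse_bed(BED_TEXT), region)  # s per family
G = sum(genome_counts.values())
S = sum(seq_counts.values())

for family, s in seq_counts.items():
    print(family, "s =", s, "g =", genome_counts[family], "S =", S, "G =", G)
```

To count by repeat name instead of family (for example “(CAG)n”), the same idea applies with the repName column as the key.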

Then,

f = s/S is the fraction of the element in your seq
F = g/G is the fraction of the element in the genome, and
f/F is the enrichment.

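To get a p-value for the enrichment, do a Fisher's exact test on the 2×2 contingency table built from s, g, S, and G (the element versus all other elements, in your sequence versus the genome).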
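
As a minimal sketch of this calculation, assume your sequence is itself part of the genome, so that the 2×2 table is (s, S − s) for the sequence and (g − s, (G − g) − (S − s)) for the rest of the genome. Under that assumption the one-sided Fisher's exact test is the upper tail of a hypergeometric distribution, which can be computed with the standard library alone. The function name `enrichment` and the example counts are made up for illustration; with SciPy installed, `scipy.stats.fisher_exact` on the same table with `alternative="greater"` should give the same p-value.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def enrichment(s, S, g, G):
    """Return (fold enrichment f/F, one-sided p-value for enrichment).

    s, S: count of the family and of all repeats in the sequence.
    g, G: count of the family and of all repeats in the genome.
    Assumes the sequence's repeats are a subset of the genome's repeats.
    """
    fold = (s / S) / (g / G)
    log_total = log_comb(G, S)
    p = 0.0
    # Sum P(X = k) for k >= s, where X is hypergeometric:
    # S draws from G elements, of which g belong to the family.
    for k in range(s, min(S, g) + 1):
        if S - k > G - g:
            continue  # impossible table, probability zero
        p += math.exp(log_comb(g, k) + log_comb(G - g, S - k) - log_total)
    return fold, min(p, 1.0)

# Made-up counts: 5 of 10 repeats in the sequence belong to a family
# that accounts for 50 of 1000 repeats in the genome.
fold, p_value = enrichment(5, 10, 50, 1000)
print("fold enrichment:", fold, "p-value:", p_value)
```

If you test many families or repeat names at once, the resulting p-values would normally need a multiple-testing correction.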

score 0 · Answer 2 · 2023-10-07

Slc1a2 Plpp3 Sfxn5 Pitpnc1 Cst3 Itih3 Phactr1 Tra2a Phkg1 Zfp949 Adrbk2 Polr2a Guf1 A930015D03Rik Slc4a4 Slc25a21 Slc6a11 Fgf14 Abca1 Chuk Zfp36l1 Slc7a11 Gabbr1 Msmo1 Cspg5 Camk2g Sgcd Cdh19 Igf2bp3 Galnt16 Clybl Tprkb Plp1 1700112E06Rik Gm4876 Meis1 Mtss1l 9330159F19Rik Vegfa L3mbtl3 Mgat5 Kcnj10 Arpp21 Dlg2 Robo2 Arhgef10l Nrg1 Ptn Hes5 Pcyt2 Ednrb Adra1a Gabra2 Clu Phyhipl Cables1 Emx2os Caskin1 Ptch1 Nav3 Nnat Lrig1