Question: Statistics: Tandem Repeat Enrichment Between Two Sets Of Sequences
2
6.5 years ago by
PoGibas4.8k
Vilnius
PoGibas4.8k wrote:

I have two sets of sequences (>1000 sequences in each set; sequence lengths range from 1000 bp to 100000 bp) and tandem repeat hits for every sequence. I would like to test the hypothesis that the first set is enriched in tandem repeats.
Example:

``````
Set_1_Seq_1   NNNACGTACGTNACGTNNN...
Set_1_Seq_2   ACGTACGTNACGTNNNN...
Set_1_Seq_3   NACGTACGTACGTNNN...
...
Set_2_Seq_1   NNNNACGTACGTNNN...
Set_2_Seq_2   NNNNNNNNNNNNNNN...
Set_2_Seq_3   NNNACGTACGTNNNN...

Tandem repeat unit: ACGT
``````

How can I test whether Set_1 is enriched in tandem repeats compared to Set_2?

The approaches I have considered:

1. Count how many Set_1 sequences do and do not contain the tandem repeat, and do the same for Set_2.
Use Fisher's exact test on the resulting 2x2 table.
2. Count how many times the repeat unit appears in each sequence, then compare these per-sequence counts between sets.
(For the example above that would be Set_1: 3,3,3; Set_2: 2,0,2.)
Which test could I use for this comparison?
3. Calculate the percentage of each sequence covered by the tandem repeat, then compare coverage between sets.
Which test could I use for this comparison?
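To make the three statistics concrete, here is a minimal sketch of how they could be computed per sequence. It assumes the repeat unit is `ACGT`, counts every non-overlapping copy of the unit (including isolated copies, which is what the 3,3,3 / 2,0,2 tally above implies), and uses only the visible part of each truncated example sequence. The function name `repeat_stats` is hypothetical, and a real analysis would more likely take hits from a dedicated tandem repeat finder.

```python
import re

def repeat_stats(seq, unit="ACGT"):
    """Return (copies, percent_coverage) for non-overlapping copies of `unit`."""
    # re.findall scans left to right and returns non-overlapping matches.
    copies = len(re.findall(re.escape(unit), seq))
    coverage = 100.0 * copies * len(unit) / len(seq) if seq else 0.0
    return copies, coverage

# Visible prefixes of the example sequences (the "..." tails are not filled in).
set_1 = ["NNNACGTACGTNACGTNNN", "ACGTACGTNACGTNNNN", "NACGTACGTACGTNNN"]
set_2 = ["NNNNACGTACGTNNN", "NNNNNNNNNNNNNNN", "NNNACGTACGTNNNN"]

counts_1 = [repeat_stats(s)[0] for s in set_1]
counts_2 = [repeat_stats(s)[0] for s in set_2]
print(counts_1, counts_2)  # [3, 3, 3] [2, 0, 2]

# Approach 1 uses presence/absence, approach 3 uses the coverage value.
contains_1 = [int(c > 0) for c in counts_1]
coverage_2 = [repeat_stats(s)[1] for s in set_2]
```

Presence/absence, copy number, and coverage all come from the same pass over the sequence, so any of the three can be fed into whichever test is chosen.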

Example of data table:

``````
Seq_name       Length       Contains repeat (0/1)       Times of repeat       Coverage with repeat (%)
Set_1_Seq1      1000                 1                       20                         8
Set_1_Seq2      2000                 1                       50                         10
Set_1_Seq3      18000                1                       1000                       22
...
Set_2_Seq1      100000               1                       20                         0.4
Set_2_Seq2      5000                 0                       0                          0
Set_2_Seq3      10000                0                       0                          0
...
``````
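For approach 1, the "Contains repeat" column can be collapsed into a 2x2 table and tested with a one-sided Fisher's exact test. The sketch below is only an illustration: it uses just the three rows per set shown in the table (the elided rows are not filled in), and it computes the hypergeometric tail directly with the standard library rather than through a statistics package. The function name `fisher_one_sided` is hypothetical.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided p-value for enrichment in row 1 of the table [[a, b], [c, d]].

    a, b: Set_1 sequences with / without the repeat
    c, d: Set_2 sequences with / without the repeat
    """
    n_set1 = a + b            # number of Set_1 sequences
    n_with = a + c            # sequences containing the repeat, both sets
    n_total = a + b + c + d
    k_max = min(n_set1, n_with)
    # Hypergeometric upper tail: all tables with at least `a` hits in Set_1.
    tail = sum(
        comb(n_with, k) * comb(n_total - n_with, n_set1 - k)
        for k in range(a, k_max + 1)
    )
    return tail / comb(n_total, n_set1)

# "Contains repeat (0/1)" values from the rows shown in the table above.
contains = {"Set_1": [1, 1, 1], "Set_2": [1, 0, 0]}
a = sum(contains["Set_1"])
b = len(contains["Set_1"]) - a
c = sum(contains["Set_2"])
d = len(contains["Set_2"]) - c

p_value = fisher_one_sided(a, b, c, d)
print(p_value)  # 0.2
```

Note that presence/absence ignores both sequence length and the number of repeat copies, so with lengths ranging from 1000 bp to 100000 bp this test answers a narrower question than approaches 2 and 3.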

My question is: how can I test enrichment for a given tandem repeat between two sets of sequences?
- Is it OK to use Fisher's exact test for approach 1?
- Which test could I use for approaches 2 and 3?

I really hope someone will help me with this.

PS.:
A similar question was asked (how to find the enriched repeat elements between two sequences), but Fisher's test doesn't take the number of repeats into account.

Edit.
A nice example of repeat enrichment per set of sequences: Relationship of repetitive elements to EZH2 sites (from 22948768). In Figure A they calculated odds for different sites, but I can't work out from the article or supplements how they did this.

enrichment statistics • 2.3k views
modified 6.5 years ago by matted7.2k • written 6.5 years ago by PoGibas4.8k

What paper is that figure from?

Spreading of X chromosome inactivation via a hierarchy of defined Polycomb stations (Pinter et al, Genome Res. 2012)

Did you manage to find out more on this type of analysis? If yes, can you please share it?

What do you want to know?

1
6.5 years ago by
matted7.2k
Boston, United States
matted7.2k wrote:

There are a lot of reasonable ways to attack these problems, but my personal bias would be to assess the significance of all results with permutation tests. The basic idea is that you pick any test statistic you like (you gave solid choices as your #1, #2, and #3), then measure it on the original dataset and on many permuted versions of it. In your case, a permuted version is one where the "Set 1" and "Set 2" labels have been shuffled among the sequences. You then pick a significance threshold from the empirical distribution you get from the permuted datasets. With this, you only have to worry about the samples being exchangeable under the null hypothesis, as opposed to the stronger assumptions I'd think you'd have to make to apply specific parametric tests.
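A minimal sketch of the label-shuffling idea, assuming the test statistic is the difference in mean coverage between the sets (that choice, the function name `permutation_test`, and the default of 10000 permutations are illustrative assumptions, not something prescribed above). The input values are the coverage percentages from the rows shown in the question's table.

```python
import random

def mean_diff(x, y):
    """Test statistic: mean of x minus mean of y."""
    return sum(x) / len(x) - sum(y) / len(y)

def permutation_test(values_1, values_2, n_perm=10000, seed=0):
    """One-sided permutation p-value for 'values_1 tends to be larger'."""
    rng = random.Random(seed)
    observed = mean_diff(values_1, values_2)
    pooled = list(values_1) + list(values_2)
    n1 = len(values_1)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # equivalent to shuffling the set labels
        # Small tolerance guards against floating-point reordering effects.
        if mean_diff(pooled[:n1], pooled[n1:]) >= observed - 1e-12:
            extreme += 1
    # Add-one correction: the observed labeling counts as one permutation.
    return observed, (extreme + 1) / (n_perm + 1)

coverage_set_1 = [8, 10, 22]      # Set_1 rows shown in the question
coverage_set_2 = [0.4, 0, 0]      # Set_2 rows shown in the question
observed, p_value = permutation_test(coverage_set_1, coverage_set_2)
print(observed, p_value)
```

With only three values per set there are just 20 distinct ways to assign the labels, so the p-value cannot fall much below 1/20 here; with more than 1000 sequences per set the empirical distribution becomes far finer. Swapping `mean_diff` for a difference in medians, in copy counts, or in the fraction of sequences containing the repeat leaves the rest of the procedure unchanged.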
