Dear Biostar community, I have a statistical bioinformatics question.
I need to compare enrichment for tandem repeats between two sets of sequences.
Example data looks like this:
Tandem Repeat    Set 1 %    Set 2 %
AATGACAT         0.3        0.1
ATATGC           6          0
...
I want to find out whether Set 1 is enriched or depleted for a given tandem repeat compared to Set 2. My idea was to compare the log2 ratio between the two sets. This is my awk solution:
awk '{print log($2/$3)/log(2)}' file
1.58496
inf
Set 1 is enriched for both tandem repeats; however, I don't know if this is the right way to solve this problem.
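One variant I have considered, as a rough sketch only: add a small pseudocount to both percentages so that a zero in Set 2 gives a finite log2 ratio instead of inf. The function name `log2_ratio` and the pseudocount value of 0.01 are my own assumptions, not an established standard, and I am not sure this is statistically appropriate.

```python
import math


def log2_ratio(set1_pct, set2_pct, pseudocount=0.01):
    """Return log2((set1 + pseudocount) / (set2 + pseudocount)).

    The pseudocount is an arbitrary, assumed value; it only serves to
    avoid division by zero when a repeat is absent from one set.
    """
    return math.log2((set1_pct + pseudocount) / (set2_pct + pseudocount))


# The two example rows from the table above: (repeat, Set 1 %, Set 2 %)
rows = [("AATGACAT", 0.3, 0.1), ("ATATGC", 6.0, 0.0)]

for repeat, set1_pct, set2_pct in rows:
    # Positive values suggest enrichment in Set 1, negative values depletion
    print(repeat, round(log2_ratio(set1_pct, set2_pct), 3))
```

The result depends on the pseudocount chosen, especially for repeats with very low coverage, so the value would need to be justified or varied.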
My question is:
Given two sets of sequences that differ in length, what is the best way to calculate tandem repeat enrichment (i.e. coverage by a specific repeat sequence)? Is it the log2 ratio?