I am working with a set of DNA motifs that are predicted as potential regulatory motifs (e.g. transcription factor binding sites). The motifs belong to several species, and I wanted to cluster these motifs via their Position Weight Matrices (PWMs) (also known as PSSMs) to collapse similar motifs together into groups.
A tool called MATLIGN (website here) does what I need, but the PWM format it requires differs from what I have. Its documentation states:
"Matrices must be in the frequency matrix format (only integer numbers are acceptable)"
The problem is that my PWM matrices do not have integer numbers but decimals instead. e.g.:
A C G T
1 0.000000 1.000000 0.000000 0.000000
2 1.000000 0.000000 0.000000 0.000000
3 0.000000 0.000000 1.000000 0.000000
4 0.000000 0.421755 0.000000 0.578245
5 0.289407 0.000000 0.282556 0.428038
In other words, instead of the decimal values I have in my matrix I need to have integer counts. Could anybody suggest what I can do? Would I need to create artificial counts?
That looks a lot like a position frequency matrix (PFM) where the counts were divided by the row total. Unless you know that a background nucleotide frequency was taken into account, you can probably just multiply everything by a constant and round to turn it into 'counts'. Alternatively, you could use a tool like TOMTOM, which doesn't require integers.
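The multiply-and-round suggestion could be sketched roughly as follows. This is only an illustration: the function name `frequencies_to_counts` and the constant `total=100` are assumptions, not anything MATLIGN prescribes, and it presumes the rows really are plain frequencies with no background correction.

```python
def frequencies_to_counts(rows, total=100):
    """Turn rows of frequencies into integer pseudo-counts.

    `total` is an arbitrary, assumed number of sequences per position;
    a larger value preserves more of the decimal precision.
    """
    return [[round(freq * total) for freq in row] for row in rows]


# The example matrix from the question (columns: A, C, G, T).
pfm = [
    [0.000000, 1.000000, 0.000000, 0.000000],
    [1.000000, 0.000000, 0.000000, 0.000000],
    [0.000000, 0.000000, 1.000000, 0.000000],
    [0.000000, 0.421755, 0.000000, 0.578245],
    [0.289407, 0.000000, 0.282556, 0.428038],
]

counts = frequencies_to_counts(pfm)
for position, row in enumerate(counts, start=1):
    print(position, *row)
```

Because each value is rounded independently, a row may occasionally sum to one more or one less than `total`; if the downstream tool needs identical row totals, that would have to be corrected separately.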
@UnivStudent: I have actually used TOMTOM before, but unfortunately it only does pairwise motif comparisons. I was hoping to use a more advanced method that carries out clustering as well.
A PWM contains less information than the actual counts. Where did you obtain the PWMs from? Try to find the counts or the actual sequences as well.
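A small illustration of why the normalised matrix carries less information than the counts: two hypothetical positions, one supported by 2 sequences and one by 100, give identical frequencies, so the original sample size cannot be recovered afterwards. The numbers below are made up for the example.

```python
def to_frequencies(counts):
    """Divide each count by the row total."""
    total = sum(counts)
    return [count / total for count in counts]


few = [1, 1, 0, 0]     # hypothetical position seen in 2 sequences
many = [50, 50, 0, 0]  # hypothetical position seen in 100 sequences

# Both rows normalise to the same frequencies, so the sample size is lost.
print(to_frequencies(few) == to_frequencies(many))
```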