We have built a strongly supported map from RAD markers in a fish species and want to assess whether our map shows strong synteny with the published Danio rerio genome. Up to now, we have used a somewhat crude method, but it gives very strong support for synteny.
I blasted all the markers (RAD tag sequences of just under 100 bp) against the Danio transcriptome (blastx). I then counted how many times each of our linkage groups hit each Danio chromosome. With this approach, we get a very non-random pattern in which each of our linkage groups hits strongly on one to three Danio chromosomes, suggesting very good synteny. For visualization, I then represent the result as a bubble plot where the size of each bubble represents the number of BLAST hits linking each linkage group (y axis) to each chromosome (x axis).
What I would like to know now is: is there an accepted method for assessing the strength of the synteny, ideally one that returns a p-value?
I can get finer-grained data from the BLAST results, but for now I am concentrating only on counting the number of times each of our linkage groups hits each Danio chromosome, e.g.:
Count  Us  Danio
3      1   1
5      1   2
1      1   3
...
8      28  24
0      28  25
One quick and dirty method I was thinking about is to use a bootstrap process and see how far our observed distribution lies from the bulk of randomly generated distributions. I am not sure what estimator of the distribution I would use for that, but it may not be too hard to think of a good one.
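As a rough illustration of the resampling idea above, here is a minimal sketch in Python. Strictly speaking it is a permutation test (shuffling labels) rather than a bootstrap, and it uses the Pearson chi-square statistic of the linkage group by chromosome count table as the summary statistic; that choice is an assumption, and other statistics would work too. The function names, the 0-based integer coding of linkage groups and chromosomes, and the simulated data are all hypothetical. It also assumes each BLAST hit can be treated as an independent observation, which may not hold for tightly linked markers.

```python
import numpy as np


def contingency(lg, chrom, n_lg, n_chrom):
    """Count hits for every (linkage group, chromosome) pair."""
    table = np.zeros((n_lg, n_chrom), dtype=int)
    np.add.at(table, (lg, chrom), 1)  # unbuffered add, handles repeated pairs
    return table


def chi_square_stat(table):
    """Pearson chi-square statistic of a count table."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / table.sum()
    mask = expected > 0  # skip cells belonging to empty rows or columns
    return float((((table - expected) ** 2)[mask] / expected[mask]).sum())


def permutation_test(lg, chrom, n_perm=2000, seed=0):
    """Shuffle chromosome labels among markers to build a null distribution."""
    rng = np.random.default_rng(seed)
    lg = np.asarray(lg)
    chrom = np.asarray(chrom)
    n_lg = int(lg.max()) + 1
    n_chrom = int(chrom.max()) + 1
    observed = chi_square_stat(contingency(lg, chrom, n_lg, n_chrom))
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(chrom)
        null[i] = chi_square_stat(contingency(lg, shuffled, n_lg, n_chrom))
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, float(p_value)


if __name__ == "__main__":
    # Simulated data for illustration only: 500 hits, 28 linkage groups,
    # 25 chromosomes, with about 70% of hits following a fixed mapping.
    rng = np.random.default_rng(42)
    lg = rng.integers(0, 28, size=500)
    random_chrom = rng.integers(0, 25, size=500)
    chrom = np.where(rng.random(500) < 0.7, lg % 25, random_chrom)
    stat, p = permutation_test(lg, chrom)
    print(f"chi-square = {stat:.1f}, permutation p = {p:.4f}")
```

The input here is one (linkage group, chromosome) pair per BLAST hit rather than the aggregated counts, because shuffling needs the individual hits. Shuffling the chromosome labels keeps both the number of hits per linkage group and the number per chromosome fixed, so the null distribution reflects only the loss of association between the two.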
I would love to have your thoughts on existing methods.
EDIT: A friend suggested doing a linear regression on the (x, y) coordinates of Genome1 vs. Genome2 positions. It seems a logical way to check for synteny, but I wonder what to test it against to show that it is really significant. I find the p-value of the regression itself a bit arbitrary as a basis for concluding whether or not there is synteny.
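One possible answer to "what to test it against" is again a shuffled null. The sketch below is only an illustration under several assumptions: it works within a single linkage group and chromosome pair, it swaps the linear regression for a Spearman rank correlation (map distances and physical positions need not be linearly related), and it uses the absolute value so that an inverted segment still counts as conserved order. The function names are hypothetical. Note that this measures conservation of marker order within a pair, which is a stricter property than markers simply falling on the same chromosome.

```python
import numpy as np


def ranks(values):
    """Ranks from 0 to n-1; ties are broken arbitrarily by sort order."""
    return np.argsort(np.argsort(values))


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])


def order_permutation_test(pos_map, pos_ref, n_perm=2000, seed=0):
    """Compare |rho| of observed positions against shuffled reference positions."""
    rng = np.random.default_rng(seed)
    pos_map = np.asarray(pos_map, dtype=float)
    pos_ref = np.asarray(pos_ref, dtype=float)
    observed = abs(spearman(pos_map, pos_ref))
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = abs(spearman(pos_map, rng.permutation(pos_ref)))
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, float(p_value)


if __name__ == "__main__":
    # Simulated positions for illustration only: 40 markers whose reference
    # positions follow the map order with some noise.
    rng = np.random.default_rng(7)
    pos_map = np.sort(rng.uniform(0, 100, size=40))            # e.g. cM
    pos_ref = pos_map * 5e5 + rng.normal(0, 5e6, size=40)      # e.g. bp
    rho, p = order_permutation_test(pos_map, pos_ref)
    print(f"|rho| = {rho:.3f}, permutation p = {p:.4f}")
```

Running this separately for each linkage group and chromosome pair would give many p-values, so some multiple-testing correction would be needed before drawing conclusions.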
As always, your thoughts are appreciated!