Could you tell me how I can test whether the distribution of SNPs across each chromosome is uniform? Is there any software for this, or any idea how I can test it?
Do you mean the number of SNPs on each chromosome, or that they are uniformly distributed within a chromosome?
I mean uniformly distributed within a chromosome. I have files with the distance from each SNP to its closest neighbouring SNP, but I have no idea how to check for a uniform distribution based on this.
I guess a Poisson distribution makes more sense here, and in that case the distance between two adjacent SNPs should be gamma distributed.
Specifically, you'll need to do a goodness-of-fit test to a gamma distribution on the inter-SNP distances.
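A minimal sketch of such a goodness-of-fit test, using only the Python standard library. It assumes the simplest null model: if SNPs fall as a Poisson process, the gaps between adjacent SNPs are exponential, which is the gamma distribution with shape 1. Because the mean gap is estimated from the same data, textbook Kolmogorov-Smirnov p-values do not apply directly, so the p-value here comes from a parametric bootstrap instead. The chromosome length, SNP numbers, cluster layout and all function names are made up for illustration.

```python
import math
import random

def ks_distance_to_exponential(gaps):
    """Largest distance between the empirical CDF of `gaps` and the CDF of an
    exponential distribution whose mean is estimated from the same gaps."""
    xs = sorted(gaps)
    n = len(xs)
    mean_gap = sum(xs) / n
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-x / mean_gap)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def bootstrap_p_value(gaps, n_sim=200, seed=0):
    """Share of simulated exponential samples that fit at least as badly as
    the observed gaps. The statistic does not depend on scale, so rate 1 is
    enough for the simulations."""
    rng = random.Random(seed)
    observed = ks_distance_to_exponential(gaps)
    hits = 0
    for _ in range(n_sim):
        simulated = [rng.expovariate(1.0) for _ in gaps]
        if ks_distance_to_exponential(simulated) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)

def gaps_between(positions):
    """Distances between adjacent SNPs after sorting by position."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]

rng = random.Random(42)
chrom_length = 1_000_000  # hypothetical chromosome length in bp

# Toy data set 1: 500 SNPs scattered at random along the chromosome.
random_snps = [rng.uniform(0, chrom_length) for _ in range(500)]

# Toy data set 2: 50 tight clusters of 10 SNPs each (strongly non-random).
clustered_snps = []
for _ in range(50):
    centre = rng.uniform(1_000, chrom_length - 1_000)
    for _ in range(10):
        clustered_snps.append(centre + rng.uniform(-100, 100))

d_random = ks_distance_to_exponential(gaps_between(random_snps))
d_clustered = ks_distance_to_exponential(gaps_between(clustered_snps))
p_random = bootstrap_p_value(gaps_between(random_snps))
p_clustered = bootstrap_p_value(gaps_between(clustered_snps))

print("random:    D =", round(d_random, 3), " p =", round(p_random, 3))
print("clustered: D =", round(d_clustered, 3), " p =", round(p_clustered, 3))
```

With real data you would replace the toy positions with your own inter-SNP distances for one chromosome. A small p-value says the gaps do not look exponential, but not where along the chromosome the departure is.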
OK, thank you guys. So should I test whether the inter-SNP distances (between the two closest SNPs) follow a Poisson or a Gamma distribution? Which distribution should I use? Poisson is a discrete distribution, and Gamma is continuous.
And what about other information that I can use? E.g. I can divide the chromosome into 100 kb non-overlapping windows, count the number of SNPs in each window, and then test these counts for randomness (Poisson? Gamma?).
What different information about the distribution of SNPs can I get from the inter-SNP distances versus the number of SNPs per 100 kb window?
If you want to test whether the number of SNPs in a window is the same across a chromosome, use a Poisson.
If you want to test the inter-SNP distance, then use a Gamma.
Both approaches are valid, but the Poisson approach won't tell you whether SNPs are evenly distributed within a bin, only whether they are distributed evenly between bins. On the flip side, a Gamma will tell you whether the SNPs are evenly distributed, but not where the divergence from random is if a divergence is detected.
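To make the difference between the two views concrete, here is a small standard-library sketch. It builds two toy data sets with identical counts per window, one with SNPs spread at random and one with every SNP squeezed into the first 10% of its window, and then compares the coefficient of variation of the gaps (about 1 for exponential gaps). The chromosome length, window size, squeeze factor and function names are all assumptions made for the example.

```python
import random
import statistics

def window_counts(positions, chrom_length, window):
    """Number of SNPs in each non-overlapping window."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[min(int(pos // window), n_windows - 1)] += 1
    return counts

def gap_cv(positions):
    """Coefficient of variation (stdev / mean) of the inter-SNP distances."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return statistics.stdev(gaps) / statistics.mean(gaps)

rng = random.Random(7)
chrom_length = 5_000_000  # hypothetical chromosome length in bp
window = 100_000          # 100 kb windows, as in the question

# Toy data set 1: 500 SNPs placed at random.
even_snps = [rng.uniform(0, chrom_length) for _ in range(500)]

# Toy data set 2: same SNPs, each pushed into the first 10% of its window.
squeezed_snps = []
for pos in even_snps:
    start = (pos // window) * window
    squeezed_snps.append(start + (pos - start) * 0.1)

counts_even = window_counts(even_snps, chrom_length, window)
counts_squeezed = window_counts(squeezed_snps, chrom_length, window)

print("identical window counts:", counts_even == counts_squeezed)
print("gap CV, random SNPs:  ", round(gap_cv(even_snps), 2))
print("gap CV, squeezed SNPs:", round(gap_cv(squeezed_snps), 2))
```

Any test based only on the window counts cannot tell these two data sets apart, while the inter-SNP distances clearly can.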
Thank you! I have another question. I also have information about the distance from each SNP to the closest indel. Should I also use a gamma distribution to check whether SNPs are close to indels or whether it is random?
And the same question about the length of indels: how do I check whether the length of indels (bp) is random?
I think you should be able to model indel-SNP distance as a gamma, but I'm not 100% sure. The same applies to indel length. There is a case for it to be gamma, but I'm pretty sure that's not going to be the case, as longer indels are definitely more likely to be deleterious than short ones.
This paper talks about models for random distribution of indels. It might be helpful.
Rands et al "8.2% of the Human Genome Is Constrained: Variation in Rates of Turnover across Functional Element Classes in the Human Lineage"
See the section of the methods on the Neutral Indel Model.
Thank you. Could you tell me where I can find a reference stating that I should use a gamma distribution to test inter-SNP distance and a Poisson distribution for the number of SNPs in windows? I need a reference for why I am using these distributions instead of a uniform distribution.
The position of any one SNP within a window is uniformly distributed. The count of SNPs is Poisson and the inter-SNP distance is Gamma.
To be honest, the best reference for any of this would be any standard stats textbook with a section on modelling discrete events in time and space. But the paper I quote above, and the references within it (particularly Lunter et al), might be possibilities.
Ok. Thank you.
I still do not understand why I can't use a uniform distribution to check whether the number of variants in (non-overlapping) windows is randomly distributed or is constant.
Under a uniform distribution all outcomes are equally likely, so if we said, for example, that the number of SNPs in a 1kb window was uniformly distributed with distribution U(0,1000), we'd be saying that it was equally likely that the window would have 1 SNP as it was for it to have 100 or 1000. P(1)=P(100)=P(1000)≈0.001. This is the definition of a Uniform distribution.
If we want to say that SNPs are randomly distributed, then what we are really saying is that any given base is equally likely to have a SNP as any other. At each point we ask whether or not a SNP is present with a given probability (a bit like tossing a coin). This is called a Bernoulli trial. If we walk across 1000 bases this way, tossing a coin at each base, we can work out what the chance of getting each different number of SNPs is. The number of successes out of a total number of Bernoulli trials is technically a binomial distribution, but Poisson is a good approximation to binomial once you are talking about enough bases (it's the continuous time/space equivalent) and is way easier to work with. What we would find, if we had a 1kb window and a 1/100 SNP rate, is that most windows would contain around 10 SNPs; fewer would contain either 9 or 11.
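The coin-tossing argument above can be checked numerically with a few lines of standard-library Python. This sketch uses the same illustrative numbers (a 1kb window and a 1/100 SNP rate) and prints the binomial and Poisson probabilities side by side, together with the flat probability that a uniform distribution over 0 to 1000 SNPs would give. The function and variable names are made up for the example.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k SNPs in n bases, each a SNP with probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k SNPs when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n_bases = 1000            # 1kb window
snp_rate = 0.01           # 1 SNP per 100 bases
lam = n_bases * snp_rate  # expected SNPs per window

# A uniform distribution over 0..1000 SNPs gives every count the same chance.
uniform_p = 1 / (n_bases + 1)

print("k  binomial  Poisson  uniform")
for k in range(6, 15):
    print(k,
          round(binomial_pmf(k, n_bases, snp_rate), 4),
          round(poisson_pmf(k, lam), 4),
          round(uniform_p, 4))

# Most likely SNP count under the binomial model.
mode = max(range(0, 51), key=lambda k: binomial_pmf(k, n_bases, snp_rate))
print("most likely count:", mode)
```

The binomial and Poisson columns track each other closely and peak near the expected count, whereas the uniform column stays flat, which is why a uniform distribution is the wrong model for the counts.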
OK, thank you! Now I understand, but which value should I use for lambda in the Poisson distribution?
I would like to test this using Kolmogorov-Smirnov. I have a vector with the number of SNPs in each 100 kb non-overlapping window, but to test whether the number of SNPs is random (random number of SNPs = Poisson distribution, right?) I have to compare it with a Poisson distribution with parameter lambda. Lambda is unknown. Should I estimate lambda from my SNP counts, assuming that lambda = mean?
At this stage I would suggest to upvote/accept_as_answer the detailed comments by @i.sudbery which nicely explain the concepts behind.
Yes, divide the total number of SNPs on the Chr by the length of the Chr and multiply by the length of your window.
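As a rough sketch of that recipe, the snippet below estimates lambda and then checks the window counts with the variance-to-mean ratio (index of dispersion), which should be close to 1 for Poisson counts. This swaps in a dispersion test for the Kolmogorov-Smirnov test mentioned in the question, since the standard KS test assumes a continuous distribution with parameters fixed in advance. The p-value comes from re-scattering the same number of SNPs at random, so no distribution tables are needed. The toy data, the hotspot and all names are assumptions for illustration only.

```python
import random
import statistics

def window_counts(positions, chrom_length, window):
    """Number of SNPs in each non-overlapping window."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[min(int(pos // window), n_windows - 1)] += 1
    return counts

def dispersion_index(counts):
    """Variance-to-mean ratio; about 1 when counts are Poisson."""
    return statistics.variance(counts) / statistics.mean(counts)

def monte_carlo_p(positions, chrom_length, window, n_sim=500, seed=1):
    """Share of random re-scatterings at least as over-dispersed as the data."""
    rng = random.Random(seed)
    observed = dispersion_index(window_counts(positions, chrom_length, window))
    hits = 0
    for _ in range(n_sim):
        simulated = [rng.uniform(0, chrom_length) for _ in positions]
        if dispersion_index(window_counts(simulated, chrom_length, window)) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)

rng = random.Random(3)
chrom_length = 5_000_000  # hypothetical chromosome length in bp
window = 100_000          # 100 kb windows

# Toy data: 400 SNPs at random plus a hotspot of 100 SNPs in one window.
hotspot_snps = [rng.uniform(0, chrom_length) for _ in range(400)]
hotspot_snps += [rng.uniform(2_000_000, 2_100_000) for _ in range(100)]

# lambda = total SNPs / chromosome length * window length
lam = len(hotspot_snps) / chrom_length * window
counts = window_counts(hotspot_snps, chrom_length, window)

print("lambda (expected SNPs per window):", lam)
print("mean of observed window counts:   ", statistics.mean(counts))
print("index of dispersion:", round(dispersion_index(counts), 2))
p_hotspot = monte_carlo_p(hotspot_snps, chrom_length, window)
print("Monte Carlo p-value:", round(p_hotspot, 3))
```

When the chromosome length is an exact multiple of the window, this lambda is the same as the mean of the window counts, so "lambda = mean" and the recipe above agree. If the last window is shorter than the rest, it is probably safest to drop it before testing.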