Question: bedtools shuffle vs. random
blur wrote:

Hi! I want to create a random list of BED locations and use it to test whether the intersection between my dataset and a dataset from a paper I read is significant. I am not sure which tool is more logical for what I want to do: shuffle or random? I want to use the ends of genes. Is there a big difference between creating a specific file to use as the genome in random and adding an incl file to shuffle?

Thank you for your help!

bedtools • 1.7k views
modified 2.4 years ago by bernatgel • written 2.4 years ago by blur
A. Domingues wrote:

random creates random locations of one fixed length, whereas shuffle creates locations that are length-matched to an input BED file. One downside of random is that the strand of each location is also random, which means that if there is some strand bias in your experimentally derived data, you might get significant differences that are not real.
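
To make the difference concrete, here is a toy Python sketch of the two behaviours as described above. It is an illustration only, not the bedtools implementation: the genome sizes, the example intervals, and both function names are made up, and chromosomes are picked uniformly rather than weighted by length.

```python
import random

# Hypothetical toy genome: chromosome name -> length in bp.
GENOME = {"chr1": 10_000, "chr2": 5_000}

def random_intervals(genome, n, length, rng):
    """Idea behind `random`: n intervals of one fixed length, strand drawn at random."""
    out = []
    for _ in range(n):
        chrom = rng.choice(sorted(genome))
        start = rng.randrange(0, genome[chrom] - length + 1)
        out.append((chrom, start, start + length, rng.choice("+-")))
    return out

def shuffle_intervals(intervals, genome, rng):
    """Idea behind `shuffle`: relocate each input interval, keeping its length and strand."""
    out = []
    for _chrom, start, end, strand in intervals:
        length = end - start
        # Only chromosomes long enough to hold the interval are candidates.
        candidates = [c for c in sorted(genome) if genome[c] >= length]
        new_chrom = rng.choice(candidates)
        new_start = rng.randrange(0, genome[new_chrom] - length + 1)
        out.append((new_chrom, new_start, new_start + length, strand))
    return out

rng = random.Random(0)
# Strand-biased input with mixed lengths (all on the + strand).
observed = [("chr1", 100, 350, "+"), ("chr1", 900, 1000, "+"), ("chr2", 40, 540, "+")]
shuffled = shuffle_intervals(observed, GENOME, rng)
fixed = random_intervals(GENOME, 3, 200, rng)
print(shuffled)
print(fixed)
```

The shuffled set keeps the input's length distribution and strand bias, while the fixed-length set shares neither with the input.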

I want to use the end of genes

Will these be defined as regions X bp from the transcription termination site, that is, will they all have the same size? If not, and in view of the strand issues, my advice would be to use shuffle. It will alleviate strand bias issues and give you more control over what your control regions look like, that is, they will be better matched to your locations.
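
If the gene ends are defined as fixed-size windows, the strand decides which coordinate is the 3' end. A minimal sketch, assuming BED-style 0-based half-open coordinates and taking "end of gene" to mean the last `width` bp of the gene; the function name and the example gene are hypothetical:

```python
def gene_end_window(chrom, start, end, strand, width):
    """Return the last `width` bp of a gene, measured from its 3' end.

    On the + strand the 3' end is `end`; on the - strand it is `start`.
    """
    if strand == "+":
        return (chrom, max(0, end - width), end, strand)
    if strand == "-":
        return (chrom, start, start + width, strand)
    raise ValueError(f"unknown strand: {strand!r}")

plus_window = gene_end_window("chr1", 1000, 5000, "+", 200)
minus_window = gene_end_window("chr1", 1000, 5000, "-", 200)
print(plus_window)
print(minus_window)
```

All windows built this way have the same size unless they are clipped at the chromosome start.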

Also, generate the control set multiple times (say 1000) and repeat the comparison for each one. You can then calculate the mean and standard deviation of the overlap across those 1000 permutations. This will ensure that the effect that you see (or not) is stable.
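
The repeated-shuffle procedure can be sketched in plain Python as follows. This is a toy under stated assumptions, not a bedtools workflow: intervals are (chrom, start, end) tuples, the genome and data are invented, every chromosome is assumed long enough to hold every interval, and the empirical p-value with a +1 correction is one common choice rather than something prescribed in this thread.

```python
import random
import statistics

def count_overlaps(query, reference):
    """Number of query intervals that overlap at least one reference interval."""
    return sum(
        any(qc == rc and qs < re_ and rs < qe for rc, rs, re_ in reference)
        for qc, qs, qe in query
    )

def shuffle_once(intervals, genome, rng):
    """Relocate every interval to a random position, keeping its length."""
    out = []
    for _chrom, start, end in intervals:
        length = end - start
        chrom = rng.choice(sorted(genome))
        new_start = rng.randrange(0, genome[chrom] - length + 1)
        out.append((chrom, new_start, new_start + length))
    return out

def permutation_test(query, reference, genome, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = count_overlaps(query, reference)
    null = [
        count_overlaps(shuffle_once(query, genome, rng), reference)
        for _ in range(n_perm)
    ]
    mean = statistics.mean(null)
    sd = statistics.stdev(null)
    z = (observed - mean) / sd if sd > 0 else float("nan")
    # One-sided empirical p-value: how often a shuffle overlaps at least as much.
    p = (1 + sum(x >= observed for x in null)) / (n_perm + 1)
    return {"observed": observed, "mean": mean, "sd": sd, "z": z, "p": p}

genome = {"chr1": 100_000}
reference = [("chr1", s, s + 100) for s in (1_000, 20_000, 40_000, 60_000, 80_000)]
query = [("chr1", s + 10, s + 110) for s in (1_000, 20_000, 40_000, 60_000, 80_000)]
result = permutation_test(query, reference, genome)
print(result)
```

With real data the shuffling step would be done by the tool of your choice; only the bookkeeping around it is shown here.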

modified 2.4 years ago • written 2.4 years ago by A. Domingues

Yes, there is a bias in the experiment. Thank you so much for your help!

written 2.4 years ago by blur
bernatgel wrote:

If you can use R, you could use regioneR to test whether there is a significant overlap between your dataset and the one from the publication. It performs the whole process explained by @fridaymeetssunday: it randomizes the regions a number of times (1000), computes the overlaps and their mean and standard deviation, and finally reports a p-value and a z-score (and a plot if you need it).

The package has different options and parameters and you should select the randomization strategy according to your needs (if working with genes, probably resampling instead of randomizing completely, or restricting the randomization space with a stringent mask). In the package vignette you can find more information and examples.
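
To illustrate what resampling means here, as opposed to placing regions anywhere in the genome, the following Python sketch draws the control set from a fixed universe of candidate regions (for example, all gene ends). It shows the idea only and is not regioneR code; the function name and the universe are hypothetical.

```python
import random

def resample_regions(universe, n, rng):
    """Draw n distinct regions from a fixed universe of candidate regions."""
    if n > len(universe):
        raise ValueError("cannot draw more regions than the universe contains")
    return rng.sample(universe, n)

rng = random.Random(1)
# Hypothetical universe: one 200 bp gene-end window every 5 kb on chr1.
universe = [("chr1", s, s + 200) for s in range(0, 100_000, 5_000)]
control = resample_regions(universe, 5, rng)
print(control)
```

Every control region is then a real gene end, so the null distribution reflects "any set of gene ends" rather than "any place in the genome".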

NOTE: right now regioneR's randomization is not strand specific, so you should take this into account if you need strand specific random regions.

modified 2.4 years ago • written 2.4 years ago by bernatgel

I was not aware of this package. From a (very) brief read, it looks very useful. Thanks.

Edit: the creation of random regions appears to be strand-agnostic. Is this correct?

modified 2.4 years ago • written 2.4 years ago by A. Domingues

Yes, I forgot to add this to my answer. Strand specific randomization is in the pipeline but not ready yet.

It is possible to do it in a strand-specific way right now by defining a custom randomization function that internally randomizes each strand separately. If you think strand-specific randomization would be an important feature for you, please contact me and we'll try to speed it up.
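
The "randomize each strand separately" idea can be sketched in Python as follows. This is not regioneR code and does not use its API: the function name, the per-strand `allowed` space, and the example data are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def randomize_by_strand(regions, allowed, rng):
    """Relocate (chrom, start, end, strand) regions within a strand-specific space.

    `allowed` maps a strand to the (chrom, start, end) intervals in which
    regions of that strand may be placed.
    """
    by_strand = defaultdict(list)
    for region in regions:
        by_strand[region[3]].append(region)

    out = []
    for strand, group in by_strand.items():
        for _chrom, start, end, _strand in group:
            length = end - start
            # Only allowed intervals large enough to hold the region qualify.
            fits = [a for a in allowed[strand] if a[2] - a[1] >= length]
            if not fits:
                raise ValueError(f"no allowed interval fits {length} bp on {strand}")
            a_chrom, a_start, a_end = rng.choice(fits)
            new_start = rng.randrange(a_start, a_end - length + 1)
            out.append((a_chrom, new_start, new_start + length, strand))
    return out

rng = random.Random(2)
regions = [("chr1", 100, 300, "+"), ("chr1", 700, 800, "+"), ("chr1", 50, 150, "-")]
allowed = {
    "+": [("chr1", 0, 5_000), ("chr2", 1_000, 3_000)],
    "-": [("chr1", 6_000, 9_000)],
}
randomized = randomize_by_strand(regions, allowed, rng)
print(randomized)
```

The output keeps the strand counts and region lengths of the input, so a strand bias in the data is carried over into every randomized set.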

written 2.4 years ago by bernatgel

Sorry for the delay in answering. I am not interested in randomization according to strand for any specific purpose at the moment; I just thought this was an important feature missing from the package. For instance, the OP's data is strand biased, and this is not, in my experience, uncommon.

written 2.4 years ago by A. Domingues