Question: Any idea about validating bedtools shuffle did what I desired?
2
gravatar for morovatunc
4.1 years ago by
morovatunc460
Turkey
morovatunc460 wrote:

Hi,

I have obtained binding locations of a specific protein on DNA by CHIP-seq. My hypothesis is that within those regions in specific cancer type, I expect to see higher mutation rate than random regions. I think, bootstrapping could solve the solution by comparing random vs my bed file.

For this reason, I searched through biostars and I found that bedtools shuffle is a viable option.

I generated these randomly shuffled bed files with this specific command.

 shuffle -chrom -i  highconf_250wider.bed  -g /Users/morova/bedtools/GRCh37.genome.txt

Is there a way to validate that shuffling executed as I expected. For example there is also ChromFirst option. Is there a statistical method that you guys should suggest me ?

Thank you very much,

Best regards,

Tunc.

chip-seq bedtools • 1.3k views
ADD COMMENTlink modified 4.1 years ago by Ryan Dale4.9k • written 4.1 years ago by morovatunc460
3

Testing for randomness is one of the hardest things to do in statistics - because one of the assumptions of true randomness is that anything can happen! Even very unlikely outcomes are essentially guaranteed to happen some proportion of the time. Practically this means that even if I provide you with the tools to visualize the before/after shuffled distribution, we won't be able to say 'nope, that distribution is too weird, can't be random' - because that by definition is something that can happen with a random shuffle.

There are methods to say, with some confidence interval, how random we think the shuffling is - but all those methods will require multiple independant shuffles. How many shuffles? Well, the less random the shuffle truly is, the fewer independant shuffles you'll need to be confident it's not random. But on the assumption that bedtools IS a good, random shuffle.... well.... a lot of shuffles. The tools commonly used are the Ent suite (http://www.fourmilab.ch/random/) and the Diehard suite (http://stat.fsu.edu/pub/diehard/).

A much better way of actually doing this is to look at the code, and make sure that all shuffling steps are known to be sufficiently random. I can't remember what language bedtools is in, I think python, so hopefully they used some generic shuffle function known to work well on a generic data structure. Note though that occasionally it's the case that, due to the insufficient granularity of the numbers from the random number generator, not every possible shuffle outcome is actually possible, so a code review is really the best way to detect these sorts of issues.

ADD REPLYlink written 4.1 years ago by John12k
1

First of all thank you for the detailed answer. I appreciate that. If I wasn't shuffling chromosome locations, I would have draw a distribution graph of the numbers and check if it uniform(or looks similar to what it was aimed). On top of that, since I use —chrom option, it keeps the randomization event on the same chromosome. These two specific things makes me wonder if that randomisation is right or wrong.

ADD REPLYlink written 4.1 years ago by morovatunc460

No problems, I find myself sitting in a dull, grey waiting-room all day today, so plenty of time to help out :)

So I just looked at what bedtools shuffle actually does, and it doesn't do anything like what I expected. I thought it would shuffle the cards, but not change the cards themselves.... but it doesn't shuffle the order of the intervals in the file, it literally changes the interval start positions while maintaining their size. Why this is called a shuffle rather than a permutation I have no idea -_-; This is relevant because for a shuffle, the entire end-of-shuffle result is 1 data point for the randomness of the shuffle. For a permutation like this, each new position of an interval is 1 datapoint for the randomness of the shuffle, so you won't need to do multiple "shuffles".

So on that note, my previous advice doesn't really apply. You can test if this is random in a single "shuffle" by plotting the distribution as you say, but it's raised SO many questions in my head unrelated to the randomness of the shuffle. Can shuffled data include overlapping intervals? If not, we've got a space-filling-search problem, which is REALLY hard to make random. Really hard. If overlapping regions are OK then at least we can have some proper randomness in the new positions, but if your input data doesn't allow for overlapping intervals, this isn't fair.

I don't think using --chrom will cause much of a problem. Some chromosomes will have different interval sizes/distributions than others, and --chrom will maintain that I guess.

But this is just one of those things where, frankly, I wouldn't go down this path at all. Whatever your regions are for, it's guaranteed they could not possibly appear randomly throughout the genome. You've got centromeres/telomeres/gene deserts, long repeats, etc etc. To ignore all that will screw everything up. But I don't have any ideas on how to do it better...

Perhaps, if you're looking for motif enrichment locations or something, you could use another organism's genome that doesn't have the motif/transcription factor as your random shuffle? I don't know. I just know that taking a series of non-overlapping, non-random intervals, and then randomly moving them around, perhaps with overlaps, to a genome that has a variable amount of possibility for your features to exist at all, is not bootstrapping :-/

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by John12k
3
gravatar for Ryan Dale
4.1 years ago by
Ryan Dale4.9k
Bethesda, MD
Ryan Dale4.9k wrote:

Relevant paper: The dilemma of choosing the ideal permutation strategy while estimating statistical significance of genome-wide enrichment

ADD COMMENTlink written 4.1 years ago by Ryan Dale4.9k

Sir thank you very much for the article. I will definitely take a look at it.

ADD REPLYlink written 4.1 years ago by morovatunc460
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1666 users visited in the last hour