Question: What Is A Whole Genome Background In Analysis Of Motifs Or Peaks?
Curiosity wrote (8.8 years ago):

Why do peak analysis and motif analysis most often use a whole-genome background when there is no control to compare against?

I ran 20k peaks through motif analysis and picked 5000 target sequences and 40k background sequences. Why are the numbers different? Does this affect the p-values (% of target sequences that have motif X versus % of background sequences that have motif X)?

Larry_Parnell (Boston, MA USA) wrote (8.8 years ago):

Yes, the numbers analyzed will affect the p-value. The p-value measures how likely an enrichment at least as strong as the observed one would be by chance, and it changes with the number of sequences you test and the number you compare them against. That you "picked" 5000 targets and 40000 background sequences may mean that you have introduced a bias. Can you satisfactorily show that those sequences were selected at random? Using the whole genome as background removes that bias. It can also be argued that a peak or motif could occur anywhere in the genome. After all, the last few years of results on the control of transcription, and on binding sites for the proteins that regulate that process, indicate that binding sites can exist anywhere, probably because much more of the genome is transcribed than was once thought.
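To illustrate how the counts alone move the p-value, here is a minimal sketch using a generic upper-tail hypergeometric test. This is an assumption chosen for illustration, not necessarily the statistic HOMER or any other tool computes, and the function name and the example counts are made up. Both calls compare 10% of targets against 5% of background; only the number of sequences differs.

```python
from math import comb

def hypergeom_enrichment_p(target_hits, n_target, background_hits, n_background):
    """Upper-tail hypergeometric p-value for motif enrichment in the targets.

    Hypothetical helper: treats targets + background as one pool and asks how
    likely it is to see at least `target_hits` motif-containing sequences when
    drawing `n_target` sequences from that pool at random.
    """
    total = n_target + n_background            # all sequences in the pool
    total_hits = target_hits + background_hits  # all sequences with the motif
    upper = min(n_target, total_hits)           # largest possible hit count
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_target - k)
        for k in range(target_hits, upper + 1)
    )
    return tail / comb(total, n_target)

# Same percentages (10% of targets vs 5% of background), different sizes.
p_small = hypergeom_enrichment_p(5, 50, 20, 400)
p_large = hypergeom_enrichment_p(50, 500, 200, 4000)
print(f"small sets: p = {p_small:.3g}")
print(f"large sets: p = {p_large:.3g}")
```

With the larger sets the same percentage difference yields a much smaller p-value, which is why the target and background counts matter even when the percentages look identical.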


Thanks, Larry. So picking 5000 targets and 40k background sequences is normal? I used HOMER for this analysis.

Reply written 8.8 years ago by Curiosity
Ian (University of Manchester, UK) wrote (8.8 years ago):

I realise this has already been answered, but I have found that using 'mappable' regions of the genome is a good way of selecting control sequences. This is because not all areas of the genome can be sequenced.

HERE is an example using the UCSC hg18 human genome.


I don't quite understand your answer. Could you please elaborate? Or do you mean that there are sequences that are missed by sequencing machines, so that we can use them as a good background?

Reply written 8.8 years ago by Curiosity

It is a bit of an assumption, but if a conservative mapping of NGS data results in uniquely mapping reads, then 'mappable' regions of the genome are good for constraining the genome space when selecting "random" regions to those areas that can be sequenced. It seems to me that selecting regions from the entire genome is wrong, as there are parts of the genome that will never be satisfactorily sequenced or correctly mapped. Sorry for the ramble, I hope that helps explain my thinking.

Reply written 8.8 years ago by Ian
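As a rough sketch of the idea of drawing "random" control regions only from mappable parts of the genome, the snippet below samples fixed-width regions that lie entirely inside a list of mappable intervals. The intervals, the function name and the region width are all made up for illustration; in practice the intervals would come from a mappability track for your genome build, and coordinates are assumed to be BED-style (0-based start, exclusive end).

```python
import random

def sample_background(mappable, width, n, seed=0):
    """Draw n regions of `width` bp that lie fully inside mappable intervals.

    `mappable` is a list of (chrom, start, end) tuples. Hypothetical helper,
    not part of any published tool.
    """
    rng = random.Random(seed)
    # Discard intervals too short to contain a whole region.
    usable = [(c, s, e) for c, s, e in mappable if e - s >= width]
    if not usable:
        raise ValueError("no mappable interval is at least `width` bp long")
    # Weight each interval by its number of valid start positions, so every
    # possible region is equally likely to be drawn.
    weights = [e - s - width + 1 for _, s, e in usable]
    regions = []
    for chrom, start, end in rng.choices(usable, weights=weights, k=n):
        left = rng.randint(start, end - width)  # randint is inclusive
        regions.append((chrom, left, left + width))
    return regions

# Toy mappable intervals (made-up coordinates).
toy_mappable = [("chr1", 10_000, 50_000), ("chr1", 80_000, 80_150), ("chr2", 0, 30_000)]
background = sample_background(toy_mappable, width=200, n=1000)
print(background[:3])
```

Regions can overlap each other in this sketch; whether that is acceptable depends on the downstream analysis.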