Question

Meme: What is the optimal sequence length?

7.4 years ago

BioinfGuru ★ 1.7k

Hi all,

I am trying to detect novel and known motifs in a set of 43 genes that are tissue specific.

What is the optimal length of upstream sequence per gene to include in the FASTA file for MEME?
Also, is n=43 enough for significance? I can increase n if needed.

Previous attempts:

1) first nucleotide is 50 positions downstream of TSS, last nucleotide is 200 positions upstream of TSS, total length 250
2) first nucleotide is 50 positions downstream of TSS, last nucleotide is 400 positions upstream of TSS, total length 450
3) first nucleotide is 50 positions downstream of TSS, last nucleotide is 600 positions upstream of TSS, total length 650
4) first nucleotide is 50 positions downstream of TSS, last nucleotide is 800 positions upstream of TSS, total length 850
5) first nucleotide is 50 positions downstream of TSS, last nucleotide is 1000 positions upstream of TSS, total length 1050

Each of these attempts returns either no motifs or poor results - even the short run of 250 length close to the TSS. I'd have expected to produce something from that short run, so I just don't trust that I'm doing it right.

Now assume for a moment that the optimal length is 100. To identify all motifs within 1 kb upstream of the TSS, should I actually be doing the following:

1) first nucleotide is 50 positions downstream of TSS, last nucleotide is 49 positions upstream of TSS, total length 100
2) first nucleotide is 50 positions upstream of TSS, last nucleotide is 149 positions upstream of TSS, total length 100
3) first nucleotide is 150 positions upstream of TSS, last nucleotide is 249 positions upstream of TSS, total length 100
4) first nucleotide is 250 positions upstream of TSS, last nucleotide is 349 positions upstream of TSS, total length 100
5) first nucleotide is 350 positions upstream of TSS, last nucleotide is 449 positions upstream of TSS, total length 100
and so on...
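
As a minimal sketch of the tiling scheme listed above (not part of MEME itself), the window coordinates can be generated in Python. The function name `tile_windows` and the convention of counting positions upstream of the TSS, with negative values meaning downstream, are assumptions made for illustration.

```python
def tile_windows(first, last, width, step=None):
    """Return inclusive (start, end) windows between `first` and `last`.

    Coordinates are 'positions upstream of the TSS'; negative values
    mean downstream (an assumed convention, chosen for illustration).
    """
    if step is None:
        step = width  # non-overlapping tiles by default
    windows = []
    start = first
    # Keep only windows that fit entirely inside the requested region.
    while start + width - 1 <= last:
        windows.append((start, start + width - 1))
        start += step
    return windows


# 50 positions downstream of the TSS out to 1000 positions upstream.
windows = tile_windows(-50, 1000, 100)
for number, (start, end) in enumerate(windows, 1):
    print(number, start, end)
```

Passing a `step` smaller than `width` would give overlapping windows instead of adjacent ones.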

I am currently going through the previous meme questions and can't find an answer to this as yet apart from "Which is why the recommendation [for meme-chip] is short sequences of less than 500bp" here. Also, here: "DREME works best with lots of short (~100bp) sequences". So, I am starting to think that the shorter the better.

If someone could give me a definitive answer with a reference that would be great, but even some experienced advice would help, because the amount of time I'm wasting on big MEME runs is just silly now.

Thanks all in advance, Kenneth.

EDIT: I don't want to confuse the scope of the question. It is only MEME I am considering, not MEME-ChIP. I am aware of what MEME-ChIP does to long sequences, but that is not my concern. The question is only: what is the optimal input length for MEME?

EDIT2: The original paper gives some clues in the section "Sensitivity to noise", but I'd still like some input from those who are experienced with using MEME.

EDIT3: I have managed to find many significant motifs by using a sequence length of 100. I first create multiple input files. Each file contains 100 nucleotides from each gene, all at the same distance from the TSS, e.g. file 1 = 43 x (0-100 nucleotides upstream of the TSS), file 2 = 43 x (80-160 nucleotides upstream of the TSS), so all files overlap. Using this as a basis, I have identified 46 significant motifs with MEME (p<0.05). The runtime is also dramatically reduced. It appears I have my answer unless anyone knows any better.
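
A rough sketch of how such overlapping input files could be built in Python follows. The post's example offsets (0-100, then 80-160) do not pin down the exact scheme, so the width of 100 and step of 80 used here are assumptions, as are the helper names `overlapping_windows` and `window_fasta` and the toy sequences. It also assumes each promoter has already been extracted as a 5'->3' string whose last base sits immediately upstream of the TSS.

```python
def overlapping_windows(total, width, step):
    """Return (near, far) distances upstream of the TSS, far exclusive."""
    return [(s, s + width) for s in range(0, total - width + 1, step)]


def window_fasta(promoters, near, far):
    """Build FASTA text with the same upstream window from every gene.

    `promoters` maps gene name -> upstream sequence written 5'->3',
    ending at the base immediately upstream of the TSS (assumed layout).
    """
    records = []
    for gene, seq in promoters.items():
        n = len(seq)
        if far > n:
            raise ValueError(f"{gene}: only {n} upstream bases available")
        # Distances are measured back from the TSS end of the string.
        chunk = seq[n - far:n - near]
        records.append(f">{gene}_up{near}_{far}\n{chunk}")
    return "\n".join(records) + "\n"


# Toy data standing in for the real promoter sequences.
promoters = {"geneA": "A" * 900 + "C" * 100, "geneB": "G" * 1000}

files = {}
for near, far in overlapping_windows(1000, 100, 80):
    files[f"window_{near}_{far}.fa"] = window_fasta(promoters, near, far)

print(len(files), "FASTA files prepared")
```

Each entry in `files` could then be written to disk and given to MEME as a separate run, one run per window.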

meme sequence length fasta • 3.4k views

ADD COMMENT • link updated 6.9 years ago by pythiest • 0 • written 7.4 years ago by BioinfGuru ★ 1.7k

This article says the following:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4175909/

"Limitations on sequence length and number

The MEME-ChIP web server supports analysis of data sets of up to 50 Mb, but it performs some of its analyses on subsets of these data. Most notably, it performs motif discovery (using MEME and DREME) on the central 100 bp of sequences, and MEME uses only 600 sequences. Using the central 100 bp works very well with ChIP-seq and CLIP-seq data, but a different length may be preferable for other applications. The sampling of 600 sequences for MEME is necessary to limit CPU usage per MEME-ChIP job on the (free) web server. If you wish to change either of these aspects of MEME-ChIP, you can do so if you install and run MEME-ChIP on your own computer (Box 4)."

But it seems to me you have already seen it...

ADD REPLY • link 7.4 years ago by natasha.sernova ★ 4.0k

Thanks, but I was aware of that already. I don't want to confuse the discussion with references to MEME-ChIP. It is only MEME I am considering.

ADD REPLY • link 7.4 years ago by BioinfGuru ★ 1.7k

score 0 · Answer 1 · 2017-05-31

From Bailey & Elkan "Fitting a mixture model by expectation maximization to discover motifs in biopolymers":

"5.3 Sensitivity to noise ...Related to this problem is the fact that the motif occurrences may be short compared to the length of the sequences in the dataset. The longer the sequences, the more difficult it will be for MEME+ to discover the relevant motif occurrences. In this sense, all non-motif-occurrence positions in the dataset can be thought of as noise..."

In other words, the shorter the better, as long as you can be fairly certain that most of your target motifs are still captured. Additional sequence length adds not only random noise, but also the chance that other biological motifs, unrelated to the ones you are looking for, are present in your sample.
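
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes exactly one motif occurrence per sequence and a motif width of 10, both hypothetical values chosen for illustration and not taken from the paper; `candidate_starts` is likewise an illustrative helper, not a MEME function.

```python
def candidate_starts(seq_len, motif_width):
    """Number of positions at which a motif of this width could start."""
    return max(seq_len - motif_width + 1, 0)


MOTIF_WIDTH = 10  # hypothetical motif width

for seq_len in (100, 250, 1050):
    starts = candidate_starts(seq_len, MOTIF_WIDTH)
    # With one true site per sequence, every other start is "noise".
    print(f"length {seq_len}: 1 true site among {starts} candidate starts")
```

Under these assumptions, going from 100 to 1050 bases per sequence leaves the single true site competing with roughly ten times as many non-motif positions.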