Question: Repetitive motifs on Illumina 450k derived sequences
gbayon (Spain) wrote 4.7 years ago:

Hi everybody,

As part of some recent analyses using the Illumina 450k DNA methylation microarrays, I have been running the MEME suite to find significant motifs in some subsets of Differentially Methylated Probes (DMPs). The problem is that I keep getting strange results: the same motifs come out again and again.

Using the FDb.InfiniumMethylation.hg19 and the BSgenome.Hsapiens.UCSC.hg19 R/Bioconductor packages, I generate DNA sequences of 200bp length centered on the probes being processed and save them as FASTA files. Afterwards, I feed them to MEME and wait for the results.
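To make the windowing step concrete, here is a minimal Python sketch of extracting 200 bp sequences centered on probe positions and formatting them as FASTA. It is only an illustration and not the R/Bioconductor code used above: the toy chromosome, the probe names, and the coordinates are made up, and `centered_window` and `to_fasta` are hypothetical helpers.

```python
def centered_window(position, width=200):
    """Return a 0-based, half-open (start, end) window of `width` bp
    centered on a 1-based genomic position (clamped at the chromosome start)."""
    half = width // 2
    start = max(position - 1 - half, 0)
    return start, start + width


def to_fasta(records, line_width=60):
    """Format (name, sequence) pairs as FASTA text."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        for i in range(0, len(seq), line_width):
            lines.append(seq[i:i + line_width])
    return "\n".join(lines) + "\n"


# Toy 400 bp "chromosome" and made-up probe positions (assumptions for illustration).
chrom = "ACGT" * 100
probes = {"cg_example_1": 150, "cg_example_2": 300}

records = []
for name, pos in probes.items():
    start, end = centered_window(pos)
    records.append((name, chrom[start:end]))

fasta_text = to_fasta(records)
```

Windows that run past the end of a chromosome would come out shorter than 200 bp in this sketch, so a real pipeline would need to decide whether to trim or drop them.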

Some motifs were appearing for every subset we were testing. Specifically, the most common motifs were repetitive sequences of the same nucleotide (polyA, polyC, polyG, polyT). This raised some suspicions, so we decided to try the motif finding procedure on two subsets containing 300 and 1000 random 450k probes. Problem is, the same motifs appeared again.
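One quick sanity check, independent of MEME, is to measure how often long single-nucleotide runs occur in the input sequences themselves. The sketch below is a hypothetical helper in Python (the sequences and the `min_len` threshold are arbitrary choices for illustration); comparing the fraction between a DMP subset and a random probe subset would show whether the runs are specific to the subset or common to all probe neighbourhoods.

```python
import re


def longest_homopolymer(seq):
    """Return (base, length) of the longest single-nucleotide run in seq."""
    best_base, best_len = "", 0
    # (.)\1* matches each maximal run of one repeated character.
    for match in re.finditer(r"(.)\1*", seq.upper()):
        run = match.group(0)
        if len(run) > best_len:
            best_base, best_len = run[0], len(run)
    return best_base, best_len


def fraction_with_run(seqs, min_len=8):
    """Fraction of sequences containing a homopolymer run of at least min_len."""
    hits = sum(1 for s in seqs if longest_homopolymer(s)[1] >= min_len)
    return hits / len(seqs)


# Made-up example sequences (not real probe neighbourhoods).
seqs = ["ACG" + "T" * 9 + "AC", "ACGTACGTACGT", "G" * 10 + "AT"]
fraction = fraction_with_run(seqs, min_len=8)
```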

So, it seems that those motifs are somehow present around the 450k probes. Is this a probe design consequence? I am also wondering if the MEME parameters could be behind these results. I am currently running with the following options:

meme {input.fasta} -dna -nmotifs 10 -evt 0.01 -maxw 50 -maxsize 10000000

I am wondering whether the background distribution of nucleotides in the vicinity of 450k probes violates the statistical assumptions of the MEME algorithm.
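As a rough illustration of what a nucleotide background looks like, the Python sketch below computes 0-order (single-nucleotide) frequencies from a set of sequences and writes them as simple `letter frequency` lines. The sequences are made up and the helper names are hypothetical. To my knowledge the MEME suite ships a `fasta-get-markov` tool that produces such a model, and `meme` accepts a background file through `-bfile`, but the exact file format and options should be checked against the documentation of the installed version.

```python
from collections import Counter


def zero_order_background(seqs, alphabet="ACGT"):
    """Estimate single-nucleotide frequencies, ignoring non-ACGT symbols."""
    counts = Counter()
    for s in seqs:
        counts.update(b for b in s.upper() if b in alphabet)
    total = sum(counts[b] for b in alphabet)
    return {b: counts[b] / total for b in alphabet}


def format_background(freqs):
    """Render frequencies as 'letter frequency' lines (assumed file layout)."""
    lines = ["# order 0"]
    lines += [f"{b} {freqs[b]:.3e}" for b in sorted(freqs)]
    return "\n".join(lines) + "\n"


# Made-up GC-rich sequences standing in for all 450k probe neighbourhoods.
seqs = ["GGCGCGCCAT", "CCGGCGTTAA"]
background = zero_order_background(seqs)
background_text = format_background(background)
```

Building the background from the neighbourhoods of all 450k probes, rather than from the subset under test, is the design choice that matters here: it lets the motif finder score enrichment relative to what is typical for the array.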

Has anybody here experienced a similar problem? Any help or hint would be much, much appreciated.

EDIT: I am including a screenshot of MEME's output to show what the motifs look like:


Ventrilocus (Netherlands, Rotterdam, ErasmusMC) wrote 10 weeks ago:

Dear gbayon,

What you are seeing here is probably the sum of two biases:

1) Given i) bisulfite conversion, ii) whole-genome amplification and iii) the fact that 450K probes may target either the plus or the minus strand, unmethylated C's on the plus strand become T's and the corresponding G's on the minus strand become A's. As a result, you face reduced complexity: the plus strand is largely restricted to {A, T, G} and the minus strand to {A, C, T}.

2) 450K is biased towards CpG islands, which are expected to be [G+C]-rich regions. Low [A+T] content again translates into reduced complexity.
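Both biases can be illustrated with a small Python sketch. It assumes fully unmethylated DNA (every C is converted), uses made-up sequences, and the function names are hypothetical. The CpG observed/expected ratio follows the commonly used formula (number of CpG × length) / (number of C × number of G).

```python
def bisulfite_convert(seq, strand="+"):
    """In-silico conversion assuming no methylation at all:
    C->T for the plus strand, G->A for the minus strand (plus-strand coordinates)."""
    seq = seq.upper()
    if strand == "+":
        return seq.replace("C", "T")
    if strand == "-":
        return seq.replace("G", "A")
    raise ValueError("strand must be '+' or '-'")


def gc_content(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio; 0.0 when C or G is absent."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


# Made-up sequence: the alphabet shrinks from four letters to three after conversion.
seq = "ACGTCCGGA"
plus_converted = bisulfite_convert(seq, "+")
minus_converted = bisulfite_convert(seq, "-")
plus_alphabet = sorted(set(plus_converted))
minus_alphabet = sorted(set(minus_converted))

# Made-up CpG-island-like vs AT-rich sequences.
island_like = "CGCGCGATAT"
at_rich = "ATATATATGC"
```

A three-letter or strongly GC-skewed alphabet makes low-complexity stretches far more likely by chance, which is why a motif finder without a matched background may report them as significant.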

I am not familiar with the MEME suite. However, in tools like Homer you normally supply a background set as well. Using a background built from all 450K probes should correct for those biases. In any case, those false discovery rates are inflated by other factors (for further reading go to: http://homer.ucsd.edu/homer/motif/)

Best, Ben.
