Question

Repetitive motifs on Illumina 450k derived sequences

8.3 years ago

gbayon ▴ 170

Hi everybody,

As part of some recent analyses using the Illumina 450k DNA Methylation microarrays, I have been running the MEME suite to find significant motifs in subsets of Differentially Methylated Probes (DMPs). The problem is that I get strange results: the same motifs keep coming out again and again.

Using the FDb.InfiniumMethylation.hg19 and BSgenome.Hsapiens.UCSC.hg19 R/Bioconductor packages, I generate 200 bp DNA sequences centered on the probes being processed and save them as FASTA files. Afterwards, I feed them to MEME and wait for the results.
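For readers who want to reproduce the windowing step outside R, here is a minimal Python sketch of the same idea. It is not the Bioconductor code used above: the function names, the toy chromosome string, and the probe name are all hypothetical, and positions are assumed to be 0-based.

```python
def centered_window(chrom_seq, center, width=200):
    """Return a window of up to `width` bp centered on a 0-based position.

    The window is clipped at the sequence ends, so it can be shorter
    than `width` for positions close to either end.
    """
    half = width // 2
    start = max(0, center - half)
    end = min(len(chrom_seq), center + half)
    return chrom_seq[start:end]


def to_fasta(named_seqs, line_width=60):
    """Format (name, sequence) pairs as FASTA text with wrapped lines."""
    lines = []
    for name, seq in named_seqs:
        lines.append(">" + name)
        for i in range(0, len(seq), line_width):
            lines.append(seq[i:i + line_width])
    return "\n".join(lines) + "\n"


# Toy data standing in for a real chromosome and a real probe coordinate.
toy_chrom = "ACGT" * 100  # 400 bp
window = centered_window(toy_chrom, center=200)
fasta_text = to_fasta([("cg_hypothetical_probe", window)])
```

The resulting text can be written to a file and passed to MEME in place of `{input.fasta}`.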

Some motifs appeared for every subset we tested. Specifically, the most common motifs were runs of a single nucleotide (polyA, polyC, polyG, polyT). This raised some suspicions, so we ran the motif finding procedure on two control subsets containing 300 and 1000 random 450k probes. The same motifs appeared again.
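One quick way to check whether the input sequences themselves are rich in single-nucleotide runs, independently of MEME, is to look at base composition and the longest homopolymer per sequence. The sketch below is only an illustration with hypothetical function names and toy sequences; it is not part of the pipeline described above.

```python
from collections import Counter
from itertools import groupby


def base_composition(seqs):
    """Fraction of A, C, G and T over all sequences (other symbols ignored)."""
    counts = Counter()
    for s in seqs:
        counts.update(ch for ch in s.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {b: 0.0 for b in "ACGT"}
    return {b: counts[b] / total for b in "ACGT"}


def longest_homopolymer(seq):
    """Return (base, length) of the longest run of one repeated base."""
    best_base, best_len = "", 0
    for base, run in groupby(seq.upper()):
        n = sum(1 for _ in run)
        if n > best_len:
            best_base, best_len = base, n
    return best_base, best_len


# Toy sequences standing in for the 200 bp probe windows.
toy_seqs = ["AACC", "GGTT"]
composition = base_composition(toy_seqs)
run = longest_homopolymer("ACGTTTTTTGA")
```

If long runs show up in most windows of the random subsets as well, the polyN motifs are more likely a property of the sequences than of any particular DMP subset.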

So, it seems that those motifs are somehow present around the 450k probes. Is this a consequence of the probe design? I am also wondering if the MEME parameters could be behind these results. I am currently running with the following options:

meme {input.fasta} -dna -nmotifs 10 -evt 0.01 -maxw 50 -maxsize 10000000

I wonder whether the background distribution of nucleotides in the vicinity of 450k probes fails to meet the statistical assumptions of the MEME algorithm.

Has anybody here experienced a similar problem? Any help or hint would be much, much appreciated.

EDIT: I am including a capture of MEME's output to show what the motifs look like:

DNA Methylation Motifs Illumina 450k meme • 2.6k views

updated 3.8 years ago by Ventrilocus ▴ 180 • written 8.3 years ago by gbayon ▴ 170

score 0 · Answer 1 · 2020-07-13

Dear gbayon,

What you are seeing here is probably the sum of two biases:

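1) Bisulfite conversion: given i) bisulfite conversion, ii) whole genome amplification, and iii) the fact that 450K probes may target either the plus or the minus strand, unmethylated C's on the plus strand become T's and, read on the plus strand, minus-strand conversion turns G's into A's. As a result, you face reduced complexity: the converted plus strand is largely limited to {A, T, G} and the converted minus strand to {A, C, T}.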

2) CpG island bias: the 450K array is biased towards CpG islands, which are expected to be [G+C]-rich regions. Low [A+T] content again translates into reduced complexity.
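To make the first point concrete, here is a small in-silico sketch of the alphabet reduction. It assumes complete conversion of every unmethylated cytosine and expresses everything in plus-strand coordinates. The function name and the `methylated` argument are hypothetical and only for illustration.

```python
def bisulfite_convert(seq, strand="+", methylated=frozenset()):
    """Simulate full bisulfite conversion, written in plus-strand coordinates.

    strand "+": unmethylated C -> T.
    strand "-": the minus-strand C -> T change appears as G -> A on the plus strand.
    `methylated` holds 0-based positions that are protected from conversion.
    """
    old, new = ("C", "T") if strand == "+" else ("G", "A")
    return "".join(
        new if (base == old and i not in methylated) else base
        for i, base in enumerate(seq.upper())
    )


plus_converted = bisulfite_convert("ACGTCG", strand="+")
minus_converted = bisulfite_convert("ACGTCG", strand="-")
partly_methylated = bisulfite_convert("ACGTCG", strand="+", methylated={1})
```

With no methylated positions, the plus-strand result uses only A, T and G, and the minus-strand result only A, C and T, which is the three-letter alphabet described in point 1.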

I am not familiar with the MEME suite. However, in tools like HOMER, you normally supply a background set as well. Including a background built from all 450K probes should correct for those biases. In any case, the reported false discovery rates are also inflated by other factors (for further reading, see: http://homer.ucsd.edu/homer/motif/)
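The effect of a background can be sketched with a simple k-mer comparison. This is not HOMER's or MEME's actual algorithm, just an illustration of the principle with hypothetical function names and toy sequences: a k-mer that is equally common in the target set and in the background gets a log2 ratio near zero, however frequent it is. As far as I know, MEME also accepts a background model file through its `-bfile` option, but check the documentation for your version.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def log2_enrichment(target_seqs, background_seqs, k=4, pseudocount=1.0):
    """log2(target frequency / background frequency) per k-mer.

    A pseudocount is added to every k-mer seen in either set so that
    k-mers missing from one set do not cause a division by zero.
    """
    t = kmer_counts(target_seqs, k)
    b = kmer_counts(background_seqs, k)
    kmers = set(t) | set(b)
    t_total = sum(t.values()) + pseudocount * len(kmers)
    b_total = sum(b.values()) + pseudocount * len(kmers)
    return {
        km: math.log2(((t[km] + pseudocount) / t_total)
                      / ((b[km] + pseudocount) / b_total))
        for km in kmers
    }


# Toy example: both sets share a polyA run; only the target has CGCG.
scores = log2_enrichment(["AAAAAAAACGCG"], ["AAAAAAAATTTT"], k=4)
```

In this toy case the shared polyA 4-mer scores zero while CGCG scores positive, which is the behaviour you would want when the background is the full set of 450K probe windows.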

Best, Ben.