Question

Why do Illumina 850k/EPIC arrays ignore CpGs which are "GC" in the forward strand?

17 months ago

ning ▴ 120

CpGs are symmetrical, in that a CG sequence on the forward strand is paired with a GC on the reverse strand, and the dinucleotides on both strands are CpG dinucleotides which can be methylated. Conversely, CpGs can be GC on the forward strand but CG on the reverse strand.

FORWARD -> 5'--CG--3'  [OR]  5'--GC--3' <- FORWARD
REVERSE -> 3'--GC--5'        3'--CG--5' <- REVERSE

The assignment of "forward" and "reverse" strandedness is more or less arbitrary.

Given the above, why does it seem like the Illumina 850k (aka EPIC) array only profiles methylation from CpGs which are CG in the forward strand, while ignoring CpGs which are GC in the forward strand? I would also love to hear if my premises are wrong.

suppressPackageStartupMessages({
  library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
  library(tidyverse)
})
data(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)

# Count probes by the bracketed target dinucleotide in Forward_Sequence
IlluminaHumanMethylationEPICanno.ilm10b4.hg19 %>%
  getAnnotation() %>%
  as_tibble() %>%
  count(forward_seq = str_extract(Forward_Sequence, "\\[[ATCG]{2}\\]"))

# Results:
# # A tibble: 3 × 2
#   forward_seq      n
#   <chr>        <int>
# 1 [CA]          2922
# 2 [CG]        862927
# 3 [CT]            10

microarray illumina genome • 1.2k views

ADD COMMENT • link updated 17 months ago by Ram 43k • written 17 months ago by ning ▴ 120

I thought the information on the strand was contained in the strand variable:

anno <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
table(anno$strand)
#      -      +
# 432034 433825

Could it simply be that the Forward_Sequence variable contains the genomic 5'->3' sequence around that CG location, so that the sequence is always reported on the '+' strand?

As far as I know, a CpG site is one where C is followed by G when reading in the 5'->3' direction. So a site that reads 5'->3' GC on the (+) strand and 3'->5' CG on the (-) strand, like the second option in your diagram, is not a CpG site.

ADD REPLY • link 17 months ago by Papyrus ★ 2.9k

Papyrus I agree with your interpretation of the Forward sequence and strand variables. But why should CpG sites exclude 5' -> 3' GC (+) sites when biologically they are expected to behave just like 5' -> 3' CG (+) sites, since the designation of the forward strand is arbitrary?

ADD REPLY • link 17 months ago by ning ▴ 120

Hmm, I'm no expert, but I don't think 5'-CpG-3' and 3'-CpG-5' are sterically equivalent. They are different molecules, so, like any other genomic sequence motif recognized by enzymes, they may well behave differently (there's a nice figure on Wikipedia comparing CpG and GpC sites).

The designation of which strand is the "forward" strand is indeed arbitrary, but what is not arbitrary is that one end of the strand is 5', and the other is 3', and the CpG site is defined by reading in that direction, not by reading on the forward or reverse strand. 5'->3' directionality has biological/functional meaning such as in DNA replication or transcription.
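To make the directionality point concrete, here is a minimal base-R sketch (not from the array annotation itself). The helper name `revcomp` is hypothetical and introduced only for illustration. It reads the opposite strand in its own 5'->3' direction by taking the reverse complement, and checks that CG stays CG while GC stays GC.

```r
# Reverse complement of a DNA string, using base R only.
# `revcomp` is a hypothetical helper, not part of minfi or the annotation package.
revcomp <- function(seq) {
  # Reverse the bases so the opposite strand is read 5' -> 3'
  bases <- rev(strsplit(seq, "", fixed = TRUE)[[1]])
  # Complement each base (A<->T, C<->G)
  chartr("ACGT", "TGCA", paste(bases, collapse = ""))
}

# Both dinucleotides are written 5' -> 3' on the forward strand.
cpg <- "CG"
gpc <- "GC"

rc_cpg <- revcomp(cpg)  # opposite strand of a CpG, read 5' -> 3'
rc_gpc <- revcomp(gpc)  # opposite strand of a GpC, read 5' -> 3'

print(c(CpG = rc_cpg, GpC = rc_gpc))

# A CpG reads CG on both strands; a GpC reads GC on both strands.
stopifnot(rc_cpg == "CG", rc_gpc == "GC")
```

So whichever strand is called "forward", a CpG is a CpG on both strands, and a 5'-GC-3' never becomes a CpG by switching strands.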

ADD REPLY • link 17 months ago by Papyrus ★ 2.9k

Papyrus You're right, thank you! I had completely forgotten about the biochemistry of DNA and was only thinking in strings. As you have written, sequences are read 5' -> 3', so a GC on the forward strand is a GpC site, not a CpG site, and its reverse complement also reads GC.

ADD REPLY • link 17 months ago by ning ▴ 120

Glad to help! I also had to recall those concepts :)

ADD REPLY • link 17 months ago by Papyrus ★ 2.9k

I cross-posted this to Bioinformatics StackExchange: https://bioinformatics.stackexchange.com/q/20043/6520

ADD REPLY • link 17 months ago by ning ▴ 120