ChIP Sequencing For A Tiny Genome
10.2 years ago

I'll soon be sequencing my ChIP samples of point-source transcription factors (TFs) that I believe have an average-to-high number of binding sites throughout the genome compared with the "average TF". I am currently studying an organism, Oikopleura dioica, which has a very small genome for a chordate: 70 Mb.

The ENCODE guidelines for point-source TF ChIP sequencing call for "a minimum of 20 million uniquely mapped reads" (Landt *et al.,* 2012) in mammalian cells and a tenth of that for worms and flies, per factor (combining replicates). For the human genome that works out to a coverage of roughly 0.6× or 1.2×, and for worm 2× or 4×, depending on whether a 100 or 200 bp read length is used (they don't specify in the paper).

If I aim for, say, 8× coverage for the samples of my organism (70 Mb), I'd need to sequence 2.8 million 100 bp PE read pairs (or have that amount mappable, but let's simplify for now) - a total output of 560 Mb.
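As a quick sanity check of this arithmetic (the genome sizes are rough figures I'm assuming: human ~3.1 Gb, worm ~100 Mb, O. dioica ~70 Mb):

```python
# Back-of-the-envelope coverage arithmetic (plain Python, no dependencies).
# Genome sizes are rough assumed values, not authoritative figures.

def coverage(n_reads, bases_per_read, genome_size):
    """Coverage = total sequenced bases / genome size."""
    return n_reads * bases_per_read / genome_size

def read_pairs_for_coverage(target_cov, bases_per_pair, genome_size):
    """Read pairs needed to reach a target coverage."""
    return target_cov * genome_size / bases_per_pair

print(coverage(20e6, 100, 3.1e9))  # ENCODE human minimum, 100 bp -> ~0.6x
print(coverage(20e6, 200, 3.1e9))  # ENCODE human minimum, 200 bp -> ~1.3x
print(coverage(2e6, 100, 100e6))   # worm/fly, 100 bp             -> ~2x
print(coverage(2e6, 200, 100e6))   # worm/fly, 200 bp             -> ~4x

# O. dioica (70 Mb) at 8x with 100 bp PE reads (200 bases per pair):
print(read_pairs_for_coverage(8, 200, 70e6))  # ~2.8 million pairs = 560 Mb
```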


My samples are going to be run in a facility that has an Illumina HiSeq 2500. This instrument has an output capacity of around 150 million read pairs per lane, i.e. about 30 Gb with 200 cycles (100 bp PE). If I only use 2.8 million read pairs per sample, more than 50 samples could fit on a single lane using multiplexing. I'll only have about 9 samples for now, and I know there are some other small-genome samples being sequenced around the same time, but the machine is mostly used for mammalian samples.
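The same kind of back-of-the-envelope estimate for the lane (the per-lane figure is the rough capacity quoted above, not a verified spec):

```python
# How many samples fit on one lane at the per-sample depth estimated above?
lane_read_pairs = 150e6           # assumed HiSeq 2500 lane capacity (read pairs)
per_sample_read_pairs = 2.8e6     # 8x of a 70 Mb genome with 2x100 bp reads
print(lane_read_pairs // per_sample_read_pairs)  # ~53 samples, ignoring
                                                 # unmappable and duplicate reads
```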

Concerning the sequencing depth for my samples: is my general reasoning correct? Am I right to transpose the guidelines between organisms based on coverage? Concerning how to manage the sequencing: what's the ideal way to handle this? Should I sequence my samples more deeply rather than wait in a queue for other small-genome samples to fill a lane?

Thank you.

chip-seq illumina • 3.5k views
10.2 years ago

Genome coverage is not the right quantity to use when estimating and extrapolating ChIP-seq sequencing requirements. The amount of data required depends very heavily on the number of bound locations and their occupancy.

The numbers quoted above come from averaging over a large number of different factors, and they also correspond to large, repetitive, less densely packed genomes. It would not be surprising if they did not scale at all to very different genomes. You may need a lot more or a lot fewer reads.

The best way to approach this type of situation is to run a pilot study at fairly high coverage to catch even rare events, then evaluate the rates and coverages.
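One concrete way to evaluate the pilot is a saturation analysis: subsample the aligned reads at increasing fractions, call peaks at each depth, and check whether the peak count plateaus. A rough sketch, assuming samtools and MACS2 are available and that sorted BAM files named chip.bam/input.bam exist (all names are placeholders):

```python
# Saturation-analysis sketch for a pilot ChIP-seq run: subsample the ChIP BAM at
# increasing fractions, call peaks at each depth, and see whether the number of
# peaks levels off. File names and the sample name prefix are placeholders.
import subprocess

chip_bam, input_bam = "chip.bam", "input.bam"   # assumed pilot alignments
genome_size = "70000000"                        # O. dioica, ~70 Mb

for frac in (0.1, 0.25, 0.5, 0.75, 1.0):
    sub_bam = chip_bam if frac == 1.0 else f"chip_{frac}.bam"
    if frac < 1.0:
        # samtools view -s SEED.FRACTION draws a random subsample of the reads
        subprocess.run(["samtools", "view", "-b", "-s", f"42{str(frac)[1:]}",
                        "-o", sub_bam, chip_bam], check=True)
    # call peaks against the full input at each depth
    subprocess.run(["macs2", "callpeak", "-t", sub_bam, "-c", input_bam,
                    "-g", genome_size, "-n", f"pilot_{frac}"], check=True)
    with open(f"pilot_{frac}_peaks.narrowPeak") as peaks:
        print(frac, sum(1 for _ in peaks), "peaks")
```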

10.2 years ago
Ian 6.0k

I once had a set of samples for yeast (S. cerevisiae) with around 20 million reads per sample. It broke MACS (I think specifically the binomial calculation used to work out the optimum level of read redundancy). In the end I sampled down to 1.5 and 5 million reads. The yeast genome is ~12.1 million bases (12 Mb), so your calculation might be on the low side... I certainly second Istvan in that a pilot is needed. I would play it safe with 5-10 million reads (ChIP and input) and titrate down. Better to throw reads away than not have enough.
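For the titration itself, seqtk sample or samtools view -s will do the down-sampling; purely as an illustration of the idea, here is a pure-Python sketch that draws an exact number of reads from a FASTQ (file name and read count are placeholders, and for paired-end data the same selections would have to be applied to both mates):

```python
# Reservoir sampling of a FASTQ down to an exact number of reads.
# Holds the sampled records in memory, so it is a sketch rather than a production tool.
import itertools
import random

def sample_fastq(path, n_keep, seed=42):
    """Return n_keep FASTQ records (lists of 4 lines) drawn uniformly from the file."""
    random.seed(seed)
    reservoir = []
    with open(path) as fh:
        for i, rec in enumerate(iter(lambda: list(itertools.islice(fh, 4)), [])):
            if len(reservoir) < n_keep:
                reservoir.append(rec)
            else:
                j = random.randint(0, i)
                if j < n_keep:
                    reservoir[j] = rec
    return reservoir

# e.g. down-sample to 1.5 million reads
for rec in sample_fastq("chip_R1.fastq", 1_500_000):
    print("".join(rec), end="")
```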


I keep my reference for S. cerevisiae at ~5 million reads.
