Large difference in mapping times for ChIP data
2
0
Entering edit mode
5.7 years ago
John 13k

I have two sequencing runs:

We'll call them Input and H3K4 for short.

The FASTQs of Input: 56,823,632 reads
The  uBAM  of Input: 56,823,632 reads
The   BAM  of Input: 56,824,033 reads
H3K4: 28,799,788 reads

So far so good. Here are the mapping rates:

Input:

56824033 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
56456641 + 0 mapped (99.35%:-nan%)
56824033 + 0 paired in sequencing
54529248 + 0 properly paired (95.96%:-nan%)
56318612 + 0 with itself and mate mapped
138029 + 0 singletons (0.24%:-nan%)
1702848 + 0 with mate mapped to a different chr
1278811 + 0 with mate mapped to a different chr (mapQ>=5)

H3K4:

28799788 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
26885619 + 0 mapped (93.35%:-nan%)
28799788 + 0 paired in sequencing
26577370 + 0 properly paired (92.28%:-nan%)
26709350 + 0 with itself and mate mapped
176269 + 0 singletons (0.61%:-nan%)
124608 + 0 with mate mapped to a different chr
100145 + 0 with mate mapped to a different chr (mapQ>=5)

Again both good, and seemingly well-mapped. But here's the weird thing: the mapping for Input took 1hr 15min 54s, while the H3K4 sample took just 15min 44s. 1/5th of the time. In fact so quick im concerned it didn't even work, despite what flagstat is telling me.

Is this common? Should I be worried?
All mapping is happening at the same time, on the same server, which is no where near half-load for CPU or RAM.

BWA ChIP-Seq • 1.8k views
0
Entering edit mode

Are you using the cluster or have you copied these to a local disk?

0
Entering edit mode

Local disk, although its a small compute server (32cores, 64Gb ram) -- I'm trying to work out how long it might take on my local machine before moving it up to the cluster, and these widely-variable compute times make it tricky (although i just checked maximus and it has like 100 cores free and more bytes of ram than i can parse with my eyes :P - certainly would cut my mapping time to, like, 10% of the time if i used that! I just feel a bit weird using it because no one from Bioinfo. has ever given me their official blessing to use it, told me what constitutes as 'acceptable load', etc. So right now i'm on my own little island trying not to bother anyone :P

But really i'm just blown away by how variable mapping times take. I would have thought it was a straight forward function of number-of-reads. I guess Input is genome-wide and H3K4 is going to be many reads in the sameish locations....? Eh... I should really mark dupes and see if that correlates with mapping time. I wouldn't have thought so, but as today's seminar proves i'm frequently wrong! :)

EDIT: For a full list of the mapping times, stdout/etc, its all on http://log.bio -> view table -> Filters -> Date -> Today. The date-picker dropdown might not show up because I suck at everything web. Really looking forward to finishing this analysis off so i can finish off log.bio -_-

EDIT EDIT: Oh and get a PhD or whatever :P

1
Entering edit mode

I'm happy to dole out official blessings in return for an espresso :P Anyway, feel free to pop down to the office, we don't bite and you needn't be stranded.

My guess is also that this has to due with the peculiarities of H3K4. If you're really nervous, just put the samples through bowtie2 and see if you get similar results. multiBamSummary followed by plotCorrelation in deepTools should give telling results.

0
Entering edit mode

Hahah - ok, deal! :D

And I think you might be right - its something to do with the reads themselves being easier to map. I will come back to this post in a few days with a plot of "read #" vs "mapping time", with the dots coloured by histone modifications :)

3
Entering edit mode
5.7 years ago

Repetitive sequences take much longer to map (assuming that you're not masking the repeats). The input includes the repetitive portion of the genome, while (I'm guessing) the H3K4 sample contains considerably less.

2
Entering edit mode
5.7 years ago
John 13k

As Devon and Harold have suggested, it seems to be the assay itself that determines how long mapping will take.

Here's some data :)

(link to full-screen with log scale)

(link to full-screen with continuous scale)

2
Entering edit mode

Those are some nicely convincing graphs!

2
Entering edit mode

If I recall, some histone marks will predominately pull down low-complexity or repetitive elements (I don't recall precisely which ones, but H3K9me3 rings a bell). This could explain running time differences if the reference being used is not masked.

0
Entering edit mode

Interesting thought Chris. Personally I thought it might correlate with unmapped reads, so I went ahead to see if there was any correlation (as in, the more unmapped reads you have the more BWA spends an inordinate amount of time trying to map things which wont go anywhere).

The result however seems to me like there is not much of a correlation at all. It seems that, as you say, mapping low-complexity reads is more time consuming that mapping high complexity reads, even if the result is they both map.

So I guess the conclusion of this whole thread to me is that some assays take longer than others - because their reads take longer to map. Perhaps this would have been obvious to someone who knows the details of the BWA algorithum, but I find it interesting. I would have thought that if reads were looked up from a sort of lookup table [DNA : position on genome], there would be no reason why two uniquely mappable reads would differ in the time to find a position. I know Burrows Wheeler doesn't use a lookup table, but yeah, i think its cool. Perhaps BWA isnt very smart in how it chooses the initial subfragment of a read to map (like maybe it takes the first 20bp, then keeps adding 1 until it finds a match?). Perhaps it would be better to find the 20bp of highest complexity in the read first, then start mapping. That could possibly make BWA 10-100x faster at mapping low-complexity DNA.