Question

ChIPseq paired-end bowtie2 concern regarding biological replicates

0

Entering edit mode

9.7 years ago

NHEJ ▴ 360

I am working with ChIP-seq paired-end data where there is concern that one or more of the biological replicates may not be very good, but it is unknown which replicate may have a problem (I suspect there has to be at least one poor replicate in the data). The first part of my question is very simple: what do you recommend that I do to find this replicate to either toss it or fix it with some quality trimming on the ends?

For the moment, I tried using bowtie2 to trim 10 bp from the 5' and 3' ends of the reads in each of my samples just to see whether this fixes my problem. To define what I mean by "problem": basically, my final results (gene list) does not come out as I would expect it to come out (there are no genes of a certain type that I am looking for based on my biological intuition for what I should be seeing). When I run bowtie2 with the trimming options set, I do indeed get my .sam files okay, but my error file tells me:

(ERR): bowtie2-align died with signal 2 (INT)
20172305 reads; of these:
  20172305 (100.00%) were paired; of these:
    3064536 (15.19%) aligned concordantly 0 times
    13699211 (67.91%) aligned concordantly exactly 1 time
    3408558 (16.90%) aligned concordantly >1 times
    ----
    3064536 pairs aligned concordantly 0 times; of these:
      807633 (26.35%) aligned discordantly 1 time
    ----
    2256903 pairs aligned 0 times concordantly or discordantly; of these:
     4513806 mates make up the pairs; of these:
        3111731 (68.94%) aligned 0 times
        617853 (13.69%) aligned exactly 1 time
        784222 (17.37%) aligned >1 times
92.29% overall alignment rate

I am not sure whether I can be happy with this alignment for this ChIPseq experiment (even though I have heard that ChIPseq experiments sometimes give at best 50-60% alignment). My overall alignment rate is always greater than 90% in each sample so maybe I have nothing to fear (or maybe this is too much?). Perhaps I shouldn't be trimming reads in the first place? So this is where my other concerns come in: is this alignment "good enough" to proceed further in my analysis (ultimately culminating in a new gene list)? Or should I run bowtie2 with other options like --qc-filter (filter out reads that are bad according to QSEQ filter)? Perhaps there is something else I can try besides for experimenting with bowtie2 parameters?

EDIT: I examined my error file for one of my samples for the cases of 1) no trim at all 2) trim 10 bp 3) trim 20 bp. In each case, the error file is literally identical to the one above, the only little difference being a very slight increase in the "aligned >1 times" category (which makes biological sense since the reads get shorter and thus less specific in terms of mapping to different places on the reference genome). Also, the resultant sam file becomes smaller the more base pairs you trim from both sides.

bowtie2 ChIP-Seq paired-end • 3.7k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by NHEJ ▴ 360

score 2 · Answer 1 · 2014-08-03

Hello VKX,

Why do you think one of your samples must be wrong? It is not healthy to be so suspicious of your own data ;)
Why do you think trimming 10bp off the ends of your reads would make things better?
What are you ChIP'ing, and what genes did you expect to see but aren't?

If you are worried about the quality of your data, the first thing to run is FastQC to see if you even have a problem. If you don't have any problems in quality of the first few bp of the reads, or with over represented sequences, chances are that trimming your reads will only make mapping worse. I mean, if you really want to get the overall mapping rate up, you could consider trimming all but the last base of your reads - I guarantee that this is an effective strategy for improving overall alignment rates.
Failing that somewhat "extreme" strategy, you could run your data through some of the early QC tools provided by deepTools - particularly the correlation plots between multiple BAM/FastQ files. If all but one of your samples are clustered together but one is significantly different, you know which sample is not playing ball. I'd also throw in a few datasets from external sources which should be somewhat similar to your dataset, just to make sure.

And if after all that your reads look good - well, your reads look good! As they say, the best measure of quality is mapped reads, and you're doing well on that metric - so if your biologically-soughtafter reads aren't there... well, you're about to learn something.... even if it's just the importance of always labelling your tubes ;)