I recently processed some ChIP-seq data that had quite a few issues. The run consisted of two ChIP replicates and two input controls. The samples were supposed to have equal numbers of reads, but the two input controls accounted for about 90% of the reads. There were other problems as well: the GC content of the ChIP samples was oddly high, much higher than the reference. A histogram of the per-read GC content of the ChIP samples (from fastqc) showed a curve that wasn't even vaguely Gaussian; it looked like several overlapping Gaussians. Two over-represented sequences were identified by fastqc as TruSeq adapters.
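For what it's worth, the per-read GC distribution that fastqc plots is easy to reproduce by hand, which is how I convinced myself the multimodality was real and not a plotting artifact. A minimal sketch (this assumes uncompressed FASTQ text with the standard 4-line records; fastqc's own binning and smoothing differ slightly):

```python
from collections import Counter

def read_gc_fractions(fastq_text):
    """Per-read GC fraction from FASTQ text (sequence is every 2nd of 4 lines)."""
    lines = fastq_text.strip().splitlines()
    gcs = []
    for i in range(1, len(lines), 4):  # sequence lines of each 4-line record
        seq = lines[i].upper()
        if seq:
            gcs.append((seq.count("G") + seq.count("C")) / len(seq))
    return gcs

def gc_histogram(gcs, bins=10):
    """Bin GC fractions into a simple count histogram (list indexed by bin)."""
    counts = Counter(min(int(g * bins), bins - 1) for g in gcs)
    return [counts.get(b, 0) for b in range(bins)]
```

A unimodal library gives one roughly Gaussian bump in this histogram; adapter dimers or a contaminating genome with a different GC content show up as extra modes.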
So, recall that only 10% of my run consisted of reads from the ChIP samples, which amounted to about 10 million reads. After aligning with bwa, removing duplicates with picard, and filtering for proper pairs with samtools, I ended up with 2 million reads. That is very low sequencing depth for my genome: about 2x in total, roughly 1x for each ChIP sample.
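The depth figure is just the usual back-of-the-envelope calculation, total sequenced bases over genome size. A sketch, where the read length and genome size below are hypothetical placeholders (I haven't given my actual values), chosen only so the numbers echo the ~2x above:

```python
def mean_depth(n_reads, read_length_bp, genome_size_bp):
    """Expected mean sequencing depth: total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp

# Hypothetical example: 2 million surviving reads of 100 bp over a 100 Mbp
# genome works out to ~2x combined, i.e. ~1x per ChIP replicate.
depth = mean_depth(2_000_000, 100, 100_000_000)  # 2.0
```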
The peaks actually look somewhat reasonable. There are a few nice peaks in some of the kinds of places we expect, such as around 5' regions.
Would you trust a run like this?
What general principles do you use when deciding whether to accept a run or call it bad? My gut feeling is that this run is bad and the data are unreliable, but I would rather give my wet-lab colleagues something more concrete than a gut feeling.
I also think the coverage is much too low. Would the consensus be that two samples at roughly 1x depth each is too low?
EDIT: Some more details about my run.
As a percentage of mapped reads, I had 85% duplicates in one ChIP replicate and 57% in the other; I removed the duplicates with picard. For the input controls, the duplication rates were 10% and 25%.
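In case it matters how I got those percentages: they are simply duplicate reads over mapped reads, which I pulled out of `samtools flagstat` output. A small parsing sketch, assuming the classic (pre-1.13) flagstat line format such as `85 + 0 duplicates`; newer samtools versions add "primary" variants of these lines, which this does not handle:

```python
import re

def flagstat_dup_rate(flagstat_text):
    """Return duplicates / mapped from classic `samtools flagstat` output.

    Assumes lines of the form 'N + M <category> ...', e.g. '85 + 0 duplicates'.
    """
    counts = {}
    for line in flagstat_text.splitlines():
        m = re.match(r"(\d+) \+ (\d+) (\w[\w ]*?)(?: \(|$)", line)
        if m:
            counts[m.group(3)] = int(m.group(1)) + int(m.group(2))
    return counts["duplicates"] / counts["mapped"]
```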