I stumbled across something I had never seen before in raw FASTQ files during QC. We are working with ChIP-Seq samples: Mutant (MAR) vs WT, 3 replicates each, with 1 IP and 1 Input (IN) per replicate, for a total of 12 samples.
A pattern emerges in the mean quality score per position of the raw reads: the first 35 bp are of somewhat low quality, then quality increases suddenly and stays high until the end of the read (75 bp), as shown in the per-position mean quality plot.
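For anyone who wants to reproduce the per-position curve without a QC tool, here is a minimal Python sketch. It assumes Phred+33 encoding and strict 4-line FASTQ records (no wrapped sequences); the function names are made up for illustration and are not part of any QC package.

```python
import gzip
from typing import Iterable, List


def mean_quality_per_position(quality_lines: Iterable[str], offset: int = 33) -> List[float]:
    """Return the mean Phred score at each read position.

    Assumes Phred+33 quality strings (offset=33), as in recent Illumina output.
    """
    sums: List[int] = []
    counts: List[int] = []
    for qual in quality_lines:
        qual = qual.rstrip("\n")
        # Grow the accumulators when a longer read is seen.
        if len(qual) > len(sums):
            extra = len(qual) - len(sums)
            sums.extend([0] * extra)
            counts.extend([0] * extra)
        for i, ch in enumerate(qual):
            sums[i] += ord(ch) - offset
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


def fastq_quality_lines(path: str) -> Iterable[str]:
    """Yield the quality line (every 4th line) of a FASTQ file, gzipped or not."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 3:
                yield line
```

Calling `mean_quality_per_position(fastq_quality_lines(path))` on each sample gives one curve per sample, which makes it easy to check whether the jump sits at exactly the same position in all 12 files.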
Here is what I know about the samples:
Samples were prepared using "TruSeq® ChIP Library Preparation Kit" (with 1% PhiX Control) and sequenced with Illumina NextSeq (single end 75 bp).
During sample preparation, after several steps, we reached a point with ~10M cells in 350 µL. Of this, 50 µL were used for the Input (~1.4M cells) and 2 x 150 µL (~2 x 4.3M cells) for IP (one IP directed against the protein of interest and one non-specific IP for background noise).
Any suggestions for processing those reads? Should I trim the first 35 bp regardless of the sample?
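To make the trimming question concrete, this is what a fixed crop of the read start would do to each record. It is only a sketch of the mechanics in Python, not a recommendation to trim; `crop_read_start` is a hypothetical helper, and a 75 bp read would be left with 40 bp.

```python
from typing import Tuple


def crop_read_start(seq: str, qual: str, n: int = 35) -> Tuple[str, str]:
    """Drop the first n bases of a read together with their quality values."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have the same length")
    # Slicing keeps sequence and quality in register.
    return seq[n:], qual[n:]


# A 75 bp read cropped by 35 bp keeps only its last 40 bases.
seq, qual = crop_read_start("A" * 75, "I" * 75)
print(len(seq), len(qual))
```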
I am editing the original post to take into account the good suggestions made in the replies, especially @Devon's point about depth bias.
The 5 samples which seem to have the best raw_mean_qual_scores_plot are
MAR_IN_3. They are also the samples with the highest number of raw reads.
These 5 samples also happen to be the ones flagged as potential outliers on PCA plots generated from read counts from un-normalized BAMs.
Below are the relevant options of the commands used to produce a PCA plot of read counts from normalized bigWigs (mm10, normalized to 1x coverage) using deepTools 3.0.1 (raw reads, with neither trimming nor duplicate removal):
bamCoverage --binSize 10 \
    --effectiveGenomeSize 2150570000 \
    --normalizeUsing RPGC \
    --ignoreForNormalization chrX

multiBigwigSummary bins \
    --chromosomesToSkip chrX
NB: I tried multiBigwigSummary with --binSize 10, but it took far too long.
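A rough back-of-the-envelope count shows why the 10 bp run is so slow. The Python sketch below uses the effective genome size quoted above as a proxy for the region being binned (an assumption, since bins actually tile the full chromosomes) and compares 10 bp bins with 10,000 bp bins, which as far as I know is the multiBigwigSummary default.

```python
GENOME_SIZE_BP = 2_150_570_000  # effective genome size from the command above, used as a rough proxy
N_SAMPLES = 12


def n_bins(genome_size_bp: int, bin_size_bp: int) -> int:
    """Number of fixed-size bins needed to tile the genome (ceiling division)."""
    return -(-genome_size_bp // bin_size_bp)


for bin_size in (10, 10_000):
    bins = n_bins(GENOME_SIZE_BP, bin_size)
    print(f"binSize={bin_size}: ~{bins:,} bins, ~{bins * N_SAMPLES:,} values across {N_SAMPLES} samples")
```

With 10 bp bins the summary has to hold roughly a thousand times more values per sample than with 10 kb bins, so a much longer runtime is expected.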