Hello, I have FASTQ files from an easy Hi-C sequencing experiment; the samples are mouse embryonic stem cells (ESCs) and neurons. We used UMIs (unique molecular identifiers) to tag the reads, and the sequencing was paired-end.
I am new to the Hi-C field and am wondering how to interpret the FastQC and MultiQC reports, so that I can confirm the sequencing went well before continuing with the analysis. The per-base and per-sequence quality scores were very good for all samples, but I would like advice on two other sections: the duplication levels in the general statistics and the GC content.
1) Is that level of duplication expected? I discussed it with a colleague, and we thought it could be because we have not yet used the UMIs to sort the reads, so many reads may be counted as duplicates at this point. Does this make sense?
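To make the reasoning in question 1 concrete, here is a minimal Python sketch of how counting duplicates by sequence alone can give a higher rate than counting by UMI plus sequence. The reads, UMIs, and the `duplication_rate` helper are made up for illustration; this is not how FastQC or any deduplication tool is implemented, only the idea behind the question.

```python
from collections import Counter

# Toy reads as (UMI, sequence) pairs. All values are invented for illustration.
reads = [
    ("AACG", "GATCGGAAGT"),
    ("AACG", "GATCGGAAGT"),  # same UMI and same sequence: likely a PCR duplicate
    ("TTGC", "GATCGGAAGT"),  # same sequence, different UMI: a distinct molecule
    ("CCAT", "GATCGGAAGT"),  # same sequence, different UMI: a distinct molecule
    ("GGTA", "ACCTGATCAA"),
]


def duplication_rate(keys):
    """Return the fraction of items that repeat an already-seen key."""
    keys = list(keys)
    unique = len(Counter(keys))
    return 1 - unique / len(keys)


# Sequence-only counting ignores the UMI entirely.
seq_only = duplication_rate(seq for _, seq in reads)

# UMI-aware counting treats a read as a duplicate only if UMI and sequence both match.
umi_aware = duplication_rate(reads)

print(f"sequence-only duplication: {seq_only:.2f}")
print(f"UMI-aware duplication:     {umi_aware:.2f}")
```

In this toy set, four of the five reads share a sequence, but only two of them also share a UMI, so the sequence-only rate is higher than the UMI-aware rate.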
2) For the GC content plot, please ignore the three red peaks at the top (they come from the three undetermined-sequence files). My question is mostly about the curves that I am pointing at with sky-blue arrows. Some ESC and neuron files have a bell-shaped curve (left arrow), whereas others have a strange, square-like shape (right arrow). I am wondering whether something is wrong with the files that show this square-like distribution, or whether this is somewhat expected in this kind of experiment.
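For reference, the kind of per-read GC distribution discussed in question 2 can be sketched in a few lines of Python. This is only a rough illustration, assuming the plot is a histogram of per-read GC percentage; the function names, the 10% bin width, and the toy reads are my own assumptions and do not reproduce FastQC's exact calculation.

```python
from collections import Counter


def gc_percent(seq):
    """Return the GC percentage of a read, ignoring N and other non-ACGT symbols."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return None  # no called bases, so GC content is undefined
    gc = sum(base in "GC" for base in called)
    return 100 * gc / len(called)


def gc_histogram(seqs, bin_width=10):
    """Count reads per GC bin; keys are the lower edge of each bin in percent."""
    hist = Counter()
    for seq in seqs:
        pct = gc_percent(seq)
        if pct is None:
            continue
        # Put reads with exactly 100% GC into the last bin.
        lower_edge = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        hist[lower_edge] += 1
    return dict(sorted(hist.items()))


# Toy reads, invented for illustration.
toy_reads = ["ATATATATAT", "GCGCGCGCGC", "ATGCATGCAT", "GGGCCCATAT", "NNNNNNNNNN"]
hist = gc_histogram(toy_reads)
print(hist)
```

Running the same function on reads from a bell-shaped file and from a square-like file would allow the two distributions to be compared outside of the report.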
Thank you for your time! Best regards.