I want to preface this by saying that I'm relatively new to NGS analysis.
I recently received raw WXS data (paired-end 100 bp reads at 100X coverage, ~12 Gb of data, captured with the Agilent SureSelect All Human Exon V5 kit), and I noticed something was really off from the get-go.
The file sizes of the normal and tumor pair are wildly different. The read 1 and read 2 FASTQs of the normal sample are about 7 GB each, while reads 1 and 2 from the tumor sample are about 40 GB each. I've worked with exome data before, and the pairs were usually close in size.
Anyway, I assumed everything was okay and put the raw data through the usual pipeline: alignment, sorting, duplicate marking, indel realignment, mate-information fixing, base recalibration, etc. Almost 45% of the reads failed the DuplicateReadFilter (GATK BaseRecalibrator), and another 45% failed the MappingQualityZeroFilter (also BaseRecalibrator)! I have to filter out >90% of my sequence reads!
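As a rough sanity check on those numbers (a sketch only; the 45% figures are the per-filter fractions quoted above, and GATK applies its read filters sequentially, so the counts it logs are non-overlapping and can simply be summed):

```python
# Per-filter fractions reported in the BaseRecalibrator log (approximate).
# Because GATK applies read filters sequentially, each read is counted
# against at most one filter, so the fractions can be added directly.
frac_duplicate = 0.45   # failed DuplicateReadFilter
frac_mq_zero = 0.45     # failed MappingQualityZeroFilter

frac_removed = frac_duplicate + frac_mq_zero
frac_surviving = 1 - frac_removed
print(f"Reads surviving all filters: ~{frac_surviving:.0%}")  # ~10%
```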
In my previous runs, I've filtered out at most ~10%.
This made me run FastQC on the reads, which I should have done in the beginning. In the sequence duplication level section, FastQC reports that only 25% of the sequences would remain if deduplicated! I also see double peaks in the per-sequence GC content plot!
All this is new to me. I'm used to seeing yellow bars in the per-base sequence quality section, but I don't see them here for some reason.
Would someone show me the ropes?
I suspect that something has gone wrong during the DNA extraction and/or library preparation, including the possibility that unequal starting concentrations were used; hence the disproportionate difference in base output. Then again, if you accuse the lab of having done something wrong, the first response you'll get will most likely be a flat-out denial that anything went wrong.
You could ask for things like DNA purity readings (e.g. from NanoDrop) and a gel image (if it was run), and also standard metrics from the sequencer, such as PF reads, Q30 bases, cluster density, total output, etc. Just say that you need them for your 'reports'.
Out of curiosity, what were the general stats from alignment?
47803854 + 0 in total (QC-passed reads + QC-failed reads)
20972021 + 0 duplicates
24883875 + 0 mapped (52.05%:-nan%)
47803854 + 0 paired in sequencing
23901927 + 0 read1
23901927 + 0 read2
20554832 + 0 properly paired (43.00%:-nan%)
23992038 + 0 with itself and mate mapped
891837 + 0 singletons (1.87%:-nan%)
40520 + 0 with mate mapped to a different chr
32160 + 0 with mate mapped to a different chr (mapQ>=5)
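If you want to pull rates out of that report programmatically rather than eyeballing it, here's a minimal sketch that parses standard `samtools flagstat` output (the text below is just your pasted report; the little label-matching helper is my own, not part of samtools):

```python
# Standard `samtools flagstat` output: each line is "<passed> + <failed> <label>".
flagstat = """47803854 + 0 in total (QC-passed reads + QC-failed reads)
20972021 + 0 duplicates
24883875 + 0 mapped (52.05%:-nan%)
47803854 + 0 paired in sequencing
23901927 + 0 read1
23901927 + 0 read2
20554832 + 0 properly paired (43.00%:-nan%)
23992038 + 0 with itself and mate mapped
891837 + 0 singletons (1.87%:-nan%)
40520 + 0 with mate mapped to a different chr
32160 + 0 with mate mapped to a different chr (mapQ>=5)"""

def flagstat_count(label):
    """Return the QC-passed count from the first line containing `label`."""
    for line in flagstat.splitlines():
        if label in line:
            return int(line.split(" + ")[0])
    raise ValueError(f"no flagstat line matching {label!r}")

total = flagstat_count("in total")
dup_rate = flagstat_count("duplicates") / total
# Match "mapped (" so we don't accidentally hit "mate mapped" lines.
map_rate = flagstat_count("mapped (") / total
print(f"duplicate rate: {dup_rate:.2%}, mapping rate: {map_rate:.2%}")
```

That should print a ~44% duplicate rate and the same 52.05% mapping rate flagstat already shows, i.e. roughly half the library is duplicates and only half of it maps at all.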
Okay, this just increases my suspicion that something went wrong during sample extraction or library prep. You should cautiously approach the lab and just ask for more run statistics, like the ones that I mentioned.
The last time that I saw a sample with those types of stats, it was later confirmed as a tricky sample that had been left too long during one particular process, for whatever reason.
You can also, of course, randomly extract 100-500 reads from the FASTQ file and BLAST them (https://blast.ncbi.nlm.nih.gov/Blast.cgi) out of interest. There could be contamination of some sort, but I feel that it's most likely just ~50% 'junk' / degraded DNA.
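One way to grab that random subset without any external tools is reservoir sampling over the FASTQ records. A sketch, assuming an uncompressed FASTQ with plain 4-line records (no wrapped sequence lines); the file name in the usage comment is hypothetical:

```python
import random

def sample_fastq(lines, n, seed=42):
    """Reservoir-sample n records from an iterable of FASTQ lines.
    Assumes plain 4-line records: header, sequence, '+', qualities."""
    rng = random.Random(seed)
    reservoir = []
    record = []
    seen = 0
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            seen += 1
            if len(reservoir) < n:
                reservoir.append(record)
            else:
                # Replace an existing record with probability n/seen.
                j = rng.randrange(seen)
                if j < n:
                    reservoir[j] = record
            record = []
    return reservoir

# Usage (hypothetical path); paste the printed sequences into BLAST:
# with open("tumor_R1.fastq") as fh:
#     for rec in sample_fastq(fh, 500):
#         print(rec[1])  # rec[1] is the sequence line
```

Reservoir sampling keeps memory flat no matter how big the 40 GB file is; fixing the seed just makes the subset reproducible if you want to re-BLAST the same reads later.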