My question concerns read coverage at two stages: first after mapping, and second after post-alignment processing. The main aim of my project is to identify somatic mutations, so I would expect coverage after post-alignment processing to need to be good. My variant-calling pipeline is fairly standard, run with default parameters and LENIENT validation stringency, and consists of the following steps:
1. Trimming (Trimmomatic 0.35)
2. Alignment (bowtie2)
3. AddOrReplaceReadGroups, MarkDuplicates (Picard)
4. RealignerTargetCreator, IndelRealigner (GATK)
5. FixMateInformation (Picard)
6. HaplotypeCaller (GATK)
After step 2, I found approximately 1000 reads for the exome of interest. However, after steps 3, 4, and 5, fewer than 10 reads remain. I am not sure of the reason, but we traced the drop in read count to the MarkDuplicates step.
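To pin down where the reads go, one check I could run is to count how many mapped reads carry the duplicate flag after MarkDuplicates. Below is a minimal Python sketch, assuming the input is SAM text (for example, the output of `samtools view` on the region of interest) and that duplicates are flagged (bit 0x400) rather than removed from the file. The function name `count_duplicates` and the toy records are hypothetical and only for illustration.

```python
import io

DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate
UNMAPPED = 0x4     # SAM flag bit: read is unmapped


def count_duplicates(sam_lines):
    """Return (mapped, duplicates) read counts from an iterable of SAM text lines."""
    mapped = 0
    duplicates = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        # The FLAG field is the second tab-separated column of a SAM record.
        flag = int(line.split("\t")[1])
        if flag & UNMAPPED:
            continue
        mapped += 1
        if flag & DUPLICATE:
            duplicates += 1
    return mapped, duplicates


# Toy input (made-up records): one normal read, one duplicate, one unmapped.
example = io.StringIO(
    "@HD\tVN:1.6\n"
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)
mapped, duplicates = count_duplicates(example)
print(f"mapped={mapped} duplicates={duplicates}")
```

The same function could be given `sys.stdin` instead of the toy input, so that the output of `samtools view` on the BAM file from step 3 can be piped into it. If nearly all mapped reads turn out to be flagged as duplicates, that would be consistent with the drop I see after MarkDuplicates.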
Is there any paper or discussion that suggests the minimum and maximum number of reads required or acceptable for identifying somatic mutations?
Is my pipeline causing this problem? Would samtools + VarScan2, without steps 3 and 4, be a better solution? I think the number of reads is too low to identify somatic mutations, and because of that we are unable to find the known mutations in the annotated VCF.
Any help or suggestions would be appreciated.