I called SNPs from Illumina paired-end reads mapped to a reference genome, using the variant caller VarScan. Without removing PCR duplicates, about 25K SNPs were identified. But when I ran it with all the same parameters after removing PCR duplicates with Picard, I ended up with 27K SNPs. Why does the number of identified SNPs rise? I thought removing PCR duplicates would decrease the number of SNPs. Please let me know your thoughts.
It is possible. Take a hypothetical region with 100X coverage, of which, say, 40 reads are PCR duplicates carrying the reference base, and, say, 10 of the other reads carry a variant.
If you don't remove the PCR duplicates, the variant fraction is 10% (10 of 100 reads). If you do remove them, the variant fraction increases (to about 17%, 10 of the remaining 60 reads, if all 40 duplicates are dropped) and can cross the threshold needed to be called as a variant.
If such cases are common in your data, you can see more SNVs (not SNPs) after PCR duplicate removal.
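The arithmetic in the hypothetical above can be sketched in a few lines of Python. This is only an illustration, not how VarScan computes its calls: the function names are made up, the 15% cutoff is an arbitrary value chosen for the example (VarScan exposes a minimum variant frequency setting, `--min-var-freq`, but the value used in the question is not given), and it assumes all 40 duplicate reads are dropped.

```python
def variant_fraction(variant_reads, total_reads):
    """Fraction of reads at a position that support the variant allele."""
    return variant_reads / total_reads


def is_called(variant_reads, total_reads, min_var_freq):
    """True if the variant fraction reaches the frequency threshold."""
    return variant_fraction(variant_reads, total_reads) >= min_var_freq


# Numbers from the hypothetical region described above.
TOTAL_READS = 100          # coverage before duplicate removal
DUPLICATE_REF_READS = 40   # PCR duplicates carrying the reference base
VARIANT_READS = 10         # non-duplicate reads supporting the variant

# Assumption: an illustrative threshold, not a value taken from the question.
MIN_VAR_FREQ = 0.15

# Assumption: every one of the 40 duplicate reads is removed.
reads_after_dedup = TOTAL_READS - DUPLICATE_REF_READS

before = variant_fraction(VARIANT_READS, TOTAL_READS)
after = variant_fraction(VARIANT_READS, reads_after_dedup)

print(f"before dedup: {before:.1%}, called: "
      f"{is_called(VARIANT_READS, TOTAL_READS, MIN_VAR_FREQ)}")
print(f"after dedup:  {after:.1%}, called: "
      f"{is_called(VARIANT_READS, reads_after_dedup, MIN_VAR_FREQ)}")
```

With these numbers the position sits below the cutoff before duplicate removal and above it afterwards, which is the effect described above. In practice a duplicate-marking tool typically keeps one read from each duplicate set, so the real shift would be somewhat smaller.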
I'm not familiar with how Picard or VarScan work, but my guess would be that removing duplicates skews the distribution of reads. If you have high coverage, Picard might remove many reads that merely happen to start at the same position, treating them as PCR duplicates. (You're not talking about RNA sequences, are you? That might explain it.) Reads with different variants would not be identified as duplicates, and this would even out the distribution of alleles. VarScan would then report rare variants that are now less rare, relatively speaking. Usually, variant calling is statistics-heavy stuff, and I'd not expect this kind of weakness, so this explanation is likely wrong :-)
Personally, I'd not remove duplicates unless I had reason to believe there are many of them - and if you get many duplicates from Illumina, you are Doing It Wrong.