OK, I'm using a read-depth based algorithm for CNV (copy number variation) detection. My general question is: for read-depth based algorithms, should we mask out repeats in the reference genome?
I used to run CNVnator (one read-depth based algorithm) on a BAM without quality filtering (meaning it contained many low-quality read mappings), and I got around 8,000 CNVs (deletions + duplications) for the NA12878 pilot data. Recently I changed my pre-processing pipeline to discard mappings with quality Q < 20 and to remove PCR duplicates (using Picard MarkDuplicates), and I then got 40,000 CNVs for NA12878!
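In case it helps, here is roughly what my filtering step does; a minimal sketch with pysam, assuming placeholder file names input.bam / filtered.bam and that duplicates were already flagged by Picard MarkDuplicates:

```python
import pysam

MIN_MAPQ = 20  # same cutoff as in my pipeline

# Keep only primary, non-duplicate alignments with MAPQ >= 20.
# Duplicates are assumed to carry the 0x400 flag (set by
# Picard MarkDuplicates in an earlier step).
with pysam.AlignmentFile("input.bam", "rb") as bam_in, \
     pysam.AlignmentFile("filtered.bam", "wb", template=bam_in) as bam_out:
    for read in bam_in:
        if read.is_unmapped or read.is_secondary or read.is_duplicate:
            continue
        if read.mapping_quality < MIN_MAPQ:
            continue
        bam_out.write(read)
```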
I then compared the read depth on chr5 between the default and the filtered BAM using IGV (see the picture), and also looked up the newly-identified CNVs in the UCSC browser. I would say those newly-identified CNVs are mostly repeats (LINEs, SINEs). Low-quality mappings tend to pile up in such regions (because multi-mapping reads are placed there more or less at random), so removing these reads makes such regions look like deletions. Am I correct?
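To make this comparison less eyeball-based than IGV, this is the kind of window-by-window check I have in mind; a rough sketch with pysam, where the window size, the 0.2 drop threshold, and the BAM names are all arbitrary placeholders:

```python
import pysam

WINDOW = 1000  # window size in bp; arbitrary choice

def window_counts(bam_path, contig):
    """Count reads overlapping each fixed-size window of `contig`."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        length = bam.get_reference_length(contig)
        for start in range(0, length, WINDOW):
            stop = min(start + WINDOW, length)
            yield start, bam.count(contig, start, stop)

# Compare unfiltered vs. filtered depth window by window; windows
# where the count collapses after filtering are candidates for
# repeat-driven apparent "deletions".
for (start, raw), (_, filt) in zip(window_counts("default.bam", "chr5"),
                                   window_counts("filtered.bam", "chr5")):
    if raw > 0 and filt / raw < 0.2:  # arbitrary drop threshold
        print(f"chr5:{start}-{start + WINDOW}  raw={raw}  filtered={filt}")
```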
I think I still need to remove those bad-quality mappings, but I should also deal with the repeats somehow. If so, for any read-depth CNV algorithm, should we first mask repeats in the reference genome? Or should we instead discard predicted CNVs that fall in repeats, because such calls are unreliable?
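If post-hoc filtering is the way to go, I imagine something like the following; a minimal sketch, assuming a merged (non-overlapping) RepeatMasker track in repeats.bed and CNV calls in cnvs.bed (both hypothetical filenames), with a made-up 50% overlap cutoff:

```python
MAX_REPEAT_FRACTION = 0.5  # arbitrary cutoff for "mostly repeat"

def load_bed(path):
    """Load BED intervals into a dict of chrom -> sorted (start, end) list."""
    intervals = {}
    with open(path) as fh:
        for line in fh:
            chrom, start, end = line.split()[:3]
            intervals.setdefault(chrom, []).append((int(start), int(end)))
    for ivs in intervals.values():
        ivs.sort()
    return intervals

repeats = load_bed("repeats.bed")  # assumed merged, so no double-counting

with open("cnvs.bed") as fh:
    for line in fh:
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        # Sum the bases of this CNV call covered by repeat intervals.
        covered = sum(min(end, r_end) - max(start, r_start)
                      for r_start, r_end in repeats.get(chrom, ())
                      if r_start < end and r_end > start)
        if covered / (end - start) > MAX_REPEAT_FRACTION:
            print(f"dropping {chrom}:{start}-{end} "
                  f"({covered / (end - start):.0%} repeat)")
```

The same idea is usually done with bedtools intersect against the UCSC RepeatMasker track; the sketch above just spells out the overlap logic.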
See the picture: http://www.freeimagehosting.net/d9bea