Question: The Need For Repeat Masking In Read-Depth Based CNV Detection Algorithms
7.6 years ago, Bioscientist (1.7k) wrote:

OK, I'm using a read-depth based algorithm for CNV (copy number variation) detection. My general question is: for read-depth based algorithms, should we mask out repeats in the reference genome?

I used to run CNVnator (a read-depth based algorithm) on a BAM file without quality filtering (so it contained many low-quality read mappings), and I got around 8,000 CNVs (deletions + duplications) for the NA12878 pilot data. Recently I changed my data pre-processing pipeline, which now discards mappings with quality Q < 20 and removes PCR duplicates (using Picard MarkDuplicates). I then got 40,000 CNVs for NA12878!
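For readers who want to see what the filtering step amounts to, here is a minimal sketch in plain Python working on SAM text lines. It is not the poster's actual pipeline: the function name `filter_sam_lines` and the default threshold argument are made up for illustration. It assumes duplicates have already been flagged (Picard MarkDuplicates marks them with the 0x400 SAM flag by default rather than deleting them).

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit meaning "PCR or optical duplicate"


def filter_sam_lines(lines, min_mapq=20):
    """Yield header lines plus alignments that pass both filters.

    Hypothetical helper for illustration: keeps an alignment only if its
    MAPQ (SAM column 5) is >= min_mapq and its FLAG (SAM column 2) does
    not have the duplicate bit set.
    """
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])
        if mapq < min_mapq:
            continue  # low-confidence placement, often inside a repeat
        if flag & DUPLICATE_FLAG:
            continue  # already marked as a duplicate
        yield line
```

Reads that match several locations equally well typically get a very low MAPQ, so a filter like this removes most of the depth inside repeats, which is consistent with the jump in calls described above.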

I then compared the read depth on chr5 for the default and the filtered BAM files using IGV (see the picture), and also looked up the newly identified CNVs in the UCSC browser. I would say the newly identified CNVs are mostly in repeats (LINE, SINE). Low-quality mappings tend to aggregate in such regions (because a read that matches several locations equally well is placed at one of them at random), and removing these reads makes such regions look like deletions. Am I correct?

I think I still need to remove the low-quality mappings, but I also need to deal with the repeats. If so, for any read-depth CNV algorithm, should we first mask repeats in the reference genome? Or should we discard predicted CNVs that fall in repeats from the results because they are unreliable?
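The second option (filtering calls after the fact) could be sketched roughly as below. This is only an illustration under stated assumptions: intervals are half-open `[start, end)` on the same chromosome, the repeat list would come from some annotation such as a RepeatMasker track, and the function name `repeat_fraction` is hypothetical. Any cutoff on the returned fraction is a choice you would have to make yourself.

```python
def repeat_fraction(call_start, call_end, repeats):
    """Fraction of the CNV call [call_start, call_end) covered by repeats.

    `repeats` is an iterable of (start, end) half-open intervals on the
    same chromosome. Overlapping repeat intervals are not double counted.
    """
    covered = 0
    last_end = call_start  # everything left of this is already accounted for
    for start, end in sorted(repeats):
        start = max(start, last_end)  # clip to the call and to counted bases
        end = min(end, call_end)
        if end > start:
            covered += end - start
            last_end = end
    return covered / (call_end - call_start)
```

A call that is, say, almost entirely inside annotated LINE/SINE elements could then be flagged for inspection instead of being trusted outright.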


See the picture:

7.6 years ago, Stefano Berri (4.1k), Cambridge, UK, wrote:

Some thoughts:

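It is very important to compare your sample to a "normal"; often a pool of normals is used. Coverage depends on mappability, and a normal processed in the same way shows the same mappability-driven dips, so they cancel out in the comparison. If you really can't get a matched normal, use any normal. Alternatively, you can correct for mappability directly.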
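To make the comparison to a normal concrete, here is a rough sketch of a per-bin log2 ratio. It assumes both samples have already been reduced to read counts over the same fixed bins; the function name `log2_ratios`, the median scaling, and the pseudocount are illustrative choices, not what any particular caller does.

```python
import math
from statistics import median


def log2_ratios(sample_depth, normal_depth, pseudocount=0.5):
    """Per-bin log2(sample / normal) after scaling each track by its median.

    A bin where both samples dip (for example a poorly mappable repeat)
    ends up near 0; a bin where only the sample dips goes negative.
    The pseudocount avoids division by zero in empty bins.
    """
    s_med = median(sample_depth)
    n_med = median(normal_depth)
    ratios = []
    for s, n in zip(sample_depth, normal_depth):
        s_norm = (s + pseudocount) / (s_med + pseudocount)
        n_norm = (n + pseudocount) / (n_med + pseudocount)
        ratios.append(math.log2(s_norm / n_norm))
    return ratios
```

With this kind of normalisation, a repeat region that loses depth in both the sample and the normal no longer looks like a deletion, whereas a bin at half the expected depth in the sample only comes out close to -1.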

About 50% of the human genome is considered repetitive, in some cases in very large regions, so masking repeats would lose a lot of your true positives. Furthermore, masking, say, 50 bp means losing coverage over approximately 200 bp around that repeat if your reads are 80 bp long.
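The "approx 200 bp" figure can be checked with a little arithmetic. The sketch below assumes that every read overlapping at least one masked base fails to map there, which is a simplification (a real aligner may still place reads that overlap the mask by only a few bases). Function names are made up for illustration.

```python
def affected_span(mask_len, read_len):
    """Width of the region whose depth drops when mask_len bases are masked.

    Any read touching the mask is assumed lost, so depth is reduced in the
    mask itself plus (read_len - 1) bases on each side.
    """
    return mask_len + 2 * (read_len - 1)


def fraction_lost(distance, read_len):
    """Fraction of depth lost at a position `distance` bp outside the mask.

    Of the read_len possible reads covering that position, those that also
    reach back into the mask number (read_len - distance).
    """
    return max(0, read_len - distance) / read_len


print(affected_span(50, 80))  # 50 + 2 * 79 = 208
```

So a 50 bp mask with 80 bp reads affects about 208 bp, with the loss tapering off linearly from the mask edge and reaching zero 80 bp away.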


Sorry, but what do you mean by "normal" here?

written 7.6 years ago by Bioscientist (1.7k)

I would consider DNA from the blood of other unrelated healthy individuals. You can also use data from the 1000 Genomes Project. Just make sure you trim the reads to the same length and re-align them with the same aligner.

written 7.6 years ago by Stefano Berri (4.1k)