Regions of the genome that may include CNVs
9.2 years ago
alesssia ▴ 580

Dear all,

I am preparing a list of regions of the genome that are likely to include CNVs. To do so I am excluding assembly gaps, regions with poor mappability, and repeat regions as reported in UCSC. I know from the literature that regions near centromeres/telomeres, and those with low or high GC content, are also worth excluding. My questions are: a) what does "near" a centromere/telomere mean? b) what are meaningful thresholds for GC content? and c) is there any other feature I should be aware of?

Thank you very much!

9.2 years ago

If you did not align against a repeat-masked genome, you may want to exclude repeat regions as well (available through the UCSC tracks). Doing so, together with excluding gaps, would automatically handle the centromere/telomere regions.
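As a rough illustration of what "exclude gaps and repeats" amounts to, here is a minimal sketch that merges the excluded intervals and subtracts them from one chromosome. The coordinates are toy values rather than real UCSC annotations, the function names are made up for this example, and intervals are assumed to be half-open as in BED files.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED files


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(chrom_length: int, excluded: List[Interval]) -> List[Interval]:
    """Return the parts of [0, chrom_length) not covered by any excluded interval."""
    kept: List[Interval] = []
    cursor = 0
    for start, end in merge(excluded):
        if start > cursor:
            kept.append((cursor, min(start, chrom_length)))
        cursor = max(cursor, end)
        if cursor >= chrom_length:
            break
    if cursor < chrom_length:
        kept.append((cursor, chrom_length))
    return kept


# Toy coordinates, not real UCSC annotations.
gaps = [(0, 100), (900, 1000)]      # e.g. gap entries at both chromosome ends
repeats = [(80, 150), (400, 450)]   # e.g. repeat annotations
allowed = subtract(1000, gaps + repeats)
print(allowed)
```

In practice the same operation is usually run on BED files with an interval tool; the point here is only that gap entries at the chromosome ends and around the centromere drop out together with everything else in the excluded set.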

Most CNV callers handle GC content automatically by normalizing read counts against it, so you don't need to filter out high-GC regions of the genome.


I am not selecting regions to run a CNV caller, but to create a null distribution for a set of statistical analyses, so knowing how to deal with GC content is important.
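No GC threshold is given anywhere in this thread, so rather than guess one, a possible first step is to compute the GC fraction per window (GC content being the quantity asked about in the original question) and look at its distribution. The sketch below assumes fixed, non-overlapping windows and ignores ambiguous bases such as N; the window size and the toy sequence are arbitrary.

```python
from typing import List, Optional


def gc_fraction(seq: str) -> Optional[float]:
    """GC fraction over unambiguous bases; None if the window has no A/C/G/T."""
    seq = seq.upper()  # soft-masked (lowercase) bases are counted too
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


def windowed_gc(seq: str, window: int) -> List[Optional[float]]:
    """GC fraction for consecutive non-overlapping windows of the sequence."""
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]


# Toy sequence: an all-GC window, an all-AT window, an all-N window, a mixed one.
toy = "GGCC" + "ATAT" + "NNNN" + "GCAT"
values = windowed_gc(toy, 4)
print(values)
```

Windows that return `None` consist only of ambiguous bases and would normally already be covered by the gap filter.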

Thanks for the answer about the telomeres/centromeres. I am indeed removing all the gaps (http://genome.ucsc.edu/cgi-bin/hgTrackUi?&c=chr17&g=gap) as well as the repeat regions. Should I also remove the "Regions of Exceptionally High Depth of Aligned Short Reads" (http://genome.ucsc.edu/cgi-bin/hgTrackUi?&c=chr17&g=hiSeqDepth)? What is a good threshold in this case?


Most of the regions with exceptionally high depth of aligned short reads should already be excluded once you remove RepeatMasker and gap regions. The coverage of those that remain depends on your library size, so it is best to plot several coverage histograms and choose a cutoff based on them.

However, I would not worry much about those regions for the purpose of copy number variation calling, because they would show similarly high coverage in all your cases and controls.

If you insist on removing those regions, do not rely on the definitions in the UCSC track; apply a coverage filter to your own data instead.
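A minimal sketch of such a data-driven filter, assuming you already have a mean depth per window for one sample: flag windows whose depth exceeds some multiple of the median. The function name, the `fold` default, and the depth values are all made up for illustration; the actual cutoff should come from inspecting your own coverage histograms.

```python
from statistics import median
from typing import List


def high_depth_windows(depths: List[float], fold: float = 3.0) -> List[int]:
    """Indices of windows whose depth exceeds `fold` times the median depth.

    `fold` is an illustrative default, not a recommended threshold.
    """
    if not depths:
        return []
    cutoff = fold * median(depths)
    return [i for i, d in enumerate(depths) if d > cutoff]


# Made-up per-window mean depths from one sample.
depths = [30.0, 28.0, 31.0, 250.0, 29.0, 33.0, 27.0, 120.0]
flagged = high_depth_windows(depths)
print(flagged)
```

Using the median rather than the mean keeps the cutoff from being pulled upward by the very outliers it is meant to catch.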


I am not doing copy number variation calling: I am just selecting genome regions to create a null distribution for a set of statistical analyses. That is, I have the whole genome (as in UCSC), but I need to extract only those regions that may include a CNV, so as not to bias the analyses in my favour. But yes, I got your point: regions with exceptionally high depth of aligned short reads are specific to the UCSC track (or to a particular experiment), not a general property of the genome!
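The thread does not say how the null regions are drawn, so the following is only one possible reading: sample fixed-size regions uniformly from the allowed intervals left after filtering, so that every valid placement is equally likely. The function name, interval coordinates, and region size are hypothetical.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def sample_region(allowed: List[Interval], size: int, rng: random.Random) -> Interval:
    """Draw one region of length `size` uniformly from all valid placements."""
    # An interval of length n admits n - size + 1 start positions (0 if too short).
    slots = [max(0, end - start - size + 1) for start, end in allowed]
    if sum(slots) == 0:
        raise ValueError("no allowed interval is long enough")
    # Pick an interval with probability proportional to its number of placements.
    (start, _end), n = rng.choices(list(zip(allowed, slots)), weights=slots, k=1)[0]
    offset = rng.randrange(n)
    return (start + offset, start + offset + size)


# Toy allowed intervals; the last one is too short to hold a 50 bp region.
allowed_regions = [(150, 400), (450, 900), (10, 12)]
rng = random.Random(0)  # fixed seed so the draws are reproducible
null_regions = [sample_region(allowed_regions, 50, rng) for _ in range(1000)]
print(null_regions[:3])
```

Weighting by the number of valid start positions, rather than choosing an interval at random first, avoids over-sampling short intervals.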

Thanks.
