Can someone help me come up with a step-by-step algorithm for NGS CNV analysis?
8.8 years ago

Problem: I want to analyze copy numbers for whole-genome next generation sequencing data, for non-human samples. Ideally, I'd like copy number information displayed in something like the UCSC Genome Browser, which means I need a copy number estimate for every window (e.g., assume a window size of 1000 bp) along the genome.

Attempted methods:

I start with bam files.

I've looked into a variety of tools (CNVnator, ReadDepth, cn.mops, etc.), but either 1) I've had trouble getting them to compile, or 2) they don't work well for non-human samples.

I have been using this thread as a base algorithm, i.e., as Irsan points out (I've put a rough R sketch of my take on steps 3 to 7 after the quoted steps):

  1. Define the genomic windows: bedtools makewindows ... | sort-bed - > yourGenomicWindows.bed
  2. Count how many reads with mapping quality greater than 35 map to each genomic window, for each sample. With the BEDOPS suite: bam2bed < yourSample.bam | awk '{if($5>35)print $0}' | bedmap --count yourGenomicWindows.bed - > yourSample.count
  3. Put the count files into one big matrix (one row per window, one column per sample)
  4. Import it into R (or any other environment where you can perform numerical operations)
  5. Normalize for library size: divide the count in each genomic window by the number of million mapped reads in that sample
  6. Get a baseline: calculate the median value of each genomic window across all samples (or across a subset of samples you are sure are copy number neutral). You will notice that for some genomic windows the median read count is 0; these are regions that are hard to sequence/map, usually near centromeres and telomeres. Just delete those windows, they are not informative.
  7. For each sample, divide the count in each window by the baseline count for that window and log2-transform. Be careful with samples that have homozygous deletions: they will have a count of 0, so log2(tumor/baseline) will be -Infinity. As a workaround, clip the values so the minimum and maximum in your data are, for example, -5 and 5.
  8. Segmentation ..."
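Here is roughly how I understand steps 3 to 7 in R; the file names and the -5/+5 clipping threshold are just placeholders, so this may not be exactly what Irsan meant:

    # Window coordinates, in the same order as the per-sample count files
    windows <- read.table("yourGenomicWindows.bed",
                          col.names = c("chrom", "start", "end"))

    # Step 3: one row per window, one column per sample
    count_files <- list.files(pattern = "\\.count$")
    counts <- sapply(count_files, function(f) scan(f, quiet = TRUE))

    # Step 5: normalize for library size (counts per million mapped reads)
    cpm <- sweep(counts, 2, colSums(counts) / 1e6, "/")

    # Step 6: baseline = per-window median across samples; drop windows
    # whose baseline is 0 (hard-to-map regions near centromeres/telomeres)
    baseline <- apply(cpm, 1, median)
    keep     <- baseline > 0
    cpm      <- cpm[keep, , drop = FALSE]
    windows  <- windows[keep, ]
    baseline <- baseline[keep]

    # Step 7: log2 ratio to the baseline, clipped to [-5, 5] to avoid -Inf
    log2ratio <- log2(sweep(cpm, 1, baseline, "/"))
    log2ratio[log2ratio < -5] <- -5
    log2ratio[log2ratio >  5] <-  5

    # One sample's ratios as a bedGraph track for the genome browser
    write.table(cbind(windows, round(log2ratio[, 1], 3)),
                "sample1_log2ratio.bedGraph", quote = FALSE, sep = "\t",
                row.names = FALSE, col.names = FALSE)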

However:

  • What is segmentation, and why is it useful?
  • How does this help me deal with large-scale deletions or inversions?

Also, this doesn't have to be particularly precise; I know highly accurate CNV algorithms are the stuff papers are made of :) I'm just looking for a somewhat-accurate starting point.

Thanks!

NGS CNV • 3.2k views
8.8 years ago
Eric T. ★ 2.8k

Fast mode, if you just want results: Use CNVkit (quick start guide here). After segmentation, use "cnvkit.py export bed" to create a BED file you can load in the UCSC Genome Browser. Let me know if you run into trouble with any of this.

Slow mode, if you just want to learn: Read the CONTRA paper. They use a straightforward method that is similar to what you're thinking, and the paper is well written and easy to follow. From there, read any review papers that interest you.

Segmentation is determining which genomic regions contain real copy number changes, and where the breakpoints occur, based on the bin- or probe-level data (e.g. exon or window read depths). Those breakpoints are the feature of biological interest, but because bin-level copy ratios are noisy, they are not obvious from the raw data.
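
If you end up rolling your own pipeline instead, circular binary segmentation (CBS) via the Bioconductor package DNAcopy is one common way to do this step. A rough sketch, assuming you have one sample's per-window log2 ratios and the matching window coordinates (the variable names here are placeholders, e.g. from the binning steps above):

    library(DNAcopy)

    # Per-window log2 ratios for one sample, plus window coordinates
    cna <- CNA(genomdat  = log2ratio[, 1],
               chrom     = windows$chrom,
               maploc    = windows$start,
               data.type = "logratio",
               sampleid  = "sample1")

    # Smooth single-bin outliers, then run CBS
    segs <- segment(smooth.CNA(cna), verbose = 0)

    # segs$output has one row per segment: chrom, loc.start, loc.end,
    # num.mark (number of windows) and seg.mean (mean log2 ratio), i.e.
    # a piecewise-constant copy number profile you can export as BED
    head(segs$output)

(CNVkit's segment step does essentially this for you by default, so in fast mode you don't need it.)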

Large-scale deletions should show up as copy number losses. Inversions won't be detected by this method, but *insertions* of copies of existing genomic sequence will show up as copy number gains in the original/source sequence region.
