Question: CNV analysis tool on exome data for NGS
9 votes · asked 6.4 years ago by subhajit06 (United States):

Dear all,

I have a question regarding copy number analysis on exome sequencing (NGS) data.

I have multiple BAM files (around 30) and some target regions that I want to check for copy number gains.

What would be the best way to do this? I am a newbie in this field, and there seem to be a lot of tools that do CNV analysis; I have no clue how to choose one and run the analysis.

thanks,

--Subhajit

bam cnv next-gen exome • 11k views
modified 6.0 years ago by Christian • written 6.4 years ago by subhajit06

Hi hershman, Jorge and Fred, thanks for your comments. I will try to play around with the software you all mentioned.

modified 9 months ago by RamRS • written 6.4 years ago by subhajit06

Lots of answers in these previous questions. (If there weren't a bunch of answers on this question already, I would have closed it as a duplicate)

modified 9 months ago by RamRS • written 6.4 years ago by Chris Miller
8 votes · answered 6.4 years ago by Irsan (Amsterdam):

It is really not that difficult; you can easily get genome-wide copy number estimates yourself, then perform CBS segmentation on those estimates if you have tumor samples. If you have non-tumor samples, you could also use an HMM-based segmentation program.

To get copy number estimates based on read depth, you have to compare genomic windows across samples. You cannot compare genomic windows within a sample unless you perform some smart normalization trained on the behaviour of the baits in your sample-prep kit. So you need samples to serve as a baseline for each of your 30 cases. In the best scenario you have matched data (e.g. tumor-normal pairs). If you don't have matched data, you have to create a baseline for each genomic window from the median of your 30 samples or, better, from a pool of samples that you are sure are copy-number stable and were sequenced with the same procedures.

So what you need to do is:

  1. Define the genomic windows: bedtools makewindows ... | sort-bed - > yourGenomicWindows.bed
  2. Count how many reads with mapping quality greater than 35 map to each genomic window, for each sample. With the bedops suite: bam2bed < yourSample.bam | awk '{if($5>35)print $0}' | bedmap --count yourGenomicWindows.bed - > yourSample.count
  3. Put the count files into one big matrix (windows as rows, samples as columns).
  4. Import it into R (or any other environment where you can perform numerical operations).
  5. Normalize for library size: divide the count in each genomic window by the number of million mapped reads of that sample.
  6. Get a baseline: calculate the median value for each genomic window across all samples (or across a subset of samples you are sure are copy-number neutral). You will notice that for some genomic windows the median read count is 0; these are regions that are hard to sequence/map, usually located near centromeres and telomeres. Just delete those windows, they are not informative.
  7. For each sample, divide the count of each window by the baseline count of that window and log2-transform. Be careful with samples that have homozygous deletions: they will have count 0, so log2(tumor/baseline) becomes -Infinity. As a solution, clip the values so that the minimum and maximum in your data are, for example, -5 and 5.
  8. Segmentation ...
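The numerical steps 5-7 above can be sketched compactly. The answer suggests R; here is a minimal equivalent sketch in Python/NumPy, where the function name `log2_ratios` and the toy counts are made up for illustration:

```python
import numpy as np

def log2_ratios(counts, total_reads, clip=5.0):
    """Per-window log2 copy-number ratios from a windows x samples count matrix.

    counts      : (n_windows, n_samples) raw read counts per genomic window
    total_reads : (n_samples,) total mapped reads per sample
    clip        : cap on log2 ratios, so homozygous deletions stay finite
    Returns (ratios, keep) where keep flags the informative windows.
    """
    counts = np.asarray(counts, dtype=float)
    # Step 5: normalize for library size (counts per million mapped reads)
    cpm = counts / (np.asarray(total_reads, dtype=float) / 1e6)
    # Step 6: baseline = median across samples for each window
    baseline = np.median(cpm, axis=1)
    # Drop windows with median 0 (hard-to-map regions, not informative)
    keep = baseline > 0
    cpm, baseline = cpm[keep], baseline[keep]
    # Step 7: log2(sample / baseline), clipped to avoid -Infinity at count 0
    with np.errstate(divide="ignore"):
        ratios = np.log2(cpm / baseline[:, None])
    return np.clip(ratios, -clip, clip), keep
```

With a toy matrix of three windows and three samples, a window whose median normalized count is 0 is dropped, and a window with 8x the baseline coverage in one sample comes out as log2 ratio 3, matching a copy number gain.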
modified 9 months ago by RamRS • written 6.4 years ago by Irsan

Thanks Irsan for your detailed answer. Let me figure out all the subparts of your answer :) as I am still learning all these tools these days.

written 6.4 years ago by subhajit06

Good luck! When you are done, perform loss of heterozygosity (LOH) analysis on your data as well.

written 6.4 years ago by Irsan

This approach might work well for whole genomes, but the problem on exomes is much harder to solve due to variation in depth of coverage.

modified 6.3 years ago • written 6.3 years ago by Leonor Palmeira

Indeed, copy number estimates obtained from whole genome sequencing are more accurate. Still, you can do good copy number and LOH analysis with exome data as described above. I have done it for tons of samples.

written 6.1 years ago by Irsan
4 votes · answered 6.4 years ago by hershman (Cambridge, Boston MA, United States):

They all suck: the data is too noisy. The folks at the Broad claim XHMM is the best, but I didn't get great results from it. I liked CoNIFER (I was able to use it successfully here), but watch out: the code has a few bugs (if I remember correctly, the depth data and probe locations can become misaligned).

modified 9 months ago by RamRS • written 6.4 years ago by hershman
3 votes · answered 6.4 years ago by Jorge Amigo (Santiago de Compostela, Spain):

When we started caring about CNVs, we knew that CNV detection from NGS data was not completely trustworthy, so we decided to go for the algorithm whose paper convinced us the most, and that was ExomeDepth. We have had great results with ExomeDepth, but unfortunately we don't have a success rate at the exome level to share. Since we do exome sequencing because it is more cost-effective than sequencing a bunch of genes related to a pathology, we then focus on those genes only for clinical purposes. The false positive rate seems quite high, I must admit, but the feeling we are getting (testing promising calls with MLPA and confirming some of them) is that the false negative rate is so low that we aren't missing the valuable ones. This is critical for clinical purposes, but such a high false positive rate may not be that useful for a proper full-exome analysis where you cannot limit the scope of your region of interest in advance.

modified 9 months ago by RamRS • written 6.4 years ago by Jorge Amigo

Do you have any idea where this high false positive rate comes from? Mismapping due to homologous pseudogenes, amplification issues, or something else?

written 6.0 years ago by Jimbou
2 votes · answered 6.0 years ago by Christian (Cambridge, US):

I got good results with exomeCopy, and with a few tweaks even on aneuploid samples. But it's a tricky problem. It's immensely helpful to have a large cohort of samples run on the same platform, both for better baseline estimation and for filtering out recurrent events.

written 6.0 years ago by Christian
1 vote · answered 6.4 years ago by Fred (Paris, France):

Maybe you can try Control-FREEC, which handles exome sequencing data.

modified 9 months ago by RamRS • written 6.4 years ago by Fred

Anyone here with experience running Control FREEC? How are the results?

written 6.4 years ago by Christian

It only works for exome data when you have matched samples. But when you have matched samples, copy number analysis is much easier anyway.

written 6.1 years ago by Irsan
1 vote · answered 6.4 years ago by Chris Whelan (Portland, OR):

This paper offers an evaluation of four commonly used tools (XHMM, ExomeDepth, CoNIFER, and CONTRA), and the results show how hard a problem this is:

http://onlinelibrary.wiley.com/doi/10.1002/humu.22537/abstract

modified 9 months ago by RamRS • written 6.4 years ago by Chris Whelan
Powered by Biostar version 2.3.0