CNV analysis tool on exome data for NGS
6
9
9.9 years ago
subhajit06 ▴ 110

Dear all,

I have a question regarding copy number analysis on exome sequencing (NGS) data.

I have multiple BAM files (around 30) and some target regions that I want to check for copy number gain.

What would be the best way to do it? I am a newbie in this field, and there seem to be a lot of tools that do CNV analysis. I have no clue how to choose one and run the analysis.

thanks,

--Subhajit

bam exome next-gen cnv • 13k views
0

Hi hershman, Jorge and Fred, thanks for your comments. I will try out the software you mentioned.

0

Lots of answers in these previous questions. (If there weren't a bunch of answers on this question already, I would have closed it as a duplicate)

8
9.9 years ago
Irsan ★ 7.8k

It is really not that difficult: you can easily get genome-wide copy number estimates yourself. Then perform CBS (circular binary segmentation) on those estimates if you have tumor samples; for non-tumor samples you could also use an HMM-based segmentation program.

To get copy number estimates based on read depth, you have to compare genomic windows across samples. You cannot compare genomic windows within a sample unless you perform some smart normalization trained on the behaviour of the baits of your sample prep kit. So you need samples to serve as a baseline for each of your 30 cases. In the best scenario, you have matched data (e.g. tumor-normal pairs). If you don't have matched data, you have to create a baseline for each genomic window based on the median of your 30 samples or, better, on a pool of samples that you are sure are copy-number stable and that were sequenced with the same procedures.

So what you need to do is:

  1. Define the genomic windows: bedtools makewindows ... | sort-bed - > yourGenomicWindows.bed
  2. For each sample, count how many reads with mapping quality greater than 35 map to each genomic window. With the BEDOPS suite: bam2bed < yourSample.bam | awk '{if($5>35)print $0}' | bedmap --count yourGenomicWindows.bed - > yourSample.count
  3. Put the count files into one big matrix (windows as rows, samples as columns).
  4. Import the matrix into R (or any other environment where you can perform numerical operations).
  5. Normalize for library size: divide the count in each genomic window by the number of millions of mapped reads in that sample.
  6. Get the baseline: calculate the median value for each genomic window across all samples (or across other samples that you are sure are copy number neutral). You will notice that for some genomic windows the median read count is 0. This means the region is hard to sequence or map. These windows are usually located near centromeres and telomeres; just delete them, they are not informative.
  7. For each sample, divide the count of each window by the baseline count of that window and log2-transform. Be careful with samples that have homozygous deletions: they will have count 0, so log2(tumor/baseline) becomes -Infinity. As a solution, clip the values so that the minimum and maximum in your data are, for example, -5 and 5.
  8. Segmentation ...
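
Steps 5 to 7 above could be sketched roughly as follows. This is only an illustration in plain Python (the answer suggests R, but any numerical environment works); the function name `log2_ratios`, its arguments, and the toy numbers are made up for this example and are not part of any tool mentioned in this thread. It assumes the per-window counts from step 2 have already been loaded into lists with the same window order for every sample.

```python
import math
from statistics import median


def log2_ratios(counts, mapped_millions, clip=5.0):
    """Hypothetical helper sketching steps 5-7.

    counts:          dict sample -> list of read counts, one per window
    mapped_millions: dict sample -> mapped reads of that sample, in millions
    clip:            log2 ratios are limited to [-clip, clip]
    Returns (indices of the windows kept, dict sample -> log2 ratios).
    """
    samples = sorted(counts)
    n_windows = len(counts[samples[0]])

    # Step 5: normalize each sample for library size.
    norm = {s: [c / mapped_millions[s] for c in counts[s]] for s in samples}

    # Step 6: baseline = per-window median across samples;
    # windows with a baseline of 0 are dropped as uninformative.
    baseline = [median(norm[s][i] for s in samples) for i in range(n_windows)]
    keep = [i for i, b in enumerate(baseline) if b > 0]

    # Step 7: log2(sample / baseline), with zero counts mapped to -clip
    # instead of -Infinity, and everything clipped to [-clip, clip].
    ratios = {}
    for s in samples:
        row = []
        for i in keep:
            r = norm[s][i] / baseline[i]
            value = math.log2(r) if r > 0 else -clip
            row.append(max(-clip, min(clip, value)))
        ratios[s] = row
    return keep, ratios


# Toy data: 3 samples, 4 windows, 1 million mapped reads each (invented numbers).
toy_counts = {"a": [100, 200, 0, 0], "b": [100, 100, 0, 50], "c": [100, 0, 0, 50]}
toy_mapped = {"a": 1.0, "b": 1.0, "c": 1.0}
kept, toy_ratios = log2_ratios(toy_counts, toy_mapped)
print(kept)
print(toy_ratios)
```

Using the cohort median as the baseline only makes sense if most samples are copy number neutral in a given window; with matched normals you would divide by the matched sample instead. The resulting log2 ratios are what you would feed into the segmentation step.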
0

Thanks, Irsan, for your detailed answer. Let me work through all the parts of it :) as I am still learning all these tools.

0

Good luck! When you are done, perform loss of heterozygosity (LOH) analysis on your data as well.

0

This approach might work well for whole-genomes, but the problem on exomes is much harder to solve due to variation in depth of coverage.

0

Indeed, copy number estimates obtained from whole-genome sequencing are more accurate. Still, you can do good copy number and LOH analysis on exome data with the approach described above. I have done it for tons of samples.

4
9.9 years ago
hershman ▴ 40

They all suck - the data is too noisy. The folks at the Broad claim XHMM is the best, but I didn't get great results from it. I liked CoNIFER (was able to use it successfully here), but watch out - the code has a few bugs (if I remember correctly, the depth data and probe locations can become misaligned).

3
9.9 years ago

When we started caring about CNVs we knew that CNV detection from NGS data was not completely trustworthy, so we decided to go for the algorithm whose paper convinced us the most, and that was ExomeDepth. We have had great results using ExomeDepth, but unfortunately we don't have any exome-wide success rate to share. Since we do exome sequencing because it is more cost-effective than sequencing a bunch of genes related to a pathology, we then focus only on those genes for clinical purposes. The false positive rate seems to be quite high, I must admit, but the feeling we are getting (testing promising calls with MLPA and confirming some of them) is that the false negative rate is so low that we aren't missing the valuable ones. This is critical for clinical purposes, but a high false positive rate may not be that useful for proper full-exome analysis, where you cannot limit the scope of your region of interest in advance.

0

Do you have any idea where this high false positive rate comes from? Mismapping due to homologous pseudogenes, amplification issues, or something else?

2
9.4 years ago
Christian ★ 3.0k

Got good results with exomeCopy, with a few tweaks even on aneuploid samples. But it's a tricky problem. It's immensely helpful to have a large cohort of samples run on the same platform, for better baseline estimation and for filtering recurrent events.

1
9.9 years ago
Fred ▴ 780

Maybe you can try Control-FREEC, which handles exome sequencing data.

0

Anyone here with experience running Control-FREEC? How are the results?

0

It only works for exome data when you have matched samples. But when you have matched samples, copy number analysis is much easier anyway.

1
9.9 years ago
Chris Whelan ▴ 570

This paper offers an evaluation of four commonly used tools (XHMM, ExomeDepth, CoNIFER, and CONTRA). The results show how hard a problem it is:

http://onlinelibrary.wiley.com/doi/10.1002/humu.22537/abstract
