CNVkit for germline samples
Entering edit mode
6.0 years ago

Dear all, I am trying to use CNVkit to detect copy number variant in my samples. My sequencing data is derived from peripheral blood so I am looking for germline copy number variants. I don't have tumor sample, only germline.

I followed the suggestions of the documentation and here is the flow of the pipeline that I used to analyse my data:

python3 /usr/local/cluster/bin/ batch germline_sample.bam --n --targets brca_slop.bed --fasta hg19.fa --output-reference germline_sample.cnn --output-dir cnvkit_brca/
python3 /usr/local/cluster/bin/ segmetrics -s{s,r} --ci
python3 /usr/local/cluster/bin/ call germline_sample.segmetrics.cns --filter ci -o germline_sample.calls.cns

As test I used a sample that I know has a deletion of the exon 23 in BRCA1 gene. But I did not found any result:

python3 /usr/local/cluster/bin/ breaks germline_sample.cnr germline_sample.calls.cns 
Found 0 gene breakpoints
gene    chromosome      location        change  probes_left     probes_right
python3 /usr/local/cluster/bin/ gainloss germline_sample.cnr -s germline_sample.calls.cns 
*WARNING* No chrX found in probes; check the input
Treating sample gremline_sample as male
*WARNING* No chrX found in probes; check the input
*WARNING* No chrX found in probes; check the input
*WARNING* No chrX found in probes; check the input
Found 2 gene-level gains and losses
gene    chromosome      start   end     log2    depth   weight  cn      probes
BRCA2   chr13   32890547        32972958        -0.243229       418.202 36.7767 2       269
BRCA1   chr17   41197644        41276164        -0.795891       284.416 22.3257 1       309

Definitely I am missing something. Is my flow correct? Can someone help me in understand what is wrong and how I have to use CNVkit on my data? Thanks for the help.


next-gen targetsequencing cnvkit germline • 5.7k views
Entering edit mode
6.0 years ago
Eric T. ★ 2.8k

Your workflow is correct, but if this is exome sequencing data, then CBS segmentation won't be able to pick up a single-exon deletion. However, you can still see each bin's log2 value in the scatter plots, if you're reviewing genes of interest that way.

I've added a script to perform a simple per-bin test on the bins in a .cnr file:

(Pull the latest from the GitHub repo to ensure it all works as expected.)

This test will perform best if you constructed a pooled reference from multiple normal samples (the more the better). The output is a listing of the bins that have a log2 value farther from the neutral value than would be expected by chance, based on the reference spread (back-calculated from the .cnr "weight" column). Use the -t option to ignore off-target bins, and -a to specify a significance threshold other than 0.005.

python germline_sample.cnr -t -o germline_sample.ztest.cnr

It will also tend to report many of the same genes that were detected in your .cns file, so try comparing the two.

Entering edit mode

Hi Eric, Many thanks for your answer. I am not using whole exome data but I have shorter sequencing panel. I tried to analyse data as you suggested creating a pooled reference using the other samples as normal samples (in total 12 samples)

python3 /usr/local/cluster/bin/ batch germline_sample.bam --normal BAM_NORMAL/*.bam --targets brca_slop.bed --fasta hg19.fa --output-reference brca.reference.cnn --output-dir cnvkit_brca/

Then I used your new script to investigate log2 value, but I obtained 0 significant results. I also tried increasing the threshold with -a argument:

python3 ../cnvkit-master/scripts/ germline_sample.cnr -t -a 0.05 -o germline_sample.ztest.cnr
*WARNING* No chrX found in probes; check the input
Treating sample germline_sample as male
Ignoring 241 off-target bins
Significant hits in 0/86 bins (0%)

Do you think that it is a problem related with the limited dimension of the sequencing panel? Or, as you already suggested, the deletion is too small to be found by CNVkit. If CNVkit is not the correct tool to use in my case, do you have any suggestion about other tools?

Thank you so much for the help.


Entering edit mode

The small panel size shouldn't be a problem. But this test isn't especially sensitive. A couple questions:

  • Was this sequencing done by hybrid capture or targeted amplicon sequencing? The small panel size and small number of off-target bins suggests the latter. If so, use the batch command option --method amplicon to reduce the chance of errata in your output files.

  • What do you see if you use an unreasonably high alpha, e.g. -a 0.9? The output .cnr file will have an extra field called ztest, listing the p-values for each bin. If these are all 1.0, then something went wrong -- check brca.reference.cnn to see if the spread column there is all zeros, because that would mean the pooled reference was not built properly and the observed variance at each bin was not recorded for the Z-test. If you do see decimal p-values in your output .cnr file, you can jump to BRCA1 exon 23 and see the p-value there.

Note that this script adjusts p-values for multiple hypothesis tests, so it's possible that the log2 read depth ratio at BRCA1 exon 23 was not significantly below the range of normal samples at that site. You can do some of the math by hand: in brca.reference.cnn, find the bin of interest, then look at the values in the log2 and spread columns. If the log2 value in your test sample was less than the corresponding reference log2 value minus twice the spread value, then the Z-test has a chance of picking it up, otherwise the script will probably miss it.

Entering edit mode

One more thought: Since BRCA1 is already detected as hemizygous in this sample, you can use the -s option to to make the p-values relative to the surrounding segment, rather than to the genome-wide neutral copy number value (i.e. diploid). That will actually make the test less sensitive to a deep deletion within BRCA1, but in any case, it's relevant.

You should still be able to see the BRCA1 exon 23 in a scatter plot: scatter germline_sample.cnr -s germline_sample.cns -g BRCA1

If it's not showing up there, that bin may have been filtered out earlier in the pipeline for reasons that may be clear if you post the corresponding row in brca.reference.cnn here.

Entering edit mode

Hi Eric, sorry for my late answer!

  1. The panel is not an amplicon but hybrid capture. It is the Illumina trusight cancer panel.
  2. Also using high value of alpha in I obtain 0 significant results, for this reason no output is created.

I did again the analysis, this time using the whole sequencing panel and not only the BRCA regions. I also renamed the germline.sample file with the code 1294-15. Here are my commands:

python3 /usr/local/cluster/bin/ batch 1294-15_S3.bam --normal BAM_NORMAL/*.bam --targets trusight_cancer_manifest_slop.bed --annotate ../refFlat.txt --fasta /storage/genomes/hg19/fa/hg19_masked.fa --access ../access-5kb-mappable.hg19.bed --output-reference 1294-15_S3.reference --output-dir cnvkit_brca/ --diagram --scatter
python3 /usr/local/cluster/bin/ segmetrics -s cnvkit_brca/{s,r} --ci
python3 /usr/local/cluster/bin/ call 1294-15_S3.segmetrics.cns --filter ci -t=-2,-0.41,0.32,0.80,1.16 -o 1294-15_S3.calls.cns
python3 /usr/local/cluster/bin/ breaks cnvkit_brca/1294-15_S3.cnr 1294-15_S3.calls.cns
Found 0 gene breakpoints
gene    chromosome     location change  probes_left     probes_right
python3 ../cnvkit-master/scripts/ cnvkit_brca/1294-15_S3.cnr -s 1294-15_S3.calls.cns -t -a 0.9 -o cnvkit_brca/1294-15_S3.ztest.cnr
Ignoring 13806 off-target bins
Significant hits in 0/1942 bins (0%)
Entering edit mode

I looked at 1294-15_S3.reference and the spread column in the BRCA1 region, values goes from 0.07 to 0.23. This are the rows in the 1294-15.reference:

chromosome      start  end      gene    log2    depth   gc      rmask   spread
chr17   41197644       41197870 BRCA1   -0.233667       402.224 0.548673                0.122524
chr17   41199609       41199771 BRCA1   0.103046        482.229 0.512346                0.171214
chr17   41201087       41201262 BRCA1   1.06501 972.797 0.44            0.114292
chr17   41203029       41203185 BRCA1   -1.05646        214.986 0.519231                0.231122
chr17   41209018       41209203 BRCA1   -0.426566       319.434 0.454054                0.106979
chr17   41215299       41215441 BRCA1   0.789963        706.815 0.359155                0.0867988
chr17   41215840       41216019 BRCA1   -0.914717       294.176 0.379888                0.15975
chr17   41219574       41219763 BRCA1   0.00523668      476.821 0.359788                0.0815208
chr17   41222894       41223100 BRCA1   -0.119592       578.87  0.451456                0.0712118
chr17   41223100       41223306 BRCA1   -0.708345       379.228 0.441748                0.124703
chr17   41226297       41226589 BRCA1   -0.2294 533.487 0.428082                0.116091
chr17   41228454       41228682 BRCA1   -0.178406       406.412 0.355263                0.12438
chr17   41234370       41234643 BRCA1   -0.57327        409.311 0.421245                0.150709
chr17   41242910       41243100 BRCA1   -0.190733       413.064 0.452632                0.101584
chr17   41243401       41243672 BRCA1   -0.376218       515.847 0.413284                0.0922301
chr17   41243672       41243943 BRCA1   -0.34255        704.511 0.405904                0.0811848
chr17   41243943       41244214 BRCA1   -0.295492       770.875 0.398524                0.0890704
chr17   41244214       41244486 BRCA1   -0.412394       618.696 0.363971                0.0905656
chr17   41244486       41244757 BRCA1   -0.459422       645.29  0.369004                0.0805406
chr17   41244757       41245028 BRCA1   -0.419711       677.113 0.394834                0.100809
chr17   41245028       41245300 BRCA1   -0.376147       654.103 0.400735                0.0958498
chr17   41245300       41245571 BRCA1   -0.300744       725.954 0.391144                0.0834496
chr17   41245571       41245842 BRCA1   -0.462248       630.515 0.365314                0.0841404
chr17   41245842       41246114 BRCA1   -0.312216       685.552 0.360294                0.0558915
chr17   41246114       41246385 BRCA1   -0.114085       876.669 0.391144                0.0821786
chr17   41246385       41246656 BRCA1   -0.121144       798.206 0.409594                0.0919309
chr17   41246656       41246928 BRCA1   -0.529059       523.102 0.382353                0.122759
chr17   41247812       41247990 BRCA1   -0.617907       328.139 0.404494                0.190037
chr17   41249210       41249357 BRCA1   -0.225694       352.001 0.306122                0.0940648
chr17   41251741       41251948 BRCA1   -1.1039 205.166 0.410628                0.08851
chr17   41256088       41256329 BRCA1   1.05918 929.17  0.327801                0.0846479
chr17   41256834       41257024 BRCA1   -0.589004       317.016 0.368421                0.124643
chr17   41258422       41258601 BRCA1   -1.10625        221.562 0.318436                0.0995694
chr17   41267692       41267847 BRCA1   -0.764837       261.941 0.387097                0.147588
chr17   41275983       41276164 BRCA1   -1.17077        207.1   0.325967                0.151626

The bin of interest is this one:

chr17   41197644       41197870 BRCA1   -0.233667       402.224 0.548673                0.122524

As you can see the dimension of the exon is 226 bp, probably as you suggested before it is too small to be detected by the method. Otherwise I was wondering if could be a coverage problem even if it is high because it is about 200X. I also uploaded the scatter plot: BRCA1 scatter plot Following the documentation and your suggestions, the flow of the analysis seems correct to me. What do you think? Is there something that I am missing? At the moment I don't have other known samples to use as test for the analysis pipeline, but if you think that it is correct I will try this pipeline using all the samples we sequenced and see if something came up. Thanks for your support and sorry for the late answer!

Entering edit mode

It looks like your link to the BRCA1 scatter plot didn't work, could you post the link directly?

In your sample .cnr file, can you post the bin corresponding to exon 23 (matching the reference.cnn bin you posted)? I notice that in the reference, the baseline log2 value is about -.23 and the spread is 0.55, so as a rule of thumb the log2 value of a bin to be flagged by the z-test would need to be below (-.23 - 2*.55) = -1.33. With the multiple-test correction for 1942 targeted bins in your panel, it needs to be far below that. Is it?

As a sanity check, you can also look at the "depth" column relative to your overall on-target coverage (~200x). Hemizygous regions in a germline ought to be on the order of 100x coverage, while homozygous deletions should be 0 or very close to it. Do you see that in your data?

Entering edit mode

Hi Eric, here is the link of the scatter plot. I hope it now works. Here is the bin in the sample.cnr file

chromosome      start   end     gene    depth   log2    weight
chr17   41197644       41197870 BRCA1   439.704 -0.122365       0.719729

As you can see the log2 value is -0.12 and the coverage reported is quite high (439). It seems that the data is not suggesting the deletion, even if we know it is there... I also tried to use the gainloss as suggested in the doc to find the list of trusted genes :

python3 /usr/local/cluster/bin/ gainloss cnvkit_brca/1294-15_S3.cnr -s cnvkit_brca/1294-15_S3.cns -m 3 | tail -n+2 | cut -f1 | sort > segment-gainloss.txt
python3 /usr/local/cluster/bin/ gainloss cnvkit_brca/1294-15_S3.cnr -m 3 | tail -n+2 | cut -f1 | sort >  ratio-gainloss.txt
comm -12 ratio-gainloss.txt segment-gainloss.txt > trusted-gainloss.txt
cat trusted-gainloss.txt 

Does this means that actually cnvkit is finding something on BRCA1? If so, is the corresponding bin the one returned by

python3 /usr/local/cluster/bin/ gainloss cnvkit_brca/1294-15_S3.cnr -m 3 
gene    chromosome      start   end     log2    depth   weight  probes
BRCA1   chr17   41197644        41276164        -0.363556       439.704 24.014  459

If yes, it seems to be the result that I expected but why breaks does not find anything? Sorry to bother you but can you please better clarify the columns of gainloss result and how to interpret the scatter plot? I want to apply CNVkit to another set of data in which I do not know anything so I am trying to define a general pipeline of analysis and how to interpret the result of CNVkit to go further with investigation. Is there a sort of "rule" that I can apply that can help me in finding potential interesting bins?

Many many thanks for the help and the support.


Entering edit mode

The gainloss command doesn't do special calculations, it just slices up the .cnr and .cns files so that bins are grouped by gene name -- see here. The default threshold is a permissive log2 amplitude of +/- 0.2.

I suggest loading your input BAM in IGV and looking at the read pileup at this exon to see whether it suggests a deletion. Are there reads mapped there? Does the pileup have the same shape as other captured exons? Are the reads properly mapped, with reasonably high mapping quality scores?

You can see from the 'depth' column that in the bin chr17:41197644-41197870, your input BAM file had a depth of coverage comparable to other diploid genomic regions. That bin covers an exon and about 40bp of flanking regions on either side. If the deletion you detected through some other experimental method only covers a portion of the exon (maybe 20% of the bin), then the total number of sequencing read bases mapped to that bin could be closer to the genome-wide average despite a small deleted region within it.

Other possibilities:

  • If this bin contains repetitive sequence, has low mappability, or is similar to a pseudogene or other region of the genome, your sequencing data may have reads spuriously mapped to this site. That would result in a higher read depth estimate here even if the exon itself was deleted from the sample's genome.
  • There are multiple transcripts of BRCA1, and in some of them this exon is labeled 22 or 24. The bin chr17:41199609-41199771 is exon 23 in another transcript, though that one doesn't show a deletion either. So this probably isn't the problem, although I note there are other bins in your data that suggest hemizygous loss -- one is visible in your scatter plot.
Entering edit mode

Hi Eric, I aligned my samples to another reference, 1000G phase 3 (hs37d5), and I also applied BAM recalibration as suggested by GATK. In the end I applied the commands for cnvkit as I described above. The results are not so different from the others obtained before. Here is the new scatter

And here is the IGV screenshot:

looking at the .cnr file:

chromosome      start   end     gene    depth   log2    weight
chr17   41197644        41197870        BRCA1   284.416 -0.216939       0.717904
chr17   41199609        41199771        BRCA1   191.173 -1.02741        0.65278
chr17   41201087        41201262        BRCA1   540.331 -0.351247       0.610098
chr17   41203029        41203185        BRCA1   204.763 0.0380506       0.519382
chr17   41209018        41209203        BRCA1   281.151 -0.120488       0.633143
chr17   41215299        41215441        BRCA1   392.958 -0.21303        0.625686
chr17   41215840        41216019        BRCA1   210.184 -0.367901       0.603211
chr17   41219574        41219763        BRCA1   352.011 -0.275293       0.709543
chr17   41222894        41223100        BRCA1   402.102 -0.162655       0.72259
chr17   41223100        41223306        BRCA1   260.155 -0.280256       0.649862
chr17   41226297        41226589        BRCA1   401.932 -0.169354       0.806763
chr17   41228454        41228682        BRCA1   316.868 -0.0728314      0.73462
chr17   41234370        41234643        BRCA1   301.883 -0.188909       0.693081
chr17   41242910        41243100        BRCA1   268.932 -0.387088       0.682279
chr17   41243401        41243672        BRCA1   386.672 -0.198129       0.756574
chr17   41243672        41243943        BRCA1   507.321 -0.212978       0.776926
chr17   41243943        41244214        BRCA1   552.79  -0.0717992      0.776312
chr17   41244214        41244486        BRCA1   480.423 -0.225676       0.764032
chr17   41244486        41244757        BRCA1   511.284 -0.171162       0.766196
chr17   41244757        41245028        BRCA1   527.613 -0.0577554      0.760645
chr17   41245028        41245300        BRCA1   522.974 0.00286892      0.763092
chr17   41245300        41245571        BRCA1   523.694 -0.133811       0.775524
chr17   41245571        41245842        BRCA1   513.97  -0.143757       0.757546
chr17   41245842        41246114        BRCA1   532.996 -0.134984       0.786801
chr17   41246114        41246385        BRCA1   613.978 -0.0944476      0.803505
chr17   41246385        41246656        BRCA1   586.827 -0.0862513      0.799662
chr17   41246656        41246928        BRCA1   421.11  -0.125031       0.746117
chr17   41247812        41247990        BRCA1   230.045 -0.390317       0.571434
chr17   41249210        41249357        BRCA1   296.503 -0.0602397      0.642597
chr17   41251741        41251948        BRCA1   168.329 -0.249064       0.61157
chr17   41256088        41256329        BRCA1   606.427 0.0144035       0.67774
chr17   41256834        41257024        BRCA1   272.937 -0.17821        0.649028
chr17   41258422        41258601        BRCA1   188.441 -0.265598       0.606485
chr17   41267692        41267847        BRCA1   212.348 -0.150345       0.590012
chr17   41275983        41276164        BRCA1   177.884 -0.216486       0.580759

the bin of interest is

chr17   41199609        41199771        BRCA1   191.173 -1.02741        0.65278

And that is the exon 22 for some transcripts and the exon 23 for others, as you suggested. The deletion was confirmed with MLPA technique but in the report that I have for this sample it is indicated only the exon number with the deletion but it is not declared respect to which transcript.... The data supporting the deletion is not so strong but I think that I am ready to try cnvkit to unknown samples :) Last doubt on scatter plot: the yellow vertical line is the region of the gene while the orange horizontal line is the segment in the .cns file,is it correct? Again many thanks for the help.


Entering edit mode

Glad to see it worked for you. Yes, you're reading the plot correctly -- the orange line shows the weighted mean of the bin log2 values within a copy number segment, and the yellow span shows the coordinates of BRCA1 as the bins are labeled in the .cnr file (which is not necessarily identical to the gene boundaries in RefSeq, just the first and last .cnr bins labeled "BRCA1").

Does the z-test script give this bin a persuasively small p-value now?

It looks like this is a very focal hemizygous loss, which is generally difficult to detect from read depth alone. If you have SNPs called within this exon, you can also look at allele frequencies and phasing to confirm that one allele was preferentially lost.

Entering edit mode

Hi Eric, Even in this case the z-test result is 0:

[smerella@srv2-slim cnvkit_brca]$ python3 ../cnvkit-master/scripts/ cnvkit_brca/1294-15_S3_RGSorted.cnr -s 1294-15_S3.calls.cns -t -a 0.9 -o cnvkit_brca/1294-15_S3_RGSorted.ztest.cnr
Ignoring 15118 off-target bins
Significant hits in 0/1942 bins (0%)

As you said is not an easy deletion to detect and I have not SNPs called in the same region...

Again many thanks!!


Entering edit mode

Hi Eric,

This link

Does not work. Would it be possible, please, to share your code with me too?


Entering edit mode

Thanks for catching this, I've updated the link above. It's now:

Entering edit mode

Hi Eric,

This link isn't working, are you still sharing this with the community? Would you also have a sample.bam that can be used to verify exon deletions for a sanity check?

Many thanks


Login before adding your answer.

Traffic: 1602 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6