Tool:ClinCNV: CNV detection from short reads
3.0 years ago

Dear community members,

we've prepared a tool for CNV detection (another one) called ClinCNV. It has already been used to analyse around 5,000 samples sequenced on different platforms, and the results are quite good. We also benchmarked it and found that the tool is at least not worse than its competitors in the germline context and works better in the somatic context (using False Discovery Rate and concordance as metrics). You can check out a short presentation of the tool here (around 60 slides).

The tool uses cohorts of samples and read depth (plus BAF for somatic calling). It has quite a lot of features, such as clustering of samples prior to analysis, IGV visualization, calling of polymorphic regions, mosaic CNV calling, different options for FDR control, etc. For a quick overview I'd recommend going directly to the docs. Try the test run with the command from here.

One limiting factor may be that we used ngs-bits for file preparation. However, it is an easy-to-install package, it is fast, and it has many useful features.

UPD: the preprint on the somatic part of ClinCNV is here. Please criticize it. https://www.biorxiv.org/content/10.1101/837971v1

UPD2: ClinCNV's germline CNV detection procedure and results were not published in any form. FIXED, see below.

UPD3: Tumor-only calling is implemented. It still requires approximately 20 normal samples sequenced with the same enrichment kit, and it is highly recommended to use it with BAF files and off-target reads. Limitations: less than 50% of the genome affected by CNVs, purity > 30%, no polyploidies. In summary: fine for blood cancers, maybe not good for 50% of solid tumors. It is still an experimental feature; if the results are unsatisfactory, send them to me and we can decide what to improve.

UPD4: The germline CNV calling preprint is on bioRxiv and is citable: https://www.biorxiv.org/content/10.1101/2022.06.10.495642v1


Thanks a lot, Kevin!


Hey, I was preparing the files, and the manual says:

"Then you need to merge your ".cov" files into one table. To do this, you can use the mergeFilesFromFolder.R script provided with ClinCNV, using input_folder and output_folder as variables to keep your absolute paths:

Rscript mergeFilesFromFolder.R -i $input_folder -o $output_folder"

But --help shows the following:

-o CHARACTER, --out=CHARACTER
output file name [default= out.txt]

So the help text is right (-o takes an output file name) and the manual is not.

Also, could you make the script merge only the .cov files rather than every file in the folder? If it is not very hard, I think it would be good to allow wildcards: Rscript mergeFilesFromFolder.R -i *.cov -o batch.txt
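
The request above (merge only the .cov files of a folder into one table) can be sketched outside of ClinCNV. This is not mergeFilesFromFolder.R and does not describe how that script works. It is a minimal Python illustration which assumes that each .cov file is tab-separated with one header line and the columns chr, start, end, coverage, and that all files list the same regions in the same order. The function name merge_cov_files and this column layout are assumptions made for the example.

```python
import csv
from pathlib import Path


def merge_cov_files(input_folder, output_file):
    """Merge per-sample coverage files (*.cov only) into one table.

    Assumed layout (hypothetical): tab-separated, one header line,
    columns chr, start, end, coverage; identical regions in every file.
    Returns the list of sample names (file names without extension).
    """
    # Only files ending in .cov are picked up; anything else is ignored.
    cov_paths = sorted(Path(input_folder).glob("*.cov"))
    if not cov_paths:
        raise ValueError(f"no .cov files found in {input_folder}")

    regions = None  # (chr, start, end) tuples taken from the first file
    columns = []    # one list of coverage values per sample
    for path in cov_paths:
        with open(path, newline="") as handle:
            rows = list(csv.reader(handle, delimiter="\t"))
        body = [row for row in rows[1:] if row]  # skip header and blank lines
        coords = [tuple(row[:3]) for row in body]
        if regions is None:
            regions = coords
        elif coords != regions:
            raise ValueError(f"{path.name} lists different regions")
        columns.append([row[3] for row in body])

    samples = [path.stem for path in cov_paths]
    with open(output_file, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["chr", "start", "end"] + samples)
        for i, coord in enumerate(regions):
            writer.writerow(list(coord) + [column[i] for column in columns])
    return samples
```

Calling merge_cov_files("cov_folder", "batch.txt") would then write one column per sample; the folder and file names here are placeholders.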


Thanks a lot! Will fix it on Monday


I've been trying to get past this error for nearly an hour. I think I can beat it, but I have to leave now. Maybe you can help me.

[1] "Percentage of regions remained after GC correction: 0.957518796992481"
Error in gcNormalisedCov[which(!bedFile[, 1] %in% c("chrX", "chrY")),  :
subscript out of bounds
Calls: writeOutLevelOfNoiseVersusCoverage -> apply
Execution halted

I processed files obtained with TruSightCardioSeqKit (aligned to GRCh37_latest_genomic.fna). And yes, I think I don't have chrY in my dataset.
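
Before digging into the R traceback, a quick sanity check is to compare the chromosomes listed in the BED file with those in the coverage table. The Python sketch below only counts regions per chromosome and reports whether chrX or chrY is absent. It does not reproduce anything ClinCNV does internally, and it does not establish whether a missing chrY is the actual cause of the error. The function names and the assumption that the coverage table has exactly one header line are made up for this illustration.

```python
from collections import Counter


def chromosome_counts(lines, has_header=False):
    """Count regions per chromosome in whitespace-separated lines."""
    counts = Counter()
    for i, line in enumerate(lines):
        if has_header and i == 0:
            continue  # skip the column names
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def compare_tables(bed_lines, cov_lines):
    """Report missing sex chromosomes and per-chromosome count mismatches.

    Assumes (hypothetically) that the BED file has no header and that the
    coverage table has exactly one header line.
    """
    bed = chromosome_counts(bed_lines)
    cov = chromosome_counts(cov_lines, has_header=True)
    chromosomes = sorted(set(bed) | set(cov))
    return {
        "missing_sex_chromosomes": sorted({"chrX", "chrY"} - set(bed)),
        # Counter returns 0 for chromosomes absent from one of the tables.
        "region_count_mismatch": {
            c: (bed[c], cov[c]) for c in chromosomes if bed[c] != cov[c]
        },
    }
```

With real files, the two arguments could be the results of open(path).read().splitlines() for the BED file and for the merged coverage table.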

#################################

simple_command (i made the simplest one for first run on my data)

Rscript ~/progs/ClinCNV/clinCNV.R --normal $Files/batch_1.cov --bed $Path/gcAnnotated.extended_trusight.bed --out $Files/RES --numberOfThreads 8

Below:

1. head -n 10 my_bed.bed; tail -n 10 my_bed.bed

chr1 2985722 2985960 0.7017
chr1 3102587 3103138 0.6316
chr1 3160549 3160801 0.5556
chr1 3301612 3301950 0.5888
chr1 3303139 3303360 0.3801
chr1 3310955 3311158 0.7094
chr1 3312953 3313257 0.6151
chr1 3319253 3319662 0.6381
chr1 3321201 3321550 0.6476
chr1 3321957 3322312 0.7014
......
chrX 153607743 153608479 0.6644
chrX 153608492 153608827 0.5851
chrX 153609011 153609657 0.6022
chrX 153640079 153640651 0.6958
chrX 153641442 153641693 0.6096
chrX 153641717 153642004 0.6202
chrX 153642336 153642627 0.5258
chrX 153647780 153648185 0.5926
chrX 153648269 153648703 0.6221
chrX 153648895 153649443 0.6150

2. head -n 5 batch_1.cov; tail -n 5 batch_2.cov

X.chr start end X100_S3_Srt X102_S5_Srt X104_S7_Srt X106_S4_Srt X107_S9_Srt X108_S10_Srt X109_S11_Srt X110_S5_Srt X111_S6_Srt X113_S8_Srt X114_S9_Srt X116_S10_Srt X117_S11_Srt X125_S2_Srt X127_S3_Srt X129_S4_Srt X130_S5_Srt X131_S6_Srt X132_S7_Srt X133_S8_Srt X135_S9_Srt X136_S10_Srt X137_S11_Srt X139_S12_Srt X17_S2_Srt X23_S5_Srt X32_S4_Srt X52_S1_Srt X86_S2_Srt B_S12_Srt ry_S12_Srt
chr1 112318597 112319000 77.273 124.1538 120.196 27.2283 137.1762 137.6774 143.9801 26.3077 44.1663 28.0819 79.0943 37.2357 47.5509 108.5236 87.34 147.5881 79.5186 70.3871 100.9355 30.3772 129.4888 153.3052 90.6998 126.866 115.6725 68.5782 120.9082 114.8635 46.7395 52.9504 82.603
chr1 112319546 112319995 83.412 116.5367 115.8998 35.1849 107.0111 105.4454 127.0022 26.92247.9599 30.7461 100.5323 42.6303 49.9844 56.6414 66.5234 89.5278 55.8842 56.098 70.5702 23.2138 84.902 121.3163 61.9198 79.0757 119.4053 61.0913 138.4454 97.4232 52.4365 53 71.0045
chr1 112320956 112321214 57.124 111.7713 79.3837 22.593 88.5155 96.3605 114.5349 21.2364 38.155 12.2519 66.7907 23.5426 40.8295 71.0116 65.1822 64.062 43.4612 47.1124 78.155 24.1434 70.2442 85.6822 58.8837 54.2326 81.1705 46.4264 93.1008 65.2326 41.0659 32.8915 61.0543
chr1 112322745 112323036 109.0997 122.4158 125.433 48.9588 119.6632 116.6186 145.1684 36.7938 55.9072 38.3162 110.3196 48.0309 70.3643 85.457 65.7182 102.7148 57.1581 50.2749 83.9897 17.1031 79.8247 106.9244 85.1512 98.2027 98.6529 62.2027 109.9966 134.9485 74.5223 71.677 71.866
....
chrX 32867743 32868037 44.5816 61.1599 54.1429 30.1361 68.0204 122.5442 67.8163 9.6599 27.7211 7.1769 37.2517 35.2313 36.2755 43.0306 30.9966 31.5612 27.5816 50.4014 38.5986 26.8776 25.6769 57.3061 69.7891 30.051 52.8299 52.3639 41.0646 32.381 34.53421.2483 74.5102
chrX 33038154 33038417 35.711 107.3992 104.365 32.7452 108.3954 168.0798 78.057 8.4335 20.7376 8.045646.6882 31.5247 42.384 102.7224 47.8669 86.5247 62.7224 104.9696 97.1711 45.1977 80.4259 164.0875 176.8327 35.1255 56.5323 81.365 79.0951 32.4563 24.6388 13.3612 109.8175
chrX 33146162 33146382 65.8364 114.0682 107.6 48.7227 136.7 190.2455 110.3227 19.6545 35.5818 26.9364 89.5682 73.5955 81.7864 74.1 31.6636 48.15 42.4227 63.0727 78.4136 36.5591 52.2864 130.5909 144.1955 49.9455 94.1136 148.0182 102.3182 83.4455 48.0136 38.7682 126.65
chrX 33229297 33229529 43.3793 77.8534 79.5259 32.6121 84.7672 128.0345 79.2629 15.7974 23.2328 12.1983 36.9353 26.4569 44.2241 57.9828 35.8448 40.0517 33.5043 79.9526 58.3534 18.8578 51.2198 97.6681 105.6595 30.0259 65.0948 78.3103 48.8966 41.3922 27.4397 15.8922 96.7543
chrX 33357274 33357482 43.4183 98.2356 69.0529 34.0288 83.2548 155.5817 81.6442 8.5481 36.6731 15.9567 47.4519 36.0337 61.3413 54.2452 39.9567 58.2837 34.7596 72.8221 62.0721 32.8798 56.0385 117.0337 123.8413 51.4038 74.3077 78.7692 77.7548 55.899 23.7404 24.7452 96.7163

P.S. Biostars makes a hot mess of this post when I publish it; I don't know how to keep the table view of the data.

Hey, I tidied your code and output via the 101 010 button.

Tidied again

Oh, thanks, now I see how the magic 101 010 button works :) Sorry for the mess, I think these are my first posts on Biostars.

ClinCNV for now does not like small gene panels, mainly due to lack of testing: we simply have not included small panels in our test routine. ClinCNV prefers bigger panels because it performs GC and length normalization, which is not so easy on small panels. I'll work on it again on Monday, but here is what you can try right now: divide your on-target BED file into pieces of, for example, 150 bp with the BedChunk command. How to use the command is described in the off-target reads section of the docs. Then recalculate the coverage and run ClinCNV again. As I remember, this solved the problem for our collaborators with the same panel.

Okay, thanks. I'll try it today.

I found a test case that reproduces your error. I will fix it ASAP and write to you once it is fixed.

I had some free time and sent my data to German. I did it a few minutes ago, so it seems I was late. Sorry :| Anyway, I hope the error is simple to fix.

Try a git pull now =) and run the same command. It should work.

Thanks for the data, it does work. I've sent you the results back.

Thank you for the tool... I am going to test it on a set of my data, and I was wondering if you could clarify how to run a set of germline samples against a set of normal germline controls?

Hi Duarte! We do not use controls in ClinCNV. You provide some (as many as possible) samples sequenced with the same technology (and preferably in the same lab), and the tool infers CNVs for all the samples included, even if they are just controls.
It is possible to run the tool for only one sample: the flag --normalSample then has to be specified with the ID of the sample of interest.

Thanks, I am now testing my samples. I am excited to see how your tool performs on them... However, I notice that the threads argument does not seem to do much to improve speed. I gave it quite a few threads and I can see they are started (in the list of running processes), but they all seem to be dormant except for one, and the speed at which samples are processed does not seem any faster than on a single thread.

That's correct, it is parallelised only partially. There are 2 time-consuming steps which are parallelised: GC normalization and final calling. In theory, these 2 should work faster with more threads (but more than 8 does not make sense, since for germline calling there are only 8 copy-number states). Please let me know how the tool worked, how you like the output, and how you plan to post-process the samples, and I'll try to help you with this.

For germline... you do not use the TSV files with B-allele frequencies at all?

Not at all. We've benchmarked the tool with and without B-allele frequencies (for germline). Most short CNVs contain no SNVs, so there are no B-allele frequencies at all, while long CNVs can be detected using coverage only. So we removed this feature completely: it added run time and made no difference in benchmarking (only marginal, about 1% of Precision/Recall in WES). However, I discourage running tumor CNV calling without B-allele frequencies; they really change the game there.

Thanks, could I offer a few suggestions?

1st) In the output TSV file for each sample, for some reason the length_KB field contains spaces. It seems to be an effort to give the same number of spaces to each length? I really don't see the point, and it will probably just lead to unwanted problems for people parsing those files who expect to use whitespace as a delimiter.

2nd) It would be cool if you could analyse a set of input files and create a folder with the analysis of the group. This way you could use that data to analyse a single sample of that set without redoing all the initial reclustering and read-depth analysis. I know you can set --normalSample to analyse only a single sample of the set, but the initial steps get redone. Is that correct?

Thanks for the suggestions! Indeed, we padded the lengths to the same number of spaces because doctors asked us to do so (as I remember); they check the results in Excel and it was more convenient for them. The columns are tab-separated, so spaces may be stripped from both ends of any cell value. I can implement "analysis only of the listed samples", that's not a problem, maybe in a couple of days. So far you may use --reanalyseCohort F so that ClinCNV will not try to reanalyse samples that you have already analysed (if their folders are created in the output folder).

Thanks... but in relation to the second point, I don't think you got the gist of what I was suggesting. I meant saving all the analysis data you compute for a given cohort as a data file, so that when you rerun and say you want to analyse only 1 sample, the initial process of clustering, gender detection, etc. can just be read from a file in the analysis results instead of running everything again. From what I can see in the test I have done, the analysis of 1 sample takes about 3 min on my panel, but the initialisation and clustering probably take twice as much. If I wanted to invoke the process 20 times for 20 samples, it would run that initial clustering analysis every single time, even though the same input files would always yield the same clustering results.

Current method:
1) read input data
2) cluster analysis, gender, coverage, etc.
3) run each sample analysis
4) finish

Current method, second time, indicating a specific sample:
1) read input data
2) cluster analysis, gender, coverage, etc.
3) run just the specified sample or list of samples
4) finish

My suggestion:
1) read input data
2) cluster analysis, gender, coverage, etc. > stored as a file in the results folder
3) run each sample analysis
4) finish

My suggestion, second time, indicating a specific sample or sample list:
1) read input data
2a) check for the cluster analysis folder in the results > read file
or
2b) do cluster analysis, gender, coverage, etc. > stored as a file in the results folder
3) run just the sample
4) finish

In this case, from this point forward, every time the script was invoked with the same initial inputs and the cluster analysis file was there, that time would be saved, as the step would only involve reading the cluster file already present in the results folder.

Ah, I see. That's why I did not pack this tool as an R package =) You may add "save.image()" to the beginning of the https://github.com/imgag/ClinCNV/blob/master/germline/germlineSolver.R file, then use "load(name_of_saved_image)" and run just the germlineSolver.R script with another opt$normalSample value. The libraries from the beginning of the main script have to be loaded somehow, too.

I used this mode for initial tuning of parameters and debugging. In the end, once you have established your parameters, you won't need this intermediate saving of the file. Once you add or remove a sample, you need to recalculate everything anyway.
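
The idea discussed in this exchange (store the cohort-level step once, reuse it while the inputs are unchanged, and recompute as soon as a sample is added or removed) can be sketched generically. The Python example below is not ClinCNV code and is not the save.image()/load() route described above. It is a minimal illustration in which the cache key is a hash of the input file contents, so any change to the cohort invalidates the stored result. The names cohort_fingerprint and cached_cohort_step, the cache file layout, and the compute callback standing in for clustering and normalisation are all hypothetical.

```python
import hashlib
import pickle
from pathlib import Path


def cohort_fingerprint(input_paths):
    """Hash the names and contents of all input files.

    Adding, removing or editing a file changes the fingerprint.
    """
    digest = hashlib.sha256()
    for path in sorted(str(p) for p in input_paths):
        digest.update(path.encode())
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def cached_cohort_step(input_paths, cache_dir, compute):
    """Return (result, reused).

    `compute` is a hypothetical stand-in for the expensive cohort-level
    step; it is called only when no cache file matches the current inputs.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"cohort_{cohort_fingerprint(input_paths)}.pkl"
    if cache_file.exists():
        with open(cache_file, "rb") as handle:
            return pickle.load(handle), True
    result = compute(input_paths)
    with open(cache_file, "wb") as handle:
        pickle.dump(result, handle)
    return result, False
```

Hashing file contents rather than trusting file names is a deliberate choice here: it matches the point that everything has to be recalculated once the cohort changes.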


Hi, my question about somatic CNVs: can we prioritize them according to their functional consequences, based on the parameters this tool reports?

I am asking because, once the CNVs are known, how should they be used in a biological context?


Hi Ravinsit06, we use Cancer Genome Interpreter for the annotation. I can upload the scripts that we use; you can also download the database from the CGI website: https://www.cancergenomeinterpreter.org/home