Tutorial: Tutorial: Analyze exome Copy number variation (CNV) in single patient or in population.
gravatar for Chirag Nepal
5.2 years ago by
Chirag Nepal2.2k
Chirag Nepal2.2k wrote:

I have spend sometime trying to understand/analyze exome CNV (copy number variation). There are many tools to do so, but here i am using varscan2, DNACopy and  GISTIC.

Varscan2: http://varscan.sourceforge.net/index.html

DNACopy: http://www.bioconductor.org/packages/2.3/bioc/html/DNAcopy.html

Gistic2: ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm



Step-1: Run varscan2 on normal-tumor-mpileup and make copy-number call.

java -jar $varscan copynumber normal-tumor-mpileup $outputFile --mpileup 1 --p-value 0.005 --min-coverage 30 --min-map-qual 20 --min-base-qual 20 --data-ratio Calculate_From_Data

The output file has suffix ".copynumber", for eg, "filename.copynumber". It contains all the segments above certain threshold, which are generally in large number, ranging close to 1 million per file. The output should further be processed to find significant amplified/deleted CNA or detect reoccurrent SNVs in large cohort. We will first do it for single T/N pair and on the population. Newer version of samtools mpileup file might throw an error while running copynumber.


Step-2: Use CBS binary segmentation in R to find segmented region. Use mergeSegment.pl (provided by varscan) to merge them.

P1 = read.table ("filename.copynumber", header=TRUE)
CNA.object <- CNA( genomdat = P1[,7], chrom = P1[,1], maploc = P1 [,2], data.type = 'logratio')
CNA.object.smoothed     <- smooth.CNA(CNA.object)
CNA.object.smoothed.seg <- segment(CNA.object.smoothed, verbose=0, min.width=2)
seg.pvalue <- segments.p(CNA.object.smoothed.seg, ngrid=100, tol=1e-6, alpha=0.05, search.range=100, nperm=1000)
write.table (seg.pvalue,       file="cbs_P1_pvalue", row.names=F, col.names=F, quote=F, sep="\t")

# This will not print the p-value.
#segmented = CNA.object.smoothed.seg$output
#write.table (segmented[,2:6], file="cbs_P1",        row.names=F, col.names=F, quote=F, sep="\t")


  • Step-2.1: Input the output file to mergeSegments.pl [downloaded from varscan2]. Note: mergeSegment.pl needs an additional column-2, where "chrom" needs to inserted.
cat cbs_P1_pvalue | awk 'BEGIN {OFS="\t"} { print $1,"chrom",$2,$3,$4,$5,$6,$7,$8,$9,$10 }' > formattedInputFile

perl mergeSegment.pl formattedInputFile --ref-arm-sizes $refArmSize --output-basename filteredOutput

Now, from the output file, significant CN amplified or deleted regions can be detected.

cat filteredOutput | grep -e amplification -e deletion 

cat filteredOutput | grep large-scale

cat filteredOutput | grep neutral

Have a closer look at the output file. The output files also reports few regions which are neutral and few large scale deletions.


Step-3: Now, let's try to find recurrent CNV among population, for which i use varscan and GISTIC. The output from varscan copynumber is feed to GISTIC. To run GISTIC, you need to make segment file and marker position. Also refer to the GISTIC documentation ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm



  • Step-3.1: Make marker position
cat filename.copynumber | grep -v -e chrom -e chrM -e chrX -e chrY | awk 'BEGIN {OFS="\t"} { print "ID'",$1,$2 }' | sed 's/chr//g' > markerPosition


  • Step-3.2: Make segment file. The output from mergeSegment.pl, which in this example with start with suffix "filteredOutput' should be used to make segment file.

cat $name | awk 'BEGIN {OFS="\t"} { print "ID",$2,$3,$4,$5,$6 }' | grep -v -e chrM -e chrX -e chrY | sed 's/chr//g' > segmentedFile


Step-4: Run GISTIC. Please read the documentation on how to install/execute GISTIC.



gp_gistic2_from_seg -b $basedir -seg $segfile -mk $markersfile -refgene $refgenefile -genegistic 1 -smallmem 1 -broad 1 -brlen 0.5 -conf 0.90 -armpeel 1 -savegene 1 -gcm extreme


Step-5: Check the output plots. The amplified and deletion plots and the output text files are ready.


From output file of GISTIC, read information from this site.




Please feel free to contribute. I will update when i find better solution.

ADD COMMENTlink modified 3.1 years ago by brouse38150 • written 5.2 years ago by Chirag Nepal2.2k
Do not use on-target data, use off-target data! See Revolution (?) in CNA detection using exome/targeted sequencing
ADD REPLYlink modified 5.2 years ago • written 5.2 years ago by Irsan7.2k

I edited the best I could to make this more readable. Though, you might want to add some more links/background on the approach and explain some of the variables being used (just a suggestion).

ADD REPLYlink written 5.2 years ago by SES8.4k

I think $name at step 3.2 need to be replaced with actual file name.

ADD REPLYlink written 5.2 years ago by Sachin Pundhir100

I use Varscan on exome data, so it seems to me that I should the exome target list as Marker list - maybe by taking the exon mid point (since markerPosition only allows one coordinate). Thoughts?

Incidentally, I've found that using the exome target list as whitelist in samtools mpileup greatly reduces noise in the Varscan output.

ADD REPLYlink written 5.1 years ago by jeltje.van.baren80

Can i use this method on WGS data?

ADD REPLYlink written 4.6 years ago by rse90

It would be very inefficient. I'd recommend CNVnator instead for WGS.

ADD REPLYlink written 4.6 years ago by Eric T.2.6k

Thank you. After that run GISTIC for CNAs?

ADD REPLYlink written 4.6 years ago by rse90

GISTIC is normally used to find recurrent CNV among patients, and there are other tools as well to find recurrent CNVs. First, you need to find enriched CNV segments for each patient, using anytool (CNVnator or any other tools), then you can use GISTIC2 to find recurrent CNV.

ADD REPLYlink written 4.6 years ago by Chirag Nepal2.2k

Thank you

But since GISTIC need segmentation file and marker file which require number of markers to be filled in the segmentation file and the marker position in the marker file. Seg.CN i can get from the CNV segments from CNVnator. So, was thinking if CNVnator output provides these information.

ADD REPLYlink written 4.6 years ago by rse90

There are many tools to find recurrent CNVs but GISTIC is the most popular i guess

ADD REPLYlink written 4.6 years ago by rse90

Hi rse, did you figure out a way to run GISTIC2 after calling CNVs with CNVnator?

ADD REPLYlink written 4.5 years ago by lcl0

No, i could not run GISTIC2

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by rse90

Thanks for your process.I have a question,could please interpret why should exclude chrM chrY chrX in the gistic analysis?

ADD REPLYlink written 3.8 years ago by reany0


Could I use copy number in .csv format from caveman for gistic input?

ADD REPLYlink written 21 months ago by A3.9k

Thank you for providing this tutorial as I found tailoring this together is very hard for me

Could I ask from where I can provide refArmSize please?

ADD REPLYlink modified 20 months ago • written 21 months ago by A3.9k
gravatar for wzlnot
5.0 years ago by
wzlnot20 wrote:

Replenish some detail and suggests:

1)In DNAcopy output part, the col.names and  row.names suggested to set Ture . The mergeSegments.pl can hangdle this. 


2)the chromosome name between --ref-arm-sizes file and DNAcopy output should be consistent. if not, the mergeSegments.pl will malfunction.

3)filter the un-located scaffolds or mtDNA if them exist. mergeSegments.pl assign the hash key with the chromosome name in --ref-arm-size file, and call hash value by use the chromsome name in DNAcopy output. If chromsome of DNAcopy is not in  --ref-arm-size file, then cause the "Use of uninitialized value ..." error.

so, suggest:

cat CNA.smoothed.segs.pvalue | awk -F "\t" '$3~/chr[1-9,XY]/{print}' > mergeSegments.input
ADD COMMENTlink modified 5.0 years ago • written 5.0 years ago by wzlnot20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1965 users visited in the last hour