Question: CNVkit import-rna: calculate CNV-expression correlation using cBioPortal
3 months ago by
noodle30 wrote:

Hi there,

I have RNA sequencing data from patient-matched primary breast tumours that have metastasised to distant organs such as brain and bone.

I would like to use the new import-rna feature in CNVkit to calculate copy number variations in these samples.

I ran import-rna using:

import-rna ./salmon-counts/*-bone-*.txt \
  --gene-resource cnvkit/data/ensembl-gene-info.hg38.tsv \
  --correlations cnvkit/data/tcga-skcm.cnv-expr-corr.tsv \
  --output bone-cnv-summary.tsv --output-dir out

The documentation states:

The --correlations input is not required but is strongly recommended. The TCGA melanoma cohort correlations can be used for analysis of any tissue type, not just neoplastic melanocytes. However, best results will usually be achieved with a correlations table specific to the test cohort. The script generates this table from input tables of per-gene and per-sample copy number and expression levels, typically retrieved from cBioPortal for TCGA cancer-specific cohorts.

Therefore, I would like to run the script on TCGA BRCA data and pass its output as the --correlations input to import-rna.

The docstring at the top of the Python script also states:

"""Get correlation coefficients for matched copy number and expression data.

cBioPortal offers a nice feature in which you can download a summary of many
large-scale sequencing studies. In this summary are two files that contain
the copy number and expression values of every gene in the study for every
sample.  This summary is available for nearly every TCGA study, and the data
is intuitive to access, therefore I have designed this pre-processing script
to accept these as inputs. Of course, the user can calculate their own
Pearson values from other sources of data if they prefer -- in this case,
the user should formate their data to match the output of this prepocessing

However, as far as I can tell, neither the cBioPortal website nor the cgdsr R package lets you download the expression and CNV data, with Entrez IDs, for all genes at once.

What would be the best way to approach this?

I was thinking of using RTCGAToolbox to pull the data:

tcga.brca <- RTCGAToolbox::getFirehoseData(dataset = "BRCA", 
                                           RNASeqGene = TRUE,
                                           RNASeq2GeneNorm = TRUE,
                                           CNASeq = TRUE,
                                           clinical = TRUE)

I would then use biomaRt to retrieve the Entrez gene IDs for the HUGO gene symbols and use the resulting files as input to the script.

Would that be the correct way to approach it?
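The symbol-to-ID step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the mapping is assumed to have been exported beforehand (for example from biomaRt) as a two-column TSV whose column names `hgnc_symbol` and `entrezgene_id` are assumptions, and the helper names and toy expression values are hypothetical.

```python
import csv
import io


def load_symbol_to_entrez(handle):
    """Read a TSV with columns hgnc_symbol and entrezgene_id into a dict.

    The column names are an assumption about how the mapping was exported.
    Rows with a blank symbol or a blank ID are skipped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    mapping = {}
    for row in reader:
        symbol = row["hgnc_symbol"].strip()
        entrez = row["entrezgene_id"].strip()
        if symbol and entrez:
            # If a symbol maps to several IDs, keep the first one seen.
            mapping.setdefault(symbol, entrez)
    return mapping


def add_entrez_column(rows, mapping):
    """Yield (symbol, entrez_id, values) for rows whose symbol is mapped."""
    for symbol, values in rows:
        entrez = mapping.get(symbol)
        if entrez is not None:
            yield symbol, entrez, values


# Toy inputs for illustration only; the expression values are made up.
mapping_tsv = "hgnc_symbol\tentrezgene_id\nTP53\t7157\nBRCA1\t672\nFAKE1\t\n"
mapping = load_symbol_to_entrez(io.StringIO(mapping_tsv))

expr_rows = [
    ("TP53", [5.1, 6.3]),
    ("BRCA1", [2.0, 2.4]),
    ("UNKNOWN", [0.0, 0.1]),  # no mapping, so this row is dropped
]
annotated = list(add_entrez_column(expr_rows, mapping))
```

Genes without a usable ID are dropped rather than guessed, so it is worth counting how many rows are lost before using the result as input.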

Thanks a million!

rna-seq cnvkit • 231 views

I got this to work.

At first glance it was not obvious where to get the summary data files from cBioPortal, but they are indeed there:

For TCGA BRCA (Breast Cancer)

I then ran the following command:

python -o tcga-brca.cnv-expr-corr.tsv brca_tcga/data_CNA.txt brca_tcga/data_RNA_Seq_v2_expression_median.txt
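For anyone wondering what a per-gene copy number / expression correlation looks like, here is a minimal sketch. It is not the CNVkit script and makes no claim about that script's output format; it only illustrates the idea of a Pearson coefficient per gene over the samples shared by the two tables. It assumes both files are tab-separated with `Hugo_Symbol` and `Entrez_Gene_Id` as the first two columns followed by one column per sample. The function names and toy values are hypothetical.

```python
import csv
import io
import math


def to_float(text):
    """Convert a cell to float; return None for blanks, 'NA' or NaN."""
    try:
        value = float(text)
    except ValueError:
        return None
    return None if math.isnan(value) else value


def read_table(handle):
    """Return (sample_ids, {(hugo, entrez): [values]}) from a gene-by-sample TSV."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]
    table = {}
    for row in reader:
        if len(row) != len(header):
            continue  # skip malformed rows
        table[(row[0], row[1])] = [to_float(v) for v in row[2:]]
    return samples, table


def pearson_r(xs, ys):
    """Pearson correlation; None if fewer than 2 points or zero variance."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def correlate(cna_handle, expr_handle):
    """Per-gene Pearson r between copy number and expression on shared samples."""
    cna_samples, cna = read_table(cna_handle)
    expr_samples, expr = read_table(expr_handle)
    cna_idx = {s: i for i, s in enumerate(cna_samples)}
    expr_idx = {s: i for i, s in enumerate(expr_samples)}
    shared = [s for s in cna_samples if s in expr_idx]
    results = {}
    for key, cna_values in cna.items():
        if key not in expr:
            continue
        pairs = [(cna_values[cna_idx[s]], expr[key][expr_idx[s]]) for s in shared]
        pairs = [(c, e) for c, e in pairs if c is not None and e is not None]
        r = pearson_r([c for c, _ in pairs], [e for _, e in pairs])
        if r is not None:
            results[key] = r
    return results


# Toy tables with made-up genes and values; sample order differs on purpose.
cna_tsv = (
    "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\tS3\tS4\n"
    "GENE_A\t1\t-1\t0\t1\t2\n"
    "GENE_B\t2\t0\t0\t0\t0\n"
)
expr_tsv = (
    "Hugo_Symbol\tEntrez_Gene_Id\tS3\tS2\tS1\tS5\n"
    "GENE_A\t1\t30\t20\t10\t99\n"
    "GENE_B\t2\t5\t6\t7\t8\n"
)
results = correlate(io.StringIO(cna_tsv), io.StringIO(expr_tsv))
```

Samples are matched by ID rather than by column position, and genes with constant copy number across the shared samples are skipped because their correlation is undefined.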

3 months ago by
noodle30 wrote:

After more intense googling...

This is the file I should use?

