I have RNA sequencing data from patient-matched primary breast tumours that have metastasised to distant organs such as brain and bone.
I would like to use the new import-rna feature in CNVkit to calculate copy number variations in these samples.
I ran import-rna using:
cnvkit.py import-rna ./salmon-counts/*-bone-*.txt \
    --gene-resource cnvkit/data/ensembl-gene-info.hg38.tsv \
    --correlations cnvkit/data/tcga-skcm.cnv-expr-corr.tsv \
    --output bone-cnv-summary.tsv --output-dir out
The documentation at https://cnvkit.readthedocs.io/en/stable/rna.html states that:
--correlations input is not required but is strongly recommended.
The TCGA melanoma cohort correlations can be used for analysis of any tissue type, not just neoplastic melanocytes.
However, best results will usually be achieved with a correlations table specific to the test cohort.
cnv_expression_correlate.py generates this table from input tables of per-gene and per-sample
copy number and expression levels, typically retrieved from cBioPortal for TCGA cancer-specific cohorts.
Therefore, I would like to use the
cnv_expression_correlate.py script on TCGA BRCA data to pass as input for
The docstring at the top of the Python script also states:
"""Get correlation coefficients for matched copy number and expression data. cBioPortal offers a nice feature in which you can download a summary of many large-scale sequencing studies. In this summary are two files that contain the copy number and expression values of every gene in the study for every sample. This summary is available for nearly every TCGA study, and the data is intuitive to access, therefore I have designed this pre-processing script to accept these as inputs. Of course, the user can calculate their own Pearson values from other sources of data if they prefer -- in this case, the user should formate their data to match the output of this prepocessing script. """
However, neither the cBioPortal website nor the cgdsr R package seems to allow downloading the expression and CNV data, with Entrez IDs, for all genes.
What would be the best way to approach this?
I was thinking of using the RTCGAToolbox to pull the
tcga.brca <- RTCGAToolbox::getFirehoseData(dataset = "BRCA", RNASeqGene = TRUE, RNASeq2GeneNorm = TRUE, CNASeq = TRUE, clinical = TRUE)
then use biomaRt to retrieve the Entrez gene IDs for the HUGO gene symbols and use those files as input to the
Would that be the correct way to approach it?
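To make the plan concrete, here is a rough Python sketch of what I have in mind, assuming I end up with two gene-by-sample tables (copy number and expression) indexed by HUGO symbol. The function names, the toy values and the symbol-to-Entrez dictionary are all hypothetical placeholders of mine, and this is not the logic or output format of cnv_expression_correlate.py itself; it only illustrates relabelling genes by Entrez ID and computing a per-gene Pearson correlation across shared samples.

```python
import pandas as pd


def relabel_by_entrez(table: pd.DataFrame, symbol_to_entrez: dict) -> pd.DataFrame:
    """Re-index a gene-by-sample table from HUGO symbols to Entrez IDs.

    Symbols missing from the mapping are dropped; if two symbols map to
    the same Entrez ID, only the first row is kept.
    """
    mapped = table.index.map(symbol_to_entrez)  # NaN where no mapping exists
    keep = mapped.notna()
    out = table.loc[keep].copy()
    out.index = mapped[keep].astype(int)
    return out[~out.index.duplicated(keep="first")]


def per_gene_pearson(cnv: pd.DataFrame, expr: pd.DataFrame) -> pd.Series:
    """Pearson r between copy number and expression for each shared gene.

    Both inputs have genes as rows and samples as columns; only genes and
    samples present in both tables are used.
    """
    genes = cnv.index.intersection(expr.index)
    samples = cnv.columns.intersection(expr.columns)
    cnv_t = cnv.loc[genes, samples].T.astype(float)
    expr_t = expr.loc[genes, samples].T.astype(float)
    # Columns are genes after transposing, so corrwith pairs them by gene ID.
    return cnv_t.corrwith(expr_t, axis=0, method="pearson").rename("pearson_r")


# Toy values only (not real TCGA data); the two Entrez IDs are for ESR1 and ERBB2.
symbol_to_entrez = {"ESR1": 2099, "ERBB2": 2064}
cnv = pd.DataFrame(
    {"S1": [0.0, 0.0, 1.0], "S2": [1.0, 1.0, 0.0],
     "S3": [2.0, 2.0, 1.0], "S4": [3.0, 3.0, 0.0]},
    index=["ESR1", "ERBB2", "NOT_A_GENE"],
)
expr = pd.DataFrame(
    {"S1": [1.0, 4.0], "S2": [2.0, 3.0], "S3": [3.0, 2.0],
     "S4": [4.0, 1.0], "S5": [9.0, 9.0]},
    index=["ESR1", "ERBB2"],
)

correlations = per_gene_pearson(
    relabel_by_entrez(cnv, symbol_to_entrez),
    relabel_by_entrez(expr, symbol_to_entrez),
)
print(correlations)
```

In practice the mapping would come from a BioMart query rather than a hand-written dictionary, and the result would still need to be written out in whatever column layout import-rna expects for its correlations table.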
Thanks a million!