Credits to Tiago Silva; https://f1000research.com/articles/5-1542
Update April 5, 2020:
To be clear, this pipeline has little to do with GISTIC. Here, the segmented copy number data that is obtained from Broad Institute is processed via an R package, gaia, which performs the same function as GISTIC, i.e., given a cohort of samples, identifies recurrent copy number events.
Update October 14, 2018:
This thread has become a sort of focal point, so, I wanted to make it absolutely clear the process to follow. If your goal is to obtain somatic copy number alterations (sCNA) for a group of TCGA patients and/or identify recurrent sCNA in these patients, then follow these steps:
- Part I - download segmented sCNA data for any TCGA cohort from Broad Institute's FireBrowse server and identify recurrent sCNA regions in these with gaia (R)
- Part II - plot recurrent sCNA gains and losses from GAIA
- Part III - annotate the recurrent sCNA regions (this post, just below)
- Part IV -
generate heatmap of recurrent sCNA regions over your cohort
----------------------
.
Part III
A
Create a reference dataset of all genes and save it as a GenomicRanges object
library(biomaRt)
library(GenomicRanges)
#Set up an gene annotation template to use
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
genes <- getBM(attributes=c("hgnc_symbol","chromosome_name","start_position","end_position"), mart=mart)
genes <- genes[genes[,1]!="" & genes[,2] %in% c(1:22,"X","Y"),]
xidx <- which(genes[,2]=="X")
yidx <- which(genes[,2]=="Y")
genes[xidx, 2] <- 23
genes[yidx, 2] <- 24
genes[,2] <- sapply(genes[,2],as.integer)
genes <- genes[order(genes[,3]),]
genes <- genes[order(genes[,2]),]
colnames(genes) <- c("GeneSymbol","Chr","Start","End")
genes_GR <- makeGRangesFromDataFrame(genes,keep.extra.columns = TRUE)
B
Store your own data as a GenomicRanges object:
NB - if following from Part II, the df
used here is the RecCNV
object, produced in Part II
colnames(df) <- c("chr", "start", "end", "extra1", "extra2")
df
chr start end extra1 extra2
1 51598 1500664 226 0.1646
1 1617778 1653196 12 -0.4115
1 1653256 15362197 7748 0.0056
df_GR <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
df_GR
GRanges object with 4 ranges and 3 metadata columns:
seqnames ranges strand |
<Rle> <IRanges> <Rle> |
[1] 1 [ 51598, 1500664] * |
[2] 1 [ 1617778, 1653196] * |
[3] 1 [ 1653256, 15362197] * |
extra1 extra2
<integer> <numeric>
[1] 226 0.1646
[2] 12 -0.4115
[3] 7748 0.0056
-------
C
Overlap your regions with the reference dataset that you created
hits <- findOverlaps(genes_GR, df_GR, type="within")
df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
head(df_ann)
chr start end extra1 extra2 GeneSymbol Chr
1 51598 1500664 226 0.1646 OR4G4P 1
1 51598 1500664 226 0.1646 OR4G11P 1
1 51598 1500664 226 0.1646 OR4F5 1
Start End
52473 54936
62948 63887
69091 70008
D
To make filtering easier, we can further manipulate the data to show the precise co-ordinates of each gene and the region co-ordinates in which it was found to have CNA / CNV:
AberrantRegion <- paste0(df_ann[,1],":", df_ann[,3],"-", df_ann[,4])
GeneRegion <- paste0(df_ann[,7],":", df_ann[,8],"-", df_ann[,9])
Final_regions <- cbind(df_ann[,c(6,2,5)], AberrantRegion, GeneRegion)
Final_regions[seq(1, nrow(Final_regions), 20),]
extra2 chr extra1 AberrantRegion
0.1646 1 226 1:51598-1500664
0.1646 1 226 1:51598-1500664
0.1646 1 226 1:51598-1500664
0.1646 1 226 1:51598-1500664
GeneRegion
OR4G4P:1-52473
TUBB8P11:1-808672
FAM132A:1-1177826
TMEM240:1-1470554
Hi,
I want to check for the frequency of CNV (amplification or deletion) in the given cancer datasets from TCGA. please kindly help, suggest tool or method, Thanks is advance.
Maftools does that pretty easily
Hello, Thank you for your kind response. I want to check the frequency of particular deletion/amplification in no of patients.(like: how many patients have same gene deleted or amplified, exact no of patients). please suggest. thank you
Do you have GISTIC data? If you have that the exact number of altered samples is very simple by maftools If you have segment data, by annotating the genes in copy number range again you can say each specific gene has been altered in which samples
Hello A, I am trying to do CNV analysis. I have gistic 2 and segnment files. I have no idea how to do the analysis. I want to copmare cnv between two groups. Can you help me with this. Is this possible to share the r code for maftools? Thanks
If you can, please follow this: https://f1000research.com/articles/5-1542
...or my code, starting from here: C: How to extract the list of genes from TCGA CNV data
Thank you Kevin, I do not know how to customize these to my data. I have all primary tumor (not normal) and I want to do CNV to compare between females and males. can you help me with this please? Thanks
I am sorry. I am obviously quite limited in how I can help you from my current position. You will have to help me so that I help you, if you understand me.
Hi Kevin,
I download TCGA BRCA CNV data and ran your R codes in part A, B, and C and also generated results.all files, however, R showed
the results file looks correct, with 6 columns and 6862 rows, in the aberration kind, 1=gain and 0=loss of CNV?
thank you very much!!
Yuanchun Ding from California of USA
Hi Yuanchun, I am not sure about this warning message.
Yes, 0 is loss, and 1 is gain. These are encoded in Part C, here: C: How to extract the list of genes from TCGA CNV data