Question: About TCGA CNV data preprocessing
1
gravatar for Dylan
4 months ago by
Dylan30
Dylan30 wrote:

Hello all, I am looking at the Level 3 CNV files on TCGA. I have a few questions:

I download copy number variance data from TCGA database and mapped genomic regions to gene symbols using this method(https://www.biostars.org/p/311199/#311746). Now i get a matrix that its rows are genes and its columns are samples. If I want to use this data to cluster samples, how do I pre process the data? (p.s. for probes mapped to the same gene, I averaged their Segment_mean values, right?)

Thanks for any help,

cnv snp • 553 views
ADD COMMENTlink modified 4 months ago by Kevin Blighe37k • written 4 months ago by Dylan30
3
gravatar for Kevin Blighe
4 months ago by
Kevin Blighe37k
Republic of Ireland
Kevin Blighe37k wrote:

Hey,

You are referring to a post that I made. From where did you obtain the original data? - Broad Firebrowse (GISTIC 2.0-identified somatic copy number alterations) or just downloaded the original files from GDC?

If you followed the data processing exactly as follows:

Then, the statistically significant recurrent somatic copy number alterations (sCNA) are held in the *.igv.gistic files. You can extract statistically significant regions from this file and then pull out the original copy number over these on a per sample basis using GenomicRanges - the copy number that you take is indeed the Segment Mean from the original copy number identified by GISTIC 2.0 (or some other tool that outputs a segment mean).

If you do that, then you can build a matrix of:

  • statistically significant recurrent sCNAs in a group of patients as rows
  • patients as columns
  • Segment Mean over each region as the values

With that, I generated this and identified clusters of patients based on recurrent sCNA via Partitioning Around Medoids (PAM)::

fgh

[from: https://www.ncbi.nlm.nih.gov/pubmed/29682207]

Of course, you don't have to use that data, exactly, but you really have to know to what your data relates.

Kevin

ADD COMMENTlink written 4 months ago by Kevin Blighe37k

Hi, Thanks for your help! I just downloaded the CMV data from TCGA database and mapped genomic regions to gene symbols using the following code:

mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
genes <- getBM(attributes=c("hgnc_symbol","chromosome_name","start_position","end_position"), mart=mart)
genes <- genes[genes[,1]!="" & genes[,2] %in% c(1:22,"X","Y"),]
colnames(genes) <- c("GeneSymbol","Chr","Start","End")
genes_GR <- makeGRangesFromDataFrame(genes,keep.extra.columns = TRUE)
df = read.csv("BRCA-CNV.csv") # CNV data downloaded from TCGA database.
cnv =  makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
hits <- findOverlaps(genes_GR, cnv, type="within")
df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
df_ann <- unique(df_ann)
write.table(df_ann, file="BRCA-CNV-withGeneSymbol.txt", quote = F, sep = "\t", row.names = F)

Then I constructed the CNV matrix using the following code:

cnv <- as.matrix(read.table("BRCA-CNV-withGeneSymbol.txt"))
cnv <- cnv[,c(7:9)]
samples.tumor <- samples[which(substr(samples, 14, 16) %in% c("01A", "01B", "06A"))]
tumor <- matrix(0, nrow = length(genes), ncol = length(samples.tumor))
colnames(tumor) <- samples.tumor
rownames(tumor) <- genes
for(i in 1:nrow(tumor)){
    for(j in 1:ncol(tumor)){
        gene.tmp <- rownames(tumor)[i]
        sample.tmp <- colnames(tumor)[j]
        cnv.tmp <- cnv[which(cnv[,3] %in% gene.tmp),]
        tumor[i,j] <- mean(as.numeric(cnv.tmp[which(cnv.tmp[,2] %in% sample.tmp),1]), rm.na = T)
    }
}

At last, I got the CNV matrix. Is there any error in this process? The final data is like this:

                                gene1      gene2       gene3       gene4      gene5
TCGA-LD-A9QF-01A-32D-A41E-01 -0.3258360  0.2291041 -0.29244881  0.4456383  0.2990137
TCGA-EW-A1OX-01A-11D-A141-01  5.4529917 -0.6737190 -0.17351193  0.2469372  0.4106143
TCGA-A2-A0YT-01A-11D-A107-01  0.5855575 -1.4167641 -0.09580651 -1.9142909 -0.7032367
TCGA-E2-A14N-01A-31D-A134-01 -0.5166007 -0.4208664  0.64834184  0.1254564  0.3653708
TCGA-BH-A18F-01A-11D-A12A-01 -1.4072964 -0.5098043  0.22809823  0.1885511 -0.1663476
ADD REPLYlink modified 4 months ago • written 4 months ago by Dylan30

You just took the tumour samples, merged them together, and then summarised copy number per gene? There's nothing wrong with that; however, if you try to publish it, a reviewer would ask if these copy number events are somatic or germline. Without subtracting the normal copy number profiles, you cannot know whether or not these are true somatic events. It depends on what exactly you want to do, of course.

  • Germline copy number event = copy number variation
  • Somatic copy number event = copy number alteration

The last time that I checked, the open access (level 3) TCGA copy number data is output per sample and is available separately for tumour and normal. This is the benefit of just taking the Broad Firebrowse data, i.e., because they have processed this level 3 data to produce somatic copy number events.

ADD REPLYlink modified 4 months ago • written 4 months ago by Kevin Blighe37k
1

I have downloaded the masked CNV data filtered germline variation as input data and get the average of all probes per gene. If I just want to use CNV data for cluster analysis, is this right? Thank you for your sharing. Now I know the difference between variation and alteration. Thanks again!

ADD REPLYlink written 4 months ago by Dylan30

Hey, it was not I who up-voted your comment. I am reading your comment for the first time right now. If that is the data that you have, then it seems okay!

ADD REPLYlink written 4 months ago by Kevin Blighe37k

@Kevin Blighe Hello Kevin, if i want to extract the genes from both tumor and normal samples That will make more sense instead or no

Actually, my data have "01A", "10A and "11A" and i also want to built the CNV matrix, I tried @Django code but an error is shown?

ADD REPLYlink written 3 months ago by Chaimaa150

@Kevin Blighe Hello Kevin Could you plz answer my question?

ADD REPLYlink written 3 months ago by Chaimaa150
1

You have not shown the error message, so, how can I advise further?

ADD REPLYlink written 3 months ago by Kevin Blighe37k

@Kevin Blighe Hi, Kevin sorry for that here is the piece of code i tried and the following error

    cnv <- as.matrix(read.table("CNV-withGeneSymbol.txt"))
    cnv <- cnv[,c(7:9)]
    samples <- gsub("TCGA-[A-Z0-9]*-[A-Z0-9]*-", "", (gsub("-[0-9A-Z]*-[0-9A-Z]*-01$", "", cnv[,1])))
    samples.tumor <- samples[which(substr(samples, 14, 16) %in% c("01A", "01B", "06A"))]
    tumor <- matrix(0, nrow = length(genes), ncol = length(samples.tumor))
    colnames(tumor) <- samples.tumor
    rownames(tumor) <- genes
    for(i in 1:nrow(tumor)){
        for(j in 1:ncol(tumor)){
            gene.tmp <- rownames(tumor)[i]
            sample.tmp <- colnames(tumor)[j]
            cnv.tmp <- cnv[which(cnv[,3] %in% gene.tmp),]
            tumor[i,j] <- mean(as.numeric(cnv.tmp[which(cnv.tmp[,2] %in% sample.tmp),1]), rm.na = T)
    }
    }

Error shown here; Error in [<-(*tmp*, i, j, value = mean(as.numeric(cnv.tmp[which(cnv.tmp[, : subscript out of bounds

ADD REPLYlink written 3 months ago by Chaimaa150
1

Hey, thank you! Can you show the first few lines and columns of the object cnv? Just paste them here, highlight them, and then click the 101 010 button

ADD REPLYlink written 3 months ago by Kevin Blighe37k

@Kevin Blighe, I'm working on RGui(64-bit) and the objects are not shown here. I have posted the full code i applied, and i can share the data as well if y want ?

library(GenomicRanges)
> mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
> mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
> genes <- getBM(attributes=c("hgnc_symbol","chromosome_name","start_position","end_position"), mart=mart)
> genes <- genes[genes[,1]!="" & genes[,2] %in% c(1:22,"X","Y"),]
> xidx <- which(genes[,2]=="X")
> yidx <- which(genes[,2]=="Y")
> genes[xidx, 2] <- 23
> genes[yidx, 2] <- 24
> genes[,2] <- sapply(genes[,2],as.integer)
> genes <- genes[order(genes[,3]),]
> genes <- genes[order(genes[,2]),]
> colnames(genes) <- c("GeneSymbol","Chr","Start","End")
> df<- read.table("E:/results/Stage1_CNV_data_v2.txt", sep="\t", stringsAsFactors=FALSE, header=TRUE)
> colnames(df) <- c("Barcode", "chr", "start", "end", "extra1", "extra2")
> cnv <-  makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
> hits <- findOverlaps(genes_GR, cnv, type="within")
> df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
> df_ann <- unique(df_ann)
> write.table(df_ann, file="outputfile-GeneSymbol.txt", quote = F, sep = "\t", row.names = F)
> cnv <- as.matrix(read.table("outputfile-GeneSymbol.txt"))
> cnv <- cnv[,c(7:9)]
> samples <- gsub("TCGA-[A-Z0-9]*-[A-Z0-9]*-", "", (gsub("-[0-9A-Z]*-[0-9A-Z]*-01$", "", cnv[,1])))
> samples.tumor <- samples[which(substr(samples, 14, 16) %in% c("01A", "01B", "06A"))]
> tumor <- matrix(0, nrow = length(genes), ncol = length(samples.tumor))
> colnames(tumor) <- samples.tumor
> rownames(tumor) <- genes
> for(i in 1:nrow(tumor)){
+     for(j in 1:ncol(tumor)){
+         gene.tmp <- rownames(tumor)[i]
+         sample.tmp <- colnames(tumor)[j]
+         cnv.tmp <- cnv[which(cnv[,3] %in% gene.tmp),]
+         tumor[i,j] <- mean(as.numeric(cnv.tmp[which(cnv.tmp[,2] %in% sample.tmp),1]), rm.na = T)
+ }
+ }
Error in `[<-`(`*tmp*`, i, j, value = mean(as.numeric(cnv.tmp[which(cnv.tmp[,  : 
  subscript out of bounds
ADD REPLYlink written 3 months ago by Chaimaa150

Here is the data

ADD REPLYlink written 3 months ago by Chaimaa150
1

I am not exactly sure what you are aiming to do with this for loop. However, the way to debug them is, after they crash and produce the error, check the last value of i and j, and then take a look at those indices in tumor and cnv. Then you will be able to debug it.

ADD REPLYlink written 3 months ago by Kevin Blighe37k

@Kevin Blighe hi Kevin, So, i have to change the values of i and j and then run the for loop again or what? I didn't get exactly what do you mean by taking a look and then debug it again?

Actually, i want to construct the CNV matrix (rows represent the samples and columns represent the annotated genes) .

Exactly, i want to integrate this data with others multi-omics datasets such as gene expression and DNA methylation for further analysis. All these datasets have such format except CNV data which i get stuck to reformat it now. That's why i used that for loop to get the matrix and maybe after that i can merge all of these datasets together. Is this for loop right or no to make such matrix?

ADD REPLYlink written 3 months ago by Chaimaa150
1

Yes, but what is the value of i and j after the error is produced?

You may need to do the following:

  1. define a unique list of regions (chr:start:end) across all of your samples
  2. create an empty data-frame with just the unique list of regions as row-names - this data-frame will eventually hold your final output
  3. loop over your input data per sample (i), and output 'segment mean' where there is overlap with each segment (j). Overlap can be determined by GenomicRanges
ADD REPLYlink written 3 months ago by Kevin Blighe37k

@Kevin Blighe Hi Kevin, Thanks for yr quick reply! your helpful suggestions!

The last values of i and j are 1L.

I want my output like a matrix of samples rows and genes in the columns with their corresponding segment values.

I'm more familiar with MATLAB, i used the output file produced from your annotation code and i write a piece of code that can allow allocating this matrix but my code there doesn(t work too. I tried the code in R which i share with you yesterday and also no results!

ADD REPLYlink written 3 months ago by Chaimaa150

@Kevin Blighe Hi Kevin, i have checked as you said but i couldn't fix it This is the first few lines and columns of the object cnv

 1
    GeneSymbol
    Chr
    Start
    2
    ARHGEF16
    1
    3370990
    3
    ARHGEF16
    1
    3370990
    4
    ARHGEF16
    1
    3370990
    5
    ARHGEF16
    1
    3370990
    6
    ARHGEF16
    1
    3370990
    7
    ARHGEF16
    1
    3370990

I already explain to you why i used this for Loop, hope you can help me plz!

ADD REPLYlink written 3 months ago by Chaimaa150

@Django i try your code but an error is shown Error in [<-(*tmp*, i, j, value = mean(as.numeric(cnv.tmp[which(cnv.tmp[, : subscript out of bounds Can you help me to fix it ?

ADD REPLYlink written 3 months ago by Chaimaa150
1

I think the Error code means there is no element in "sample.tmp". And which() will return nothing.

ADD REPLYlink written 3 months ago by Dylan30

@Django Itried your code and i think it has some missing line before samples, where is the object samples? could y plz fix it

ADD REPLYlink modified 3 months ago • written 3 months ago by Chaimaa150

Hello @Kevin Blighe can you plz tell me how to open this file "Tumor.All.txt.igv.gistic" and how to plot the heatmap using which package( ComlpexHeatmap package in your paper) (Can you plz provide the code for this step too).

With my respect to you!

ADD REPLYlink written 4 months ago by Chaimaa150
1

Hey bro, you can read back in the file like this:

require(GenomicRanges)
recCNV.Template.GR <- makeGRangesFromDataFrame(read.table("Tumor.All.txt.igv.gistic", sep="\t", header=TRUE), keep.extra.columns=TRUE)

This reads it into a GenomicRanges object.

Do you also have an object called CN.tumor in your R session?

ADD REPLYlink written 4 months ago by Kevin Blighe37k

@Kevin Blighe yes it should be there bcz i just follow your whole process step by step and everything work smoothly but only with FFPE file.

ADD REPLYlink written 4 months ago by Chaimaa150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1102 users visited in the last hour