Question: How to extract the list of genes from TCGA CNV data
0
gravatar for Chaimaa
22 months ago by
Chaimaa160
Chaimaa160 wrote:

Hi guys, I have got a CNV data from TCGA as shown below and my goal to extract the list of the genes by using GISTIC Could anyone plz told me how can i deal with this data by using GISTIC and how to apply GISTIC.

  • TCGA-2A-A8VL-10A-01D-A379-01 1 51598 1500664 226 0.1646
  • TCGA-2A-A8VL-10A-01D-A379-01 1 1617778 1653196 12 -0.4115
  • TCGA-2A-A8VL-10A-01D-A379-01 1 1653256 15362197 7748 0.0056
  • TCGA-2A-A8VL-10A-01D-A379-01 1 15362212 15362449 6 0.7626

https://postimg.cc/image/g4kusd56j/

cnv genes tcga • 5.5k views
ADD COMMENTlink modified 22 months ago by Kevin Blighe55k • written 22 months ago by Chaimaa160
12
gravatar for Kevin Blighe
22 months ago by
Kevin Blighe55k
Kevin Blighe55k wrote:

Credits to Tiago Silva; https://f1000research.com/articles/5-1542

Update October 14, 2018:

This thread has become a sort of focal point, so, I wanted to make it absolutely clear the process to follow. If your goal is to obtain somatic copy number alterations (sCNA) for a group of TCGA patients and/or identify recurrent sCNA in these patients, then follow these steps:

  • Part I - download segmented sCNA data for any TCGA cohort from Broad Institute's Firebrowse server and identify recurrent sCNA regions in these with GAIA
  • Part II - plot recurrent sCNA gains and losses from GAIA
  • Part III - annotate the recurrent sCNA regions (this post, just below)
  • Part IV - generate heatmap of recurrent sCNA regions over your cohort

Credits: TCGAbiolinks

----------------------

--------------------

.

Part III

A

Create a reference dataset of all genes and save it as a GenomicRanges object

library(biomaRt)
library(GenomicRanges)

#Set up an gene annotation template to use
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
genes <- getBM(attributes=c("hgnc_symbol","chromosome_name","start_position","end_position"), mart=mart)
genes <- genes[genes[,1]!="" & genes[,2] %in% c(1:22,"X","Y"),]
xidx <- which(genes[,2]=="X")
yidx <- which(genes[,2]=="Y")
genes[xidx, 2] <- 23
genes[yidx, 2] <- 24
genes[,2] <- sapply(genes[,2],as.integer)
genes <- genes[order(genes[,3]),]
genes <- genes[order(genes[,2]),]
colnames(genes) <- c("GeneSymbol","Chr","Start","End")
genes_GR <- makeGRangesFromDataFrame(genes,keep.extra.columns = TRUE)

B

Store your own data as a GenomicRanges object:

NB - if following from Part II, the df used here is the RecCNV object, produced in Part II

colnames(df) <- c("chr", "start", "end", "extra1", "extra2")
df
chr    start      end extra1  extra2
  1    51598  1500664    226  0.1646
  1  1617778  1653196     12 -0.4115
  1  1653256 15362197   7748  0.0056

df_GR <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
df_GR
GRanges object with 4 ranges and 3 metadata columns:
      seqnames               ranges strand |                      
         <Rle>            <IRanges>  <Rle> | 
  [1]        1 [   51598,  1500664]      * | 
  [2]        1 [ 1617778,  1653196]      * | 
  [3]        1 [ 1653256, 15362197]      * | 
         extra1    extra2
      <integer> <numeric>
  [1]       226    0.1646
  [2]        12   -0.4115
  [3]      7748    0.0056
  -------

C

Overlap your regions with the reference dataset that you created

hits <- findOverlaps(genes_GR, df_GR, type="within")
df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
head(df_ann)
chr   start     end extra1 extra2 GeneSymbol Chr
  1   51598 1500664    226 0.1646     OR4G4P   1
  1   51598 1500664    226 0.1646    OR4G11P   1
  1   51598 1500664    226 0.1646      OR4F5   1
Start    End
52473  54936
62948  63887
69091  70008

D

To make filtering easier, we can further manipulate the data to show the precise co-ordinates of each gene and the region co-ordinates in which it was found to have CNA / CNV:

AberrantRegion <- paste0(df_ann[,1],":", df_ann[,3],"-", df_ann[,4])
GeneRegion <- paste0(df_ann[,7],":", df_ann[,8],"-", df_ann[,9])
Final_regions <- cbind(df_ann[,c(6,2,5)], AberrantRegion, GeneRegion)
Final_regions[seq(1, nrow(Final_regions), 20),]
extra2 chr extra1     AberrantRegion
0.1646   1    226    1:51598-1500664
0.1646   1    226    1:51598-1500664
0.1646   1    226    1:51598-1500664
0.1646   1    226    1:51598-1500664

       GeneRegion
   OR4G4P:1-52473
TUBB8P11:1-808672
FAM132A:1-1177826
TMEM240:1-1470554
ADD COMMENTlink modified 5 hours ago • written 22 months ago by Kevin Blighe55k

@Kevin Blighe Thank you too much for your reply but I'm not familiar with the language of your source code and I'm sure that I have to use GISTIC as mentioned in many papers that i have already read in order to extract a more reliable list of genes and their loci

ADD REPLYlink written 22 months ago by Chaimaa160

if anyone familiar with GISTIC plz give me some guidance

ADD REPLYlink written 22 months ago by Chaimaa160
1

Please explain the exact source from where you downloaded your data. There are various types of copy number data in various states of processing available from different web-sites. The data is usually available as somatic copy number alteration data (SCNA), not copy number variation (CNV) - CNVs are more spoken of in terms of the germline and 'natural' variation in copy number. GISTIC is just one of a few programs that are commonly used to analyse SCNA data for the purposes of 'summarising' the regions by grouping them into large regions more amenable for downstream analysis and interpretation.

Have you taken a look at the documentatation to see what exactly you require to execute the program? - ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm

The code that I put above is code using base R functions (with the exception of biomaRt) that can be used to identify the genes overlapping any type of segment data (chr : startbp : endbp).

Thank you.

ADD REPLYlink written 22 months ago by Kevin Blighe55k
1

To give you an idea, take a look my most recent publication where a main part was to analyse the somatic CNA ata that is available. I used a mixture of programs to do it, including GISTIC2.0: Racial differences in endometrial cancer molecular portraits in The Cancer Genome Atlas.

Look at the section headed 'Somatic mutation and copy number aberrations' in the mathods

ADD REPLYlink written 22 months ago by Kevin Blighe55k

@Kevin Blighe Thanks a lot for your interest and quick reply.

ADD REPLYlink written 22 months ago by Chaimaa160

@Kevin Blighe, I checked your paper and thread through the section" Somatic mutation and copy number aberrations' in the methods"

I got The CNV data level3 from this website:http://gdac.broadinstitute.org and specifically from this link http://firebrowse.org/?cohort=PRAD&download_dialog=true.

I'm interested in getting the list of the significant genes from this data by using GISTIC.

As i know from many papers Gistic2.0 can be used to analyze the copy number dataset (Level 3) for the identification of recurrent regions of copy number alteration and the copy number of genes, and this my goal too.

So the data i got have no gene names.

Additionally, i found that the link from where i got my data have GISTIC results http://firebrowse.org/?cohort=PRAD&download_dialog=true# but i really don't know which file i will download and how to deal with them.

Finally, my main goal is to get a matrix of m*n in which m represent the list of patients(Samples) and n represent the list of genes from this dataset.

@Kevin Blighe Plz, i hope you can help me out to do this job, and i really really appreciate your help!

ADD REPLYlink modified 5 months ago by Kevin Blighe55k • written 22 months ago by Chaimaa160
1

No problem.

Yes, recurrent somatic copy number alterations (recSCNA) are the target (for analysis), and the genes that overlap these. I am almost certain that I also started with the data from Broad Firebrowse, but I will double check when I arrive home.

I will go through the analysis steps one-by-one for you.

ADD REPLYlink modified 15 months ago • written 22 months ago by Kevin Blighe55k

Sorry @Kevin, my question is silly but how I could install GISTIC?

ADD REPLYlink written 11 months ago by A3.7k
1

Hey, the information can be found here: http://portals.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=216&p=t

ADD REPLYlink written 11 months ago by Kevin Blighe55k

@ Kevin Blighe how to save these final regions into a file?

ADD REPLYlink written 16 months ago by Chaimaa160
1

Hey, you can do something like:

write.table(Final_regions, "FinalRegions.csv", sep=",", quote=FALSE, row.names=FALSE)
ADD REPLYlink written 16 months ago by Kevin Blighe55k

So in this case is the df used in part A-D have only Tumor copy number data or normal as well? Thank you very much for your many great posts!

ADD REPLYlink written 13 months ago by jschombe0

Hey, you're welcome. This post (above) is purely about annotating any type of segment data (requires minimum of chr start end). In the context of the related posts, it is recurrent somatic copy number alteration (recurrent sCNA) segments that have had the normal subtracted out. The pipeline is a mess but goes in this order:

ADD REPLYlink written 13 months ago by Kevin Blighe55k

I have completed the first part of the pipeline and produced tumor.all.igv.gistic. However this file does not contain individual data rather it gives me a measure of the significance of the copy number alteration for regions of the genome. To produce a FinalRegions.csv file that matches yours should I be using the cn.tumor dataframe (as shown in Part I) as the dataframe I submit in the code "df_GR <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)"? Could you also tell me how you restructured your CNV data from the structure of df_ann or GenomicRegions to a matrix that can be read into the ComplexHeatmap package which requires barcode as columns and genomic regions as rows? Thanks very much for your help with this.

ADD REPLYlink modified 13 months ago • written 13 months ago by jschombe0

I see. The df in the code above is the same object as RecCNV from Part II. Do you have that object? I will update the code above to reflect this.

The files that are output by the runGAIA function are the ones that I use in Part IV to produce the heatmaps, but I have not put the code on Biostars (it is complex...).

ADD REPLYlink written 13 months ago by Kevin Blighe55k

I understand. I will save RecCNV to file and then run the code producing FinalRegions. Will the data structure of FinalRegions be barcodes as columns and genes as rows? If not, do you recommend using the reshape package to produce such a matrix? Thanks very much for your help with this.

ADD REPLYlink written 13 months ago by jschombe0

In my project code, the output is like this:

GeneSymbol  Aberration  q-value AberrantRegion  GeneRegion
RNU6-904P   Del 0.0280807421052632  2:141915235-142013612   2:141924826-141924926
PRR14   Amp 0.0684173368421053  16:30618633-30986671    16:30662038-30667761
FBRS    Amp 0.0684173368421053  16:30618633-30986671    16:30669752-30682135
RNU6-416P   Amp 0.0684173368421053  16:30618633-30986671    16:30686621-30686723
SRCAP   Amp 0.0684173368421053  16:30618633-30986671    16:30709530-30755602
RNU6-1043P  Amp 0.0684173368421053  16:30618633-30986671    16:30712658-30712756
SNORA30 Amp 0.0684173368421053  16:30618633-30986671    16:30721858-30721986
PHKG2   Amp 0.0684173368421053  16:30618633-30986671    16:30759591-30772490

My project code is different from the above, slightly, because I originally gave the above answer as an independent answer. That was my worry about linking up these steps on Biostars, i.e., they were each independent answers and don't 'harmoniously' link up together. Even a moderately skilled R user should be able to get the exact data that they want, though.

ADD REPLYlink modified 13 months ago • written 13 months ago by Kevin Blighe55k

Sorry Kevin

I know my question is too non-sense

I was given a list of files (one for each patient), for example like this

LP2000104-DNA_A01_vs_LP2000101-DNA_A01.copynumber.caveman.csv

Inside them I have

Chromosome  Start   End Total_CN    Minor_CN
1   10583   1017587 3   1
1   1018144 2466425 5   2
1   2475113 7770901 3   1

Likely I should find genes under positive selection; I looked at GISTIC documentation but what I have does not look like what GISTIC needs at all.

What do you think please? I mean you think I am at which part of CNV analysis by these files?

Thank you for any help

ADD REPLYlink modified 12 months ago • written 12 months ago by A3.7k

From where did you obtain the files?

ADD REPLYlink written 12 months ago by Kevin Blighe55k

From our collaborator in Cambridge

ADD REPLYlink written 12 months ago by A3.7k

Cambridge, England or Cambridge, Massachusetts, USA? In either case, this is not quite what I was looking for... I was implying about the ultimate source of the files, i.e., how they were produced. You should ask the collaborator. The data seems to be non-standard format, but could be implemented in the CNV process that I use in this thread.

ADD REPLYlink written 12 months ago by Kevin Blighe55k

Cambridge in England. They processed their own data and sent us our requested samples. They published their results here. The question behind my files is different with what they have done in paper.

The landscape of selection in 551 esophageal adenocarcinomas defines genomic biomarkers for the clinic
ADD REPLYlink written 12 months ago by A3.7k

Sorry,

My files look the same as the input OP provided screen shot of that only my last column is minor _CN differs with OP's file

ADD REPLYlink written 12 months ago by A3.7k

@Kevin

When I try to keep my sample names with my seg file like below

sample  chr start   end marker_num  seg_mean
s1  1   136962  534547  11  1.311879592
s23 1   562020  672823  16  -0.010048503
s3  1   672948  2692196 1310    0.574913998

Your tutorial finally gives like this some mess

> head(Final_regions[1:2,1:5])
            seg chr marker                  AberrantRegion
9185 -0.4132345   1   3526 s1:10583-5726928
5591  0.1215680   1   3525 s2:10583-5725898
          GeneRegion
9185 DDX11L1:1-11869
5591 DDX11L1:1-11869
>

How I can keep sample name from the scratch without disturbing your code?

ADD REPLYlink written 4 weeks ago by A3.7k
1

You should use my code as a guide to help you through this analysis, but please make modifications where you see that they are required.

ADD REPLYlink written 4 weeks ago by Kevin Blighe55k
4
gravatar for Kevin Blighe
22 months ago by
Kevin Blighe55k
Kevin Blighe55k wrote:

Credits to Tiago Silva; https://f1000research.com/articles/5-1542

Hello, so, here are the steps that I did, for endometrial cancer:

First downloaded copy number data from here: http://firebrowse.org/?cohort=UCEC# (access the data by clicking on the green bar for SNP6 CopyNum). Specifically, the file that you should obtain is for hg19 and will have 'scna' and 'minus_germline_cnv' as parts of its filename.

In my case, the file was called UCEC.Level_3_segmented_scna_minus_germline_cnv_hg19.seg.txt

Then, there are a series of steps that you should do in order to process this data:

Part I

A

separate out the tumour from nomals

require(data.table)
CN <- read.table("UCEC.Level_3_segmented_scna_minus_germline_cnv_hg19.seg.txt", sep="\t", stringsAsFactors=FALSE, header=TRUE)

#   Tumor types range from 01 - 09
#   Normal types from 10 - 19
#   Control samples from 20 - 29
types <- gsub("TCGA-[A-Z0-9]*-[A-Z0-9]*-", "", (gsub("-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN[,1])))
unique(types)

# NB - in the next 2 lines, it is important to account for all types that are listed from unique(types) command
matched.normal <- c("10A", "10B", "11A", "11B")
tumor <- c("01A", "01B", "06A")

#Divide up the data into tumor and normal
tumor.indices <- which(types %in% tumor)
matched.normal.indices <- which(types %in% matched.normal)
CN.tumor <- CN[tumor.indices,]
CN.matched.normal <- CN[matched.normal.indices,]

#Change IDs to allow for matching to metadata
CN.tumor[,1] <- gsub("-[0-9A-Z]*-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN.tumor[,1])
CN.matched.normal[,1] <- gsub("-[0-9A-Z]*-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN.matched.normal[,1])

--------------------

B

Create a markers object

require(gaia)

#Retrieve probes meta file from Broad Institute
gdac.root <- "ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/"
markersMatrix <- read.delim(paste0(gdac.root,"genome.info.6.0_hg19.na31_minus_frequent_nan_probes_sorted_2.1.txt"), as.is=TRUE, header=FALSE)
colnames(markersMatrix) <- c("Probe.Name", "Chromosome", "Start")

#Change sex chr names
unique(markersMatrix$Chromosome)
markersMatrix[which(markersMatrix$Chromosome=="X"),"Chromosome"] <- 23
markersMatrix[which(markersMatrix$Chromosome=="Y"),"Chromosome"] <- 24
markersMatrix$Chromosome <- sapply(markersMatrix$Chromosome, as.integer)

#Create a marker ID
markerID <- apply(markersMatrix, 1, function(x) paste0(x[2], ":", x[3]))

#Remove duplicates
markersMatrix <- markersMatrix[-which(duplicated(markerID)),]

#Filter markersMatrix for common CNV
markerID <- apply(markersMatrix, 1, function(x) paste0(x[2], ":", x[3]))
commonCNV <- read.delim(paste0(gdac.root,"CNV.hg19.bypos.111213.txt"), as.is=TRUE)
commonCNV[,2] <- sapply(commonCNV[,2], as.integer)
commonCNV[,3] <- sapply(commonCNV[,3], as.integer)
commonID <- apply(commonCNV, 1, function(x) paste0(x[2], ":", x[3]))
table(commonID %in% markerID)
table(markerID %in% commonID)
markersMatrix_fil <- markersMatrix[!markerID %in% commonID,]

#Create the markers object
markers_obj <- load_markers(markersMatrix_fil)

-------------------------------

C

# Determine recurrent aberrations in tumour

#Prepare CNV matrix
cnvMatrix <- CN.tumor

#Add label (0 for loss, 1 for gain)
#A segment mean of 0.3 is defined as the cut-off
cnvMatrix <- cbind(cnvMatrix, Label=NA)
cnvMatrix[cnvMatrix$Segment_Mean < -0.3,"Label"] <- 0
cnvMatrix[cnvMatrix$Segment_Mean > 0.3,"Label"] <- 1
cnvMatrix <- cnvMatrix[!is.na(cnvMatrix$Label),]

#Remove segment mean as we now go by the binary classification of gain or loss
cnvMatrix <- cnvMatrix[,-6]
colnames(cnvMatrix) <- c("Sample.Name", "Chromosome", "Start", "End", "Num.of.Markers", "Aberration")

#Substitute Chromosomes "X" and "Y" with "23" and "24"
xidx <- which(cnvMatrix$Chromosome=="X")
yidx <- which(cnvMatrix$Chromosome=="Y")
cnvMatrix[xidx,"Chromosome"] <- 23
cnvMatrix[yidx,"Chromosome"] <- 24
cnvMatrix$Chromosome <- sapply(cnvMatrix$Chromosome, as.integer)

#Run GAIA, which looks for recurrent aberrations in your input file
n <- length(unique(cnvMatrix[,1]))
cnv_obj <- load_cnv(cnvMatrix, markers_obj, n)
results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.15)
ADD COMMENTlink modified 8 months ago • written 22 months ago by Kevin Blighe55k

@ Kevin Blighe Hello Bro, can you help me again, I have installed R and I'm running your code line by line. I arrived in the line

#Create the markers object
markers_obj <- load_markers(markersMatrix_fil) 
Error in load_markers(markersMatrix_fil) : 
  could not find function "load_markers", plz fix this line to continue the other lines plz
ADD REPLYlink modified 22 months ago by genomax78k • written 22 months ago by Chaimaa160
2

Hey dude, ensure that the gaia package loaded correctly. load_markers() is part of that package.

ADD REPLYlink written 22 months ago by Kevin Blighe55k

Do you mean the line require (gaia)? I 'm following your code line by line and now i fail to continue...

ADD REPLYlink written 22 months ago by Chaimaa160
1

That's indeed what Kevin means.

ADD REPLYlink written 22 months ago by WouterDeCoster43k
1

Yep, ensure that gaia installed correctly (source("https://bioconductor.org/biocLite.R"); biocLite("gaia"))

load_markers
Erro: objeto 'load_markers' não encontrado

Load package and then try again:

require(gaia)
Carregando pacotes exigidos: gaia
Warning message:
package ‘gaia’ was built under R version 3.2.5 


load_markers

function (marker_matrix) 
{
    message("Loading Marker Informations")
    chromosomes <- sort(unique(marker_matrix[, 2]))
    chromosome_marker_list <- list()
    end_position <- FALSE
    if (ncol(marker_matrix) == 4) {
        end_position <- TRUE
    }
    for (i in as.numeric(chromosomes)) {
        chr_ids <- which(marker_matrix[, 2] == i)
        tmp_matrix <- matrix(0, 2, length(chr_ids))
        tmp_matrix[1, ] <- marker_matrix[chr_ids, 3]
        if (end_position) {
            tmp_matrix[2, ] <- marker_matrix[chr_ids, 4]
        }
        else {
            tmp_matrix[2, ] <- tmp_matrix[1, ]
        }
        chromosome_marker_list[[i]] <- tmp_matrix
        names(chromosome_marker_list)[[i]] <- i
        message(".", appendLF = FALSE)
    }
    message("\nDone")
    return(chromosome_marker_list)
}
<environment: namespace:gaia>
ADD REPLYlink modified 15 months ago • written 22 months ago by Kevin Blighe55k

Kevin Blighe Thanks Bro, i have installed all the packages successfully. Now, i have the following 2 issues in these corresponding lines

markers_obj <- load_markers(markersMatrix_fil)

Loading Marker Informations ......................Error in chromosome_marker_list[[i]] <- tmp_matrix :

attempt to select more than one element in integerOneIndex

In addition: Warning message:

In load_markers(markersMatrix_fil) : NAs introduced by coercion

colnames(df) <- c("Barcode", "chr", "start", "end", "extra1", "extra2")

Error in colnames<-(*tmp*, value = c("Barcode", "chr", "start", "end", :

attempt to set 'colnames' on an object with less than two dimensions

ADD REPLYlink modified 22 months ago • written 22 months ago by Chaimaa160

Hey Chief, let me take a look later.

ADD REPLYlink written 22 months ago by Kevin Blighe55k

@ Kevin Blighe, sure Bro, take your time and thanks a lot for your reply.

ADD REPLYlink written 22 months ago by Chaimaa160
1

Just to be sure, you are running these commands first, starting with the data that you download from Broad FireBrowse:

Then, you run the steps that I originally mentioned:

It is very important that the load_markers function does not give any warnings. A successful execution of this function will produce:

markers_obj <- load_markers(markersMatrix_fil)
Loading Marker Informations
........................
Done

Sometimes, available memory is an issue.

ADD REPLYlink written 22 months ago by Kevin Blighe55k

@Kevin Blighe Yes I'm following your steps one by one and by the correct order.

ADD REPLYlink modified 5 months ago by Kevin Blighe55k • written 22 months ago by Chaimaa160
1

Which OS? How much RAM?

Can you definitely 100% confirm that each of these steps has completed successfully:

require(gaia)

#Retrieve probes meta file from broadinstitute website
gdac.root <- "ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/"
markersMatrix <- read.delim(paste0(gdac.root,"genome.info.6.0_hg19.na31_minus_frequent_nan_probes_sorted_2.1.txt"), as.is=TRUE, header=FALSE)
colnames(markersMatrix) <- c("Probe.Name", "Chromosome", "Start")

#Change sex chromosome names
unique(markersMatrix$Chromosome)
markersMatrix[which(markersMatrix$Chromosome=="X"),"Chromosome"] <- 23
markersMatrix[which(markersMatrix$Chromosome=="Y"),"Chromosome"] <- 24
markersMatrix$Chromosome <- sapply(markersMatrix$Chromosome, as.integer)

#Create a marker ID
markerID <- apply(markersMatrix, 1, function(x) paste0(x[2], ":", x[3]))

#Remove duplicates
markersMatrix <- markersMatrix[-which(duplicated(markerID)),]

#Filter markersMatrix for common CNV
markerID <- apply(markersMatrix, 1, function(x) paste0(x[2], ":", x[3]))
commonCNV <- read.delim(paste0(gdac.root,"CNV.hg19.bypos.111213.txt"), as.is=TRUE)
commonCNV[,2] <- sapply(commonCNV[,2], as.integer)
commonCNV[,3] <- sapply(commonCNV[,3], as.integer)
commonID <- apply(commonCNV, 1, function(x) paste0(x[2], ":", x[3]))
table(commonID %in% markerID)
table(markerID %in% commonID)
markersMatrix_fil <- markersMatrix[!markerID %in% commonID,]

#Create the markers object
markers_obj <- load_markers(markersMatrix_fil)
ADD REPLYlink written 22 months ago by Kevin Blighe55k

@ Kevin Blighe I do every step and run it carefully, i can see the output variables in the Global environment of R right corner.

Can i share my data with you and you can test your code on it?

My OS is windows10 and RAM is 12GB and my data size is 6.87 MB.

ADD REPLYlink modified 22 months ago • written 22 months ago by Chaimaa160

@ Kevin Blighe Bro, i have fixed the first error. Now, these 2 parts are not working

    cnv_obj <- load_cnv(cnvMatrix, markers_obj, n)
Loading Copy Number Data
Error in matrix(0, length(samples), ncol(markers_list[[chromosomes[i]]])) : 
  non-numeric matrix extent



results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.15)

Performing Data Preprocessing
Error in runGAIA(cnv_obj, markers_obj, output_file_name = "Tumor.All.txt",  : 
  object 'cnv_obj' not found

    colnames(df) <- c("Barcode", "chr", "start", "end", "extra1", "extra2")
Error in `colnames<-`(`*tmp*`, value = c("Barcode", "chr", "start", "end",  : 
  attempt to set 'colnames' on an object with less than two dimensions
ADD REPLYlink written 22 months ago by Chaimaa160
1

There must be something wrong with your cnvMatrix. In my example, I us the endometrial cancer (UCEC) data. Which cancer's data do you have? Just check the output of each command in Steps 1 and 3 in order to ensure that it works.

If you want to tell me the file that you downloaded, then I can also take a look here.

Peace.

ADD REPLYlink modified 15 months ago • written 22 months ago by Kevin Blighe55k

@ Kevin Blighe ok the cancer Data which i'm working on is the prostate adenocarcinoma (PRAD).

ADD REPLYlink written 22 months ago by Chaimaa160

@ Kevin Blighe, Hi i know that i disturb you a lot I apologize for that and i really really appreciate your help! Did y check your code with my data, i really need to do this job so pls give help me out.

ADD REPLYlink modified 5 months ago by Kevin Blighe55k • written 22 months ago by Chaimaa160
1

Hey chief - no problem. I went here: http://firebrowse.org/?cohort=PRAD&download_dialog=true

I then downloaded the file: 'genome_wide_snp_6-FFPE-segmented_scna_minus_germline_cnv_hg19' and then used that as the data input.

I then followed all of my code and it worked fine. The only thing that I notice is that there is no copy number data for normal samples in the PRAD dataset, but this is not important because the germline copy number is already subtracted by Broad Firebrowse.

cnv_obj <- load_cnv(cnvMatrix, markers_obj, n)
Loading Copy Number Data
................................................
Done

Then:

results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.15)

Performing Data Preprocessing

Done

Computing Discontinuity Matrix
........................
Done
Computing Probability Distribution
................................................................................................
Done
Assessing the Significance of Observed Data
................................................
Done
Writing Tumor.All.txt.igv.gistic File for Integrative Genomics Viewer (IGV) Tool
................................................
Done
Running Homogeneous peel-off Algorithm With Significance Threshold of 0.15 and Homogenous Threshold of 0.12
................................................
Done

Writing Output File 'Tumor.All.txt' Containing the Significant Regions

File 'Tumor.All.txt' Saved

The problem that you have may be an issue with a new version of one of the packages. Dude, to help, I have put the output files and my R session online for you. You can download them (3 files) here: https://app.box.com/s/992llscqeabyw3rzrjg5stxd6945r73r (you may have to open an account).

The Tumor.All.txt file contains the significant recurrent somatic copy number alterations (recSCNA). You will want to input these back into R and then annotate them using the first steps that I mention in this thread. You will have to rename the columns to have at least "chr", "start", "end"

ADD REPLYlink written 22 months ago by Kevin Blighe55k

@Kevin Blighe well done Bro!

But when i only downloaded the FFPE file like you and so my question is why do you use FFPE and not "genome_wide_snp_6-segmented_scna_minus_germline_cnv_hg19"? Can you plz answer my last question and i really really appreciated your help.

BEST WISHES.

ADD REPLYlink modified 22 months ago • written 22 months ago by Chaimaa160
1

Hey,

I just chose a file 'at random'. I was only testing. You should use the file more suitable to your experiment.

FFPE tissue is, of course, lower quality.

ADD REPLYlink modified 15 months ago • written 22 months ago by Kevin Blighe55k
1

Wait. Allow me to re-run that using the non-FFPE

ADD REPLYlink modified 22 months ago • written 22 months ago by Kevin Blighe55k

OK i can wait .

ADD REPLYlink modified 5 months ago by Kevin Blighe55k • written 21 months ago by Chaimaa160
1

Here you go, chief (same files but for PRAD.snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.seg.txt): https://app.box.com/s/992llscqeabyw3rzrjg5stxd6945r73r

Dude, the PRAD dataset is very big, so, the Rdata file is ~140 megabytes

ADD REPLYlink written 21 months ago by Kevin Blighe55k

@Kevin Blighe i'm very thankful and happy for the great help you gave me!

Plz can you also do the annotation Step for me I tried in my computer but it show me this error

    df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
Error: cannot allocate vector of size 124.9 Mb
ADD REPLYlink written 21 months ago by Chaimaa160

I have a size issue actually.

ADD REPLYlink written 21 months ago by Chaimaa160
1

Getting it to you now - hang on.

ADD REPLYlink modified 15 months ago • written 21 months ago by Kevin Blighe55k
1

I have uploaded the new Rdata file, and also a list of the final annotated regions. In the 'type' column, the following is true:

  • deletion=0
  • amplification=1

https://app.box.com/s/992llscqeabyw3rzrjg5stxd6945r73r

And that is it: recurrent somatic copy number alterations (recSCNA) in the TCGA PRAD dataset. If you want to cite how the data was processed, then just cite my recent publication: https://www.ncbi.nlm.nih.gov/pubmed/29682207

ADD REPLYlink modified 15 months ago • written 21 months ago by Kevin Blighe55k

@ Kevin Blighe ,Thanks a lot. Yes definitely i will cite your recent publication in my manuscript and Plz if you have any other publications share with me via e-mail or link. Your topic is of great interest for me . go ahead......

ADD REPLYlink modified 5 months ago by Kevin Blighe55k • written 21 months ago by Chaimaa160

@Kevin Blighe Hi bro,
Finally, I got a server and start installing the packages but I failed to install "Gaia" package

**package ‘gaia’ is not available (for R version 3.2.3)**
ADD REPLYlink modified 15 months ago by RamRS25k • written 16 months ago by Chaimaa160
1

Perhaps open a new question, and show that it has a definitive relationship to bioinformatics.

ADD REPLYlink written 16 months ago by Kevin Blighe55k

Ok, Kevin Blighe Bro, but I think this problem has no relation to bioinformatics!

  source("https://www.bioconductor.org/biocLite.R")
    Error in file(filename, "r", encoding = encoding) :
      cannot open the connection
    In addition: Warning message:
    In file(filename, "r", encoding = encoding) :
      unable to connect to 'bioconductor.org' on port 80

. The probem seems related to the server may be

ADD REPLYlink written 16 months ago by Chaimaa160

Try: source("https://bioconductor.org/biocLite.R") (without www)

If that still does not work, then try: source("http://bioconductor.org/biocLite.R") (without http)

If that does not work, then you can post on Stack Exchange

ADD REPLYlink modified 15 months ago • written 16 months ago by Kevin Blighe55k
1

@Kevin Blighe I tried with http and with https and nothing work!
I will try in Stack Exchange thanks always for your great help Kevin

ADD REPLYlink modified 15 months ago by RamRS25k • written 16 months ago by Chaimaa160
1

Please use the formatting bar (especially the code option) to present your post better. It has been done for you this time.
code_formatting

ADD REPLYlink modified 15 months ago • written 22 months ago by WouterDeCoster43k
 @Kevin Blighe Hi, Kevin Bro, Is there any possibility to output The Tumor.All.txt file with the samples too,  means the significant recurrent somatic copy number alterations (recSCNA) with their corresponding samples from the original input file or Barcode name. 
    Then, i can annotate this file? 

    Plz, reply me?
ADD REPLYlink modified 15 months ago • written 15 months ago by Chaimaa160
1

Hi, it would be dishonest and mal-practice for me to continually help you to that extent, and that would defeat the purpose of this website. You will have to find a way using your local resources, preferably in conjunction with your supervisor.

ADD REPLYlink modified 15 months ago • written 15 months ago by Kevin Blighe55k

Sorry Kevin

I have had called copy number variations by ascatNgs and tried to find significant aberrations by GISTIC.2.0 and this lines of my segmentation file

Sample  Chr Start   End Num_Prob    Segment_Mean
t_005   1   13116   2063094 996 -0.722466024
t_005   1   2065339 5926507 2675    -10.68825031
t_005   1   5927738 28678709    14307   -0.722466024
t_005   1   28681947    28707612    19  -10.68825031
t_005   1   28708846    29440873    425 -0.722466024
t_005   1   29442291    31262499    1196    -10.68825031
t_005   1   31263510    121484236   57505   -0.722466024

where I did not find any genes significantly amplified or deleted. My boss now has asked me to give him a list of amplified and deleted regions (and/or genes?) in each sample but I don't know how to do that; Can you please give me a clue by having such a such and where GISTIC not giving any recurrent amplifications/deletions, what should I do? Shall I still go through your tutorial to annotate this data to gene name?

Thank you for any help

ADD REPLYlink written 7 months ago by A3.7k
1

Yes, you could annotate these regions with genes. You could also just define a cut-off for Segment_Mean for amplification and deletion. For example, -10 is surely a deletion.

ADD REPLYlink written 7 months ago by Kevin Blighe55k

Thank you so much for your time and paying attention

For one sample I obtained this

> head(Final_regions)
    GeneSymbol   start    extra2 AberrantRegion        GeneRegion
1      C1orf86 2065339 -10.68825 1:5926507-2675 1:2115903-2144159
1.1        SKI 2065339 -10.68825 1:5926507-2675 1:2160134-2241558
1.2      MORN1 2065339 -10.68825 1:5926507-2675 1:2252692-2323146
1.3       RER1 2065339 -10.68825 1:5926507-2675 1:2323267-2336883
1.4      PEX10 2065339 -10.68825 1:5926507-2675 1:2336236-2345236
1.5      PLCH2 2065339 -10.68825 1:5926507-2675 1:2357419-2436969
> 


> tail(Final_regions)
        GeneSymbol    start    extra2    AberrantRegion
193.712       ARSA 18194188 -10.68825 22:51240820-21994
193.713     SHANK3 18194188 -10.68825 22:51240820-21994
193.714  RNU6-409P 18194188 -10.68825 22:51240820-21994
193.715        ACR 18194188 -10.68825 22:51240820-21994
193.716  RPL23AP82 18194188 -10.68825 22:51240820-21994
193.717     RABL2B 18194188 -10.68825 22:51240820-21994
                  GeneRegion
193.712 22:51061182-51066607
193.713 22:51112843-51171726
193.714 22:51129688-51129791
193.715 22:51176624-51183762
193.716 22:51195376-51239737
193.717 22:51205929-51222091
> 
> 


> dim(Final_regions)
[1] 31500     5

In this data frame only 90 genes have positive (+) sign for extra2 column (segment mean) and rest have negative(-) sign for extra2 column; Does that mean all these genes with negative(-) sign for extra2 column all have been deleted? Such a huge difference is normal?

ADD REPLYlink written 7 months ago by A3.7k

I do not know how you have produced that output. Generally, though, yes, the negative means lower copy number in tumour when compared to normal. It follows that positive means higher copy number in tumour when compared to normal.

ADD REPLYlink modified 7 months ago • written 7 months ago by Kevin Blighe55k

Thank you, I followed your tutorial for annotation; About output, you mean you are not sure how I have called copy number variations?

ADD REPLYlink written 7 months ago by A3.7k
1

Yes, but, now it is okay because you have told me that you have followed my tutorial. In fact, if you go to Part 1C ( C: How to extract the list of genes from TCGA CNV data ), you will see that we define the following:

  • Segment mean < -0.3 = loss
  • Segment mean > 0.3 = gain
ADD REPLYlink written 7 months ago by Kevin Blighe55k

sorry @Kevin

I only have followed your tutorial for annotation here https://www.biostars.org/p/311199/#311444; I mean I have used my own segment file here

https://www.dropbox.com/s/amhlvjbk6mx91tj/segmentationfile.txt?dl=0

I followed your tutorial on my own segment file to identify recurrent sCNA regions but I did not find anything as already GISTIC also did not give me any significant sCNA.

So, my boss asked me a list of amplified or deleted genes (even not significant), I used your tutorial for getting such a list by annotating segment file

ADD REPLYlink written 7 months ago by A3.7k

Sir,

as you mentioned pre-computed GISTIC 2.0 data for "UCEC.Level_3_segmented_scna_minus_germline_cnv_hg19.seg.txt" this file. I am not understanding whether this is input or output of GISTIC 2.0. Is this file compiled version of tcga cnv data? can i use this file for gistic analysis?

Thank you

ADD REPLYlink written 7 weeks ago by akshay_ware10
1

I do not believe the data in UCEC.Level_3_segmented_scna_minus_germline_cnv_hg19.seg.txt is the output of GISTIC; however, it can be used as the input to GISTIC.

The data in UCEC.Level_3_segmented_scna_minus_germline_cnv_hg19.seg.txt is likely produced from Circular Binary Segmentation (CBS).

ADD REPLYlink written 7 weeks ago by Kevin Blighe55k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1708 users visited in the last hour