Question: Annotation of huge number of CNV files
0
gravatar for nazaninhoseinkhan
15 months ago by
Iran, Islamic Republic Of
nazaninhoseinkhan380 wrote:

Dear all,

I am trying to perform CNV analysis on TCGA data. I have around 1000 CNV files and I need to map the coordination to the gene names. The coordination of files differ from each other.

Is there any annotation system to do this automatically? Something like BiomaRt that we use for mapping Ensembl Ids to gene symbols?

I am looking forward to your comments

Nazanin Hosseinkhan

cnv annotation tcga • 1.2k views
ADD COMMENTlink written 15 months ago by nazaninhoseinkhan380

1000 CNV files

It is unclear how the data you have looks like, please share an example.
VCF or bedpe files, perhaps?

ADD REPLYlink written 15 months ago by WouterDeCoster42k

Here is the header of one of my files:

Sample  Chromosome  Start   End Num_Probes  Segment_Mean
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_G10_735552  1   3218610 116928855   65578   0.0161
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_G10_735552  1   116929335   116930416   2   -2.1385
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_G10_735552  1   116931023   245993487   62790   0.0135
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_G10_735552  1   245994442   245998083   10  -0.7988
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_G10_735552  1   246002964   247813706   589 0.0189
ADD REPLYlink modified 15 months ago by WouterDeCoster42k • written 15 months ago by nazaninhoseinkhan380
2

I guess you could convert this in a bed file and take a look at bedtools and annotate the intervals using a bed file off all exons/genes.

ADD REPLYlink written 15 months ago by WouterDeCoster42k
1

To annotate, just use GenomicRanges, as I do here: C: How to extract the list of genes from TCGA CNV data

However, it looks like you have downloaded the CN data separately for each patient? - you have "1000" CNV files.

You can download pre-computed GISTIC 2.0 scores from here (assuming that you're looking at breast cancer): http://firebrowse.org/?cohort=BRCA#

  1. Go to SNP6 CopyNum (at right)
  2. Download genome_wide_snp_6-segmented_scna_hg19 (MD5)
  3. Then proceed from here to identify recurrent copy number alterations: C: How to extract the list of genes from TCGA CNV data
  4. Then annotate your recurrent CN regions: A: How to extract the list of genes from TCGA CNV data
ADD REPLYlink modified 15 months ago • written 15 months ago by Kevin Blighe52k

Thank u. I want to compare copy number variation between different cancer stages. Can I separate samples from different stages in this file and perform analysis as u suggested to me for each individual stages?

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380
1

Yes, I think that sample information is retained in the GISTIC file. You could then separate it into different stages and identify the recurrent copy number alterations separately for each stage. That would be interesting.

The hypothesis, I guess, would be that higher frequency CN alterations correlate with higher tumour stage.

ADD REPLYlink modified 15 months ago • written 15 months ago by Kevin Blighe52k

Thank u so much.

Wishing every thing goes well with you

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380
1

خواهش مي كنم

ADD REPLYlink modified 15 months ago • written 15 months ago by Kevin Blighe52k

wow, you know Farsi? It is great!

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380

No, just an interest in all cultures, and I have many Iranian, Iraqi, Saudi Arabian friends.

ADD REPLYlink written 15 months ago by Kevin Blighe52k
1

I guessed you may have some Iranian colleagues. The true spelling of خواهش می کنم was so interesting for me:-)

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380

Hi Kevin, In the last step of running the code I get this error message: "> results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.15)

Performing Data Preprocessing Error in if (length(aberrations) > 0 && aberrations[1] == 0) { : missing value where TRUE/FALSE needed In addition: Warning message: In runGAIA(cnv_obj, markers_obj, output_file_name = "Tumor.All.txt", : NAs introduced by coercion"

I tried to remove NAs from cnv_obj and markers_obj, but nothing changed. Can u guide me how to solve this problem?

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380

Maybe try to increase the adjusted P value threshold from threshold=0.15 to threshold=0.2?

What is the output of head(cnv_obj) ?

ADD REPLYlink written 15 months ago by Kevin Blighe52k

I repeated the last command with adjusted p-value threshold 0.2,0.25,0.3 .... 1. But I got the same error message.

I could not get the head(cnv_obj) completely because it seems so huge.

Even though I separated different stages, but the whole process is so slow on my laptop. However the only error message that I received is that I already sent you

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380

That command requires a lot of memory. For the TCGA breast cancer data (~1000 samples), it neither runs on my laptop with 16GB RAM. Do you have access to a supercompute cluster?

ADD REPLYlink written 15 months ago by Kevin Blighe52k
1

I will try to rerun this command on the server next week. However that server also does not have enough memory I guess.

I tried to increase R memory using "memory.limit(size=100000), however I got the same error message.

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380

Hi Kevin, The following is the results of "Final_regions <- cbind(CN_ann[,c(6,2,5)], AberrantRegion, GeneRegion)":

extra2  chr extra1  AberrantRegion  GeneRegion
1   -0.0029 1   128686  TCGA-BJ-A0Z5-10A-01D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
49  0.0053  1   129016  TCGA-BJ-A0Z5-01A-11D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
109 0.0029  1   128972  TCGA-BJ-A0Z9-01A-11D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
188 9.00E-04    1   128847  TCGA-BJ-A0ZE-10A-01D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
278 0.0103  1   128923  TCGA-BJ-A192-10A-01D-A13V-01:3218610-247813706  ARHGEF16:1-3370990
444 7.00E-04    1   128933  TCGA-BJ-A28V-01A-11D-A19I-01:3218610-247813706  ARHGEF16:1-3370990
707 0.0012  1   128988  TCGA-BJ-A2NA-10A-01D-A19L-01:3218610-247813706  ARHGEF16:1-3370990
789 0.0076  1   129092  TCGA-BJ-A2NA-01A-12D-A19I-01:3218610-247813706  ARHGEF16:1-3370990
954 1.00E-04    1   129140  TCGA-BJ-A3PU-10A-01D-A21Y-01:3218610-247813706  ARHGEF16:1-3370990
1007    0.0018  1   129141  TCGA-BJ-A3PU-11A-11D-A21Y-01:3218610-247813706  ARHGEF16:1-3370990
1050    0.0038  1   129151  TCGA-BJ-A3PU-01A-11D-A21Y-01:3218610-247813706  ARHGEF16:1-3370990
1091    1.00E-04    1   128620  TCGA-BJ-A45C-10A-01D-A23T-01:3218610-247813706  ARHGEF16:1-3370990
1164    0.0231  1   128565  TCGA-BJ-A45C-01A-11D-A23T-01:3218610-247813706  ARHGEF16:1-3370990
1217    0.0021  1   128596  TCGA-BJ-A45G-10A-01D-A23T-01:3218610-247813706  ARHGEF16:1-3370990
1270    6.00E-04    1   128510  TCGA-BJ-A45G-11A-11D-A23T-01:3218610-247813706  ARHGEF16:1-3370990
1313    0.0029  1   128597  TCGA-BJ-A45G-01A-11D-A23T-01:3218610-247813706  ARHGEF16:1-3370990"

Is this what I have to get?

I am a little confused, because in " C: How to extract the list of genes from TCGA CNV data", you suggested two different codes. I got the above results from the first code.

ADD REPLYlink modified 15 months ago • written 15 months ago by nazaninhoseinkhan380

By the way, when I got the head of "cnv_obj" from the second code, it was empty, just had the colunm label from X1 to X16384.

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380

Okay, but did you solve the problem with memory / RAM? Also, can you confirm from where you obtained this data? - the Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/ )?

If you obtained the data direct from the GDC Data Portal, then that explains why you have ~1000 files, for, I assume, the breast cancer (BRCA) dataset. These files would be copy number regions identified in each sample. From these, the task would be:

  1. identify somatic copy number alterations (SCNA) by comparing the tumour to normal - this can be done with GISTIC
  2. identify recurrent SCNA (using the code, here: C: How to extract the list of genes from TCGA CNV data )
  3. annotate the recurent SCNA (using the code, here: https://www.biostars.org/p/311199/#311444)

The other question thread became somewhat confusing.

Even though you have downloaded your ~1000 files, you do not need them. You can skip the first step that I mention (above) and just download the pre-computed SCNA data for BRCA by going here: http://firebrowse.org/?cohort=BRCA# (go to the SNP6 CopyNum and download the file genome_wide_snp_6-segmented_scna_minus_germline_cnv_hg19*). Then, start from Step 2.

ADD REPLYlink modified 15 months ago • written 15 months ago by Kevin Blighe52k
1

I downloaded the thyroid cancer cnv data (scna minus germline) from FireBrowse as you suggested to me. I categorized the original file in to four stages according to the clinical data provided by FireBrowse.

Then I started to run this code from your comments on the R.3.5.0 installed on our Server:

require(data.table)
CN <- read.table("CNV-stageii.txt", sep="\t", stringsAsFactors=FALSE, header=TRUE) 
types <- gsub("TCGA-[A-Z0-9]*-[A-Z0-9]*-", "", (gsub("-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN[,1])))
unique(types)
matched.normal <- c("10A", "10B", "11A", "11B")
tumor <- c("01A", "01B", "06A") 
tumor.indices <- which(types %in% tumor)
matched.normal.indices <- which(types %in% matched.normal)
CN.tumor <- CN[tumor.indices,]
CN.matched.normal <- CN[matched.normal.indices,]
CN.tumor[,1] <- gsub("-[0-9A-Z]*-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN.tumor[,1])
CN.matched.normal[,1] <- gsub("-[0-9A-Z]*-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN.matched.normal[,1])
require(gaia)
gdac.root <- "ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/"
markersMatrix <- read.delim(paste0(gdac.root,"genome.info.6.0_hg19.na31_minus_frequent_nan_probes_sorted_2.1.txt"), as.is=TRUE, header=FALSE)

. . . . . xidx <- which(cnvMatrix$Chromosome=="X") yidx <- which(cnvMatrix$Chromosome=="Y") cnvMatrix[xidx,"Chromosome"] <- 23 cnvMatrix[yidx,"Chromosome"] <- 24 cnvMatrix$Chromosome <- sapply(cnvMatrix$Chromosome, as.integer) n <- length(unique(cnvMatrix[,1])) cnv_obj <- load_cnv(cnvMatrix, markers_obj, n) results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.2)

However in the last step ( "results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.2"), I got the

Performing Data Preprocessing Error in if (length(aberrations) > 0 && aberrations[1] == 0) { : missing value where TRUE/FALSE needed In addition: Warning message: In runGAIA(cnv_obj, markers_obj, output_file_name = "Tumor.All.txt", : NAs introduced by coercion" error message again.

When I read your post (C: How to extract the list of genes from TCGA CNV data ) again, I noticed that the first part of your post suggested different codes:

library(biomaRt)
library(GenomicRanges) 
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
genes <- getBM(attributes=c("hgnc_symbol","chromosome_name","start_position","end_position"), mart=mart)
genes <- genes[genes[,1]!="" & genes[,2] %in% c(1:22,"X","Y"),]
xidx <- which(genes[,2]=="X")
yidx <- which(genes[,2]=="Y")
genes[xidx, 2] <- 23
genes[yidx, 2] <- 24
genes[,2] <- sapply(genes[,2],as.integer)
genes <- genes[order(genes[,3]),]
genes <- genes[order(genes[,2]),]
colnames(genes) <- c("GeneSymbol","Chr","Start","End")
genes_GR <- makeGRangesFromDataFrame(genes,keep.extra.columns = TRUE)
colnames(CN) <- c("Barcode", "chr", "start", "end", "extra1", "extra2")
CN

CN_GR <- makeGRangesFromDataFrame(CN, keep.extra.columns = TRUE)
CN_GR

hits <- findOverlaps(genes_GR, CN_GR, type="within")
CN_ann <- cbind(CN[subjectHits(hits),],genes[queryHits(hits),])
head(CN_ann)

AberrantRegion <- paste0(CN_ann[,1],":", CN_ann[,3],"-", CN_ann[,4])
GeneRegion <- paste0(CN_ann[,7],":", CN_ann[,8],"-", CN_ann[,9])
Final_regions <- cbind(CN_ann[,c(6,2,5)], AberrantRegion, GeneRegion)
Final_regions[seq(1, nrow(Final_regions), 20),]

This time it seemed that everything goes well and I got

 extra2  chr extra1  AberrantRegion  GeneRegion
    1   -0.0029 1   128686  TCGA-BJ-A0Z5-10A-01D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
    49  0.0053  1   129016  TCGA-BJ-A0Z5-01A-11D-A10T-01:3218610-247813706  ARHGEF16:1-3370990

Can you tell me what is the difference between these two punch of codes? Thanks again for your prompt responses Kevin

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380
1

The error message is from the runGAIA() command, and it is most likely due to insufficient amount of RAM / memory. This command is supposed to identify the recurrent SCNA regions in your data, i.e., the regions that are more frequently amplified or deleted in your samples.

The output of runGAIA() is then used with the biomaRt code in order to annotate the regions identified by runGAIA()

Currently, the results that you obtained:

1   -0.0029 1   128686  TCGA-BJ-A0Z5-10A-01D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
49  0.0053  1   129016  TCGA-BJ-A0Z5-01A-11D-A10T-01:3218610-247813706  ARHGEF16:1-3370990

These are just your original Firebrowse regions and are not the recurrent SCNA regions.

Of course, you do not have to run the runGAIA() function and can just leave your data as the annotated SCNA regions (with runGAIA(), you would have recurrent SCNA regions).

ADD REPLYlink modified 15 months ago • written 15 months ago by Kevin Blighe52k
1

Thank you so much for the comment. I will ask for larger amount of memory for the Server. I prefer to use runGAIA() to identify the recurrent SCNA.

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380

Hi Kevin, I got the final result :"Tumor.all.txt". Now I want to use Biomart to annotate the significant regions.

Do I have to use "Tumor.all.txt" directly? I introduced "Tumor.all.txt" to Biomart, however I got an error message:

CN<-read.table("Tumor.All.txt",header=TRUE,sep="\t")
> colnames(CN) <- c("Barcode", "chr", "start", "end", "extra1", "extra2")
> CN
  Barcode chr     start       end extra1     extra2
1      12   0  22360223  22360249     27 0.00000000
2      16   0  87123875  87125822   1948 0.14878710
3       6   0 124234359 124236318   1960 0.01409259
4       7   0  21173828  21180990   7163 0.13414790
5       8   0 139112201 139112236     36 0.00000000
6       7   1  24039878  24040383    506 0.00000000
> CN_GR <- makeGRangesFromDataFrame(CN, keep.extra.columns = TRUE)
> CN_GR
GRanges object with 6 ranges and 3 metadata columns:
      seqnames              ranges strand |   Barcode    extra1             extra2
         <Rle>           <IRanges>  <Rle> | <integer> <integer>          <numeric>
  [1]        0   22360223-22360249      * |        12        27                  0
  [2]        0   87123875-87125822      * |        16      1948          0.1487871
  [3]        0 124234359-124236318      * |         6      1960 0.0140925857142857
  [4]        0   21173828-21180990      * |         7      7163          0.1341479
  [5]        0 139112201-139112236      * |         8        36                  0
  [6]        1   24039878-24040383      * |         7       506                  0
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
> # Overlap your regions with the reference dataset that you created
> 
> hits <- findOverlaps(genes_GR, CN_GR, type="within")
Warning message:
In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
  - in 'y': 0
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
> CN_ann <- cbind(CN[subjectHits(hits),],genes[queryHits(hits),])
> head(CN_ann)
 [1] Barcode    chr        start      end        extra1     extra2     GeneSymbol Chr        Start      End       
<0 rows> (or 0-length row.names)
> //////////////////////////////////////////////////////////////////
Error: unexpected '/' in "/"
> # To make filtering easier, we can further manipulate the data to show the precise co-ordinates of each gene and the region co-ordinates in which it was found to have CNA / CNV:
> AberrantRegion <- paste0(CN_ann[,1],":", CN_ann[,3],"-", CN_ann[,4])
> GeneRegion <- paste0(CN_ann[,7],":", CN_ann[,8],"-", CN_ann[,9])
> Final_regions <- cbind(CN_ann[,c(6,2,5)], AberrantRegion, GeneRegion)
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 0, 1
ADD REPLYlink written 15 months ago by nazaninhoseinkhan380

Can u tell me what threshold should I consider for the qvalue column of the Tumor.all.txt"? Is 0.15 a standard cutoff?or I have use different cutoff?

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380
2

For the annotation, you will have to just check the output of each command to ensure that it is working. I can see that some of your chromosome ('seqnames') in CN_GR are labelled as 0, when they should be 1-24 (chr1-22, chrX, chrY).

The cut-off of FDR Q < 0.15 is not really standard - there is no real standard. You can equally adjust it to Q < 0.05 or Q < 0.20. It relates to the False Discovery Rate.

ADD REPLYlink written 15 months ago by Kevin Blighe52k

Thank you Kevin, Can I use VEP(variant effect predictor) for annotating the regions? I tried the VEP and everything seems OK

ADD REPLYlink written 15 months ago by nazaninhoseinkhan380
1

Sure, VEP is also a good option outside of R.

ADD REPLYlink written 15 months ago by Kevin Blighe52k

Hi Kevin, If I understood correctly, we can exclude common variations from copy number alteration data using this punch of code:

markerID <- apply(markersMatrix, 1, function(x) paste0(x[2], ":", x[3]))
commonCNV <- read.delim(paste0(gdac.root,"CNV.hg19.bypos.111213.txt"), as.is=TRUE)
commonCNV[,2] <- sapply(commonCNV[,2], as.integer)
commonCNV[,3] <- sapply(commonCNV[,3], as.integer)
commonID <- apply(commonCNV, 1, function(x) paste0(x[2], ":", x[3]))
table(commonID %in% markerID)
table(markerID %in% commonID)
markersMatrix_fil <- markersMatrix[!markerID %in% commonID,]

If you remember I decided to annotate the resulted CNA using VEP. Do I have to filter my data using " exclude common variants" in VEP? or they have already been excluded by using above code?

ADD REPLYlink written 14 months ago by nazaninhoseinkhan380
1

In this code (above), the common variants are excluded.

ADD REPLYlink written 14 months ago by Kevin Blighe52k

Thank you. So I can consider the entire results as novel

ADD REPLYlink written 14 months ago by nazaninhoseinkhan380

Hi Kevin, Sorry me for my nonstop questions. I got the annotation results of my copy number alterations from VEP. Now I want to get the CNV plot of my results.

Can you advise me what is the best way to do this? Nazanin

ADD REPLYlink written 14 months ago by nazaninhoseinkhan380

Hey, at this point, you may want to open a new question and show examples of the data that you have, and an example of the result that you what. This thread is very long, at this stage.

ADD REPLYlink written 14 months ago by Kevin Blighe52k

Sure. I will open a new question and show some parts of my data

ADD REPLYlink written 14 months ago by nazaninhoseinkhan380

hI @nazaninhoseinkhan how you divided the original file into four stages plz share this information, i need it ???

ADD REPLYlink written 13 months ago by Chaimaa150
1

Hi, I used the clinical data provided for my cancer of interest. I filter the clinical data for the specific stage and then use the TCGA codes for separating the original CNV files.

ADD REPLYlink written 13 months ago by nazaninhoseinkhan380

@nazaninhoseinkhan Hi, There are many samples in CNV data and every sample correspond to 23 chromosomes. which sample i have to take and compare it with the clinical data samples with their corresponding stages??? Plz help me in this issue and if you have any codes plz share it, friend,

ADD REPLYlink modified 13 months ago by Kevin Blighe52k • written 13 months ago by Chaimaa150
1

If I remember, there were 4 different files for each sample.

normal hg19, tumor hg19, normal hg18 and tumor hg18.

Depends on what reference genome version (hg18 or hg19) you want to use you can select them.

ADD REPLYlink written 13 months ago by nazaninhoseinkhan380

from where can i get normal hg19, tumor hg19?

ADD REPLYlink written 13 months ago by Chaimaa150
1

On Broad Firebrowse, there should be just a single file, which contains the somatic copy number alterations for all samples in a specific TCGA dataset. These relate to the copy number regions called in the sampels, minus the calls in the matched germline / normal samples.

Conversely, if you download the data directly from the GDC website, then you will have to apply a lot of extra processing on these sampels. The GDC contains a single file per sample.

Final: The data on the Broad Firebrowse has already processed the GDC data for you, and saves you a lot of time.

ADD REPLYlink written 13 months ago by Kevin Blighe52k

Hi @nazaninhoseinkhan, could you plz tell me how you solve the memory issue?? I have faced the same problem.

ADD REPLYlink written 13 months ago by Chaimaa150
1

I re-run my analysis on a cluster (server) and the problem was solved.

ADD REPLYlink written 13 months ago by nazaninhoseinkhan380

is this server available in your LAB or what?

ADD REPLYlink written 13 months ago by Chaimaa150
1

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 15 months ago by WouterDeCoster42k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 745 users visited in the last hour