Annotation of huge number of CNV files
0
0
Entering edit mode
4.6 years ago

Dear all,

I am trying to perform CNV analysis on TCGA data. I have around 1000 CNV files and I need to map the coordination to the gene names. The coordination of files differ from each other.

Is there any annotation system to do this automatically? Something like BiomaRt that we use for mapping Ensembl Ids to gene symbols?

Nazanin Hosseinkhan

CNV annotation TCGA • 4.0k views
0
Entering edit mode

1000 CNV files

It is unclear how the data you have looks like, please share an example.
VCF or bedpe files, perhaps?

0
Entering edit mode

Here is the header of one of my files:

Sample  Chromosome  Start   End Num_Probes  Segment_Mean
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_G10_735552  1   3218610 116928855   65578   0.0161
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_G10_735552  1   116929335   116930416   2   -2.1385
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_G10_735552  1   116931023   245993487   62790   0.0135
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_G10_735552  1   245994442   245998083   10  -0.7988
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_G10_735552  1   246002964   247813706   589 0.0189

2
Entering edit mode

I guess you could convert this in a bed file and take a look at bedtools and annotate the intervals using a bed file off all exons/genes.

1
Entering edit mode

To annotate, just use GenomicRanges, as I do here: C: How to extract the list of genes from TCGA CNV data

However, it looks like you have downloaded the CN data separately for each patient? - you have "1000" CNV files.

You can download pre-computed GISTIC 2.0 scores from here (assuming that you're looking at breast cancer): http://firebrowse.org/?cohort=BRCA#

1. Go to SNP6 CopyNum (at right)
3. Then proceed from here to identify recurrent copy number alterations: C: How to extract the list of genes from TCGA CNV data
4. Then annotate your recurrent CN regions: A: How to extract the list of genes from TCGA CNV data
0
Entering edit mode

Thank u. I want to compare copy number variation between different cancer stages. Can I separate samples from different stages in this file and perform analysis as u suggested to me for each individual stages?

1
Entering edit mode

Yes, I think that sample information is retained in the GISTIC file. You could then separate it into different stages and identify the recurrent copy number alterations separately for each stage. That would be interesting.

The hypothesis, I guess, would be that higher frequency CN alterations correlate with higher tumour stage.

0
Entering edit mode

Thank u so much.

Wishing every thing goes well with you

1
Entering edit mode

خواهش مي كنم

0
Entering edit mode

wow, you know Farsi? It is great!

0
Entering edit mode

No, just an interest in all cultures, and I have many Iranian, Iraqi, Saudi Arabian friends.

1
Entering edit mode

I guessed you may have some Iranian colleagues. The true spelling of خواهش می کنم was so interesting for me:-)

0
Entering edit mode

Hi Kevin, In the last step of running the code I get this error message: "> results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.15)

Performing Data Preprocessing Error in if (length(aberrations) > 0 && aberrations[1] == 0) { : missing value where TRUE/FALSE needed In addition: Warning message: In runGAIA(cnv_obj, markers_obj, output_file_name = "Tumor.All.txt", : NAs introduced by coercion"

I tried to remove NAs from cnv_obj and markers_obj, but nothing changed. Can u guide me how to solve this problem?

0
Entering edit mode

Maybe try to increase the adjusted P value threshold from threshold=0.15 to threshold=0.2?

What is the output of head(cnv_obj) ?

0
Entering edit mode

I repeated the last command with adjusted p-value threshold 0.2,0.25,0.3 .... 1. But I got the same error message.

I could not get the head(cnv_obj) completely because it seems so huge.

Even though I separated different stages, but the whole process is so slow on my laptop. However the only error message that I received is that I already sent you

0
Entering edit mode

That command requires a lot of memory. For the TCGA breast cancer data (~1000 samples), it neither runs on my laptop with 16GB RAM. Do you have access to a supercompute cluster?

1
Entering edit mode

I will try to rerun this command on the server next week. However that server also does not have enough memory I guess.

I tried to increase R memory using "memory.limit(size=100000), however I got the same error message.

0
Entering edit mode

Hi Kevin, The following is the results of "Final_regions <- cbind(CN_ann[,c(6,2,5)], AberrantRegion, GeneRegion)":

extra2  chr extra1  AberrantRegion  GeneRegion
1   -0.0029 1   128686  TCGA-BJ-A0Z5-10A-01D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
49  0.0053  1   129016  TCGA-BJ-A0Z5-01A-11D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
109 0.0029  1   128972  TCGA-BJ-A0Z9-01A-11D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
188 9.00E-04    1   128847  TCGA-BJ-A0ZE-10A-01D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
278 0.0103  1   128923  TCGA-BJ-A192-10A-01D-A13V-01:3218610-247813706  ARHGEF16:1-3370990
444 7.00E-04    1   128933  TCGA-BJ-A28V-01A-11D-A19I-01:3218610-247813706  ARHGEF16:1-3370990
707 0.0012  1   128988  TCGA-BJ-A2NA-10A-01D-A19L-01:3218610-247813706  ARHGEF16:1-3370990
789 0.0076  1   129092  TCGA-BJ-A2NA-01A-12D-A19I-01:3218610-247813706  ARHGEF16:1-3370990
954 1.00E-04    1   129140  TCGA-BJ-A3PU-10A-01D-A21Y-01:3218610-247813706  ARHGEF16:1-3370990
1007    0.0018  1   129141  TCGA-BJ-A3PU-11A-11D-A21Y-01:3218610-247813706  ARHGEF16:1-3370990
1050    0.0038  1   129151  TCGA-BJ-A3PU-01A-11D-A21Y-01:3218610-247813706  ARHGEF16:1-3370990
1091    1.00E-04    1   128620  TCGA-BJ-A45C-10A-01D-A23T-01:3218610-247813706  ARHGEF16:1-3370990
1164    0.0231  1   128565  TCGA-BJ-A45C-01A-11D-A23T-01:3218610-247813706  ARHGEF16:1-3370990
1217    0.0021  1   128596  TCGA-BJ-A45G-10A-01D-A23T-01:3218610-247813706  ARHGEF16:1-3370990
1270    6.00E-04    1   128510  TCGA-BJ-A45G-11A-11D-A23T-01:3218610-247813706  ARHGEF16:1-3370990
1313    0.0029  1   128597  TCGA-BJ-A45G-01A-11D-A23T-01:3218610-247813706  ARHGEF16:1-3370990"


Is this what I have to get?

I am a little confused, because in " C: How to extract the list of genes from TCGA CNV data", you suggested two different codes. I got the above results from the first code.

0
Entering edit mode

By the way, when I got the head of "cnv_obj" from the second code, it was empty, just had the colunm label from X1 to X16384.

0
Entering edit mode

Okay, but did you solve the problem with memory / RAM? Also, can you confirm from where you obtained this data? - the Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/ )?

If you obtained the data direct from the GDC Data Portal, then that explains why you have ~1000 files, for, I assume, the breast cancer (BRCA) dataset. These files would be copy number regions identified in each sample. From these, the task would be:

1. identify somatic copy number alterations (SCNA) by comparing the tumour to normal - this can be done with GISTIC
2. identify recurrent SCNA (using the code, here: C: How to extract the list of genes from TCGA CNV data )
3. annotate the recurent SCNA (using the code, here: https://www.biostars.org/p/311199/#311444)

The other question thread became somewhat confusing.

Even though you have downloaded your ~1000 files, you do not need them. You can skip the first step that I mention (above) and just download the pre-computed SCNA data for BRCA by going here: http://firebrowse.org/?cohort=BRCA# (go to the SNP6 CopyNum and download the file genome_wide_snp_6-segmented_scna_minus_germline_cnv_hg19*). Then, start from Step 2.

1
Entering edit mode

I downloaded the thyroid cancer cnv data (scna minus germline) from FireBrowse as you suggested to me. I categorized the original file in to four stages according to the clinical data provided by FireBrowse.

Then I started to run this code from your comments on the R.3.5.0 installed on our Server:

require(data.table)
types <- gsub("TCGA-[A-Z0-9]*-[A-Z0-9]*-", "", (gsub("-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN[,1]))) unique(types) matched.normal <- c("10A", "10B", "11A", "11B") tumor <- c("01A", "01B", "06A") tumor.indices <- which(types %in% tumor) matched.normal.indices <- which(types %in% matched.normal) CN.tumor <- CN[tumor.indices,] CN.matched.normal <- CN[matched.normal.indices,] CN.tumor[,1] <- gsub("-[0-9A-Z]*-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN.tumor[,1])
CN.matched.normal[,1] <- gsub("-[0-9A-Z]*-[0-9A-Z]*-[0-9A-Z]*-01$", "", CN.matched.normal[,1]) require(gaia) gdac.root <- "ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/" markersMatrix <- read.delim(paste0(gdac.root,"genome.info.6.0_hg19.na31_minus_frequent_nan_probes_sorted_2.1.txt"), as.is=TRUE, header=FALSE)  . . . . . xidx <- which(cnvMatrix$Chromosome=="X") yidx <- which(cnvMatrix$Chromosome=="Y") cnvMatrix[xidx,"Chromosome"] <- 23 cnvMatrix[yidx,"Chromosome"] <- 24 cnvMatrix$Chromosome <- sapply(cnvMatrix\$Chromosome, as.integer) n <- length(unique(cnvMatrix[,1])) cnv_obj <- load_cnv(cnvMatrix, markers_obj, n) results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.2)

However in the last step ( "results.all <- runGAIA(cnv_obj, markers_obj, output_file_name="Tumor.All.txt", aberrations=-1, chromosomes=-1, num_iterations=10, threshold=0.2"), I got the

Performing Data Preprocessing Error in if (length(aberrations) > 0 && aberrations[1] == 0) { : missing value where TRUE/FALSE needed In addition: Warning message: In runGAIA(cnv_obj, markers_obj, output_file_name = "Tumor.All.txt", : NAs introduced by coercion" error message again.


When I read your post (C: How to extract the list of genes from TCGA CNV data ) again, I noticed that the first part of your post suggested different codes:

library(biomaRt)
library(GenomicRanges)
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
genes <- getBM(attributes=c("hgnc_symbol","chromosome_name","start_position","end_position"), mart=mart)
genes <- genes[genes[,1]!="" & genes[,2] %in% c(1:22,"X","Y"),]
xidx <- which(genes[,2]=="X")
yidx <- which(genes[,2]=="Y")
genes[xidx, 2] <- 23
genes[yidx, 2] <- 24
genes[,2] <- sapply(genes[,2],as.integer)
genes <- genes[order(genes[,3]),]
genes <- genes[order(genes[,2]),]
colnames(genes) <- c("GeneSymbol","Chr","Start","End")
genes_GR <- makeGRangesFromDataFrame(genes,keep.extra.columns = TRUE)
colnames(CN) <- c("Barcode", "chr", "start", "end", "extra1", "extra2")
CN

CN_GR <- makeGRangesFromDataFrame(CN, keep.extra.columns = TRUE)
CN_GR

hits <- findOverlaps(genes_GR, CN_GR, type="within")
CN_ann <- cbind(CN[subjectHits(hits),],genes[queryHits(hits),])

AberrantRegion <- paste0(CN_ann[,1],":", CN_ann[,3],"-", CN_ann[,4])
GeneRegion <- paste0(CN_ann[,7],":", CN_ann[,8],"-", CN_ann[,9])
Final_regions <- cbind(CN_ann[,c(6,2,5)], AberrantRegion, GeneRegion)
Final_regions[seq(1, nrow(Final_regions), 20),]


This time it seemed that everything goes well and I got

 extra2  chr extra1  AberrantRegion  GeneRegion
1   -0.0029 1   128686  TCGA-BJ-A0Z5-10A-01D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
49  0.0053  1   129016  TCGA-BJ-A0Z5-01A-11D-A10T-01:3218610-247813706  ARHGEF16:1-3370990


Can you tell me what is the difference between these two punch of codes? Thanks again for your prompt responses Kevin

1
Entering edit mode

The error message is from the runGAIA() command, and it is most likely due to insufficient amount of RAM / memory. This command is supposed to identify the recurrent SCNA regions in your data, i.e., the regions that are more frequently amplified or deleted in your samples.

The output of runGAIA() is then used with the biomaRt code in order to annotate the regions identified by runGAIA()

Currently, the results that you obtained:

1   -0.0029 1   128686  TCGA-BJ-A0Z5-10A-01D-A10T-01:3218610-247813706  ARHGEF16:1-3370990
49  0.0053  1   129016  TCGA-BJ-A0Z5-01A-11D-A10T-01:3218610-247813706  ARHGEF16:1-3370990


These are just your original Firebrowse regions and are not the recurrent SCNA regions.

Of course, you do not have to run the runGAIA() function and can just leave your data as the annotated SCNA regions (with runGAIA(), you would have recurrent SCNA regions).

1
Entering edit mode

Thank you so much for the comment. I will ask for larger amount of memory for the Server. I prefer to use runGAIA() to identify the recurrent SCNA.

0
Entering edit mode

Hi Kevin, I got the final result :"Tumor.all.txt". Now I want to use Biomart to annotate the significant regions.

Do I have to use "Tumor.all.txt" directly? I introduced "Tumor.all.txt" to Biomart, however I got an error message:

CN<-read.table("Tumor.All.txt",header=TRUE,sep="\t")
> colnames(CN) <- c("Barcode", "chr", "start", "end", "extra1", "extra2")
> CN
Barcode chr     start       end extra1     extra2
1      12   0  22360223  22360249     27 0.00000000
2      16   0  87123875  87125822   1948 0.14878710
3       6   0 124234359 124236318   1960 0.01409259
4       7   0  21173828  21180990   7163 0.13414790
5       8   0 139112201 139112236     36 0.00000000
6       7   1  24039878  24040383    506 0.00000000
> CN_GR <- makeGRangesFromDataFrame(CN, keep.extra.columns = TRUE)
> CN_GR
GRanges object with 6 ranges and 3 metadata columns:
seqnames              ranges strand |   Barcode    extra1             extra2
<Rle>           <IRanges>  <Rle> | <integer> <integer>          <numeric>
[1]        0   22360223-22360249      * |        12        27                  0
[2]        0   87123875-87125822      * |        16      1948          0.1487871
[3]        0 124234359-124236318      * |         6      1960 0.0140925857142857
[4]        0   21173828-21180990      * |         7      7163          0.1341479
[5]        0 139112201-139112236      * |         8        36                  0
[6]        1   24039878-24040383      * |         7       506                  0
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
> # Overlap your regions with the reference dataset that you created
>
> hits <- findOverlaps(genes_GR, CN_GR, type="within")
Warning message:
In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:
- in 'x': 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
- in 'y': 0
Make sure to always combine/compare objects based on the same reference
genome (use suppressWarnings() to suppress this warning).
> CN_ann <- cbind(CN[subjectHits(hits),],genes[queryHits(hits),])
[1] Barcode    chr        start      end        extra1     extra2     GeneSymbol Chr        Start      End
<0 rows> (or 0-length row.names)
> //////////////////////////////////////////////////////////////////
Error: unexpected '/' in "/"
> # To make filtering easier, we can further manipulate the data to show the precise co-ordinates of each gene and the region co-ordinates in which it was found to have CNA / CNV:
> AberrantRegion <- paste0(CN_ann[,1],":", CN_ann[,3],"-", CN_ann[,4])
> GeneRegion <- paste0(CN_ann[,7],":", CN_ann[,8],"-", CN_ann[,9])
> Final_regions <- cbind(CN_ann[,c(6,2,5)], AberrantRegion, GeneRegion)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 0, 1

0
Entering edit mode

Can u tell me what threshold should I consider for the qvalue column of the Tumor.all.txt"? Is 0.15 a standard cutoff?or I have use different cutoff?

2
Entering edit mode

For the annotation, you will have to just check the output of each command to ensure that it is working. I can see that some of your chromosome ('seqnames') in CN_GR are labelled as 0, when they should be 1-24 (chr1-22, chrX, chrY).

The cut-off of FDR Q < 0.15 is not really standard - there is no real standard. You can equally adjust it to Q < 0.05 or Q < 0.20. It relates to the False Discovery Rate.

0
Entering edit mode

Thank you Kevin, Can I use VEP(variant effect predictor) for annotating the regions? I tried the VEP and everything seems OK

1
Entering edit mode

Sure, VEP is also a good option outside of R.

0
Entering edit mode

Hi Kevin, If I understood correctly, we can exclude common variations from copy number alteration data using this punch of code:

markerID <- apply(markersMatrix, 1, function(x) paste0(x[2], ":", x[3]))
commonCNV[,2] <- sapply(commonCNV[,2], as.integer)
commonCNV[,3] <- sapply(commonCNV[,3], as.integer)
commonID <- apply(commonCNV, 1, function(x) paste0(x[2], ":", x[3]))
table(commonID %in% markerID)
table(markerID %in% commonID)
markersMatrix_fil <- markersMatrix[!markerID %in% commonID,]


If you remember I decided to annotate the resulted CNA using VEP. Do I have to filter my data using " exclude common variants" in VEP? or they have already been excluded by using above code?

1
Entering edit mode

In this code (above), the common variants are excluded.

0
Entering edit mode

Thank you. So I can consider the entire results as novel

0
Entering edit mode

Hi Kevin, Sorry me for my nonstop questions. I got the annotation results of my copy number alterations from VEP. Now I want to get the CNV plot of my results.

Can you advise me what is the best way to do this? Nazanin

0
Entering edit mode

Hey, at this point, you may want to open a new question and show examples of the data that you have, and an example of the result that you what. This thread is very long, at this stage.

0
Entering edit mode

Sure. I will open a new question and show some parts of my data

0
Entering edit mode

hI @nazaninhoseinkhan how you divided the original file into four stages plz share this information, i need it ???

1
Entering edit mode

Hi, I used the clinical data provided for my cancer of interest. I filter the clinical data for the specific stage and then use the TCGA codes for separating the original CNV files.

0
Entering edit mode

@nazaninhoseinkhan Hi, There are many samples in CNV data and every sample correspond to 23 chromosomes. which sample i have to take and compare it with the clinical data samples with their corresponding stages??? Plz help me in this issue and if you have any codes plz share it, friend,

1
Entering edit mode

If I remember, there were 4 different files for each sample.

normal hg19, tumor hg19, normal hg18 and tumor hg18.

Depends on what reference genome version (hg18 or hg19) you want to use you can select them.

0
Entering edit mode

from where can i get normal hg19, tumor hg19?

1
Entering edit mode

On Broad Firebrowse, there should be just a single file, which contains the somatic copy number alterations for all samples in a specific TCGA dataset. These relate to the copy number regions called in the sampels, minus the calls in the matched germline / normal samples.

Conversely, if you download the data directly from the GDC website, then you will have to apply a lot of extra processing on these sampels. The GDC contains a single file per sample.

Final: The data on the Broad Firebrowse has already processed the GDC data for you, and saves you a lot of time.

0
Entering edit mode

Hi @nazaninhoseinkhan, could you plz tell me how you solve the memory issue?? I have faced the same problem.

1
Entering edit mode

I re-run my analysis on a cluster (server) and the problem was solved.

0
Entering edit mode

is this server available in your LAB or what?

1
Entering edit mode

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below: