Question

How To Calculate Degree Of Deletion And Amplification Of Cnv Given Snp Array Data From Tcga?

1

10.7 years ago

jessada115 ▴ 40

I have CNV SNP array data from TCGA that looks like this:

Sample    Chromosome    Start    End    Num_Probes    Segment_Mean
Sample1    1    61735    757469    46    0.5909
Sample1    1    757923    12852748    6470    -0.1666
Sample1    1    12857863    13776072    94    0.2141
Sample1    1    13776828    16149915    1792    -0.1672
Sample1    1    16153497    16155010    10    1.1636
Sample1    1    16165661    17012422    355    -0.1473
Sample1    1    17012456    17247727    81    0.1974
Sample1    1    17247845    25583341    5292    -0.1525
Sample1    1    25593128    25611452    14    -2.5747

and I'd like to convert it into a format that looks like this:

Gene1    0.2729
Gene2    -0.5803
Gene3    0.9857

In the result, '0.2729', '-0.5803', and '0.9857' are the degrees of deletion or amplification, and 'Gene1', 'Gene2', 'Gene3' should be named according to the HUGO standard.

Where can I find tools that can do this kind of annotation?

cnv tcga snp gene • 7.6k views

ADD COMMENT • link updated 10.6 years ago by Yamol ▴ 40 • written 10.7 years ago by jessada115 ▴ 40

score 2 · Answer 1 · 2013-10-10

2

10.6 years ago

Yamol ▴ 40

What do Num_Probes and Segment_Mean stand for?

ADD COMMENT • link 10.6 years ago by Yamol ▴ 40

0

The number of consecutive probes that comprise that segment, and the mean value of those probes. See the documentation for the R DNAcopy package for more details.

ADD REPLY • link 10.6 years ago by Chris Miller 22k

0

Do you mean that Segment_Mean stands for "log2(Detected Number/2)"? So values > 0 are amplifications and values < 0 are deletions? And it seems there is no need to use Num_Probes for CNV.

ADD REPLY • link 10.6 years ago by Yamol ▴ 40

1

You might not need the number of probes, but for filtering and QC purposes it can be invaluable, because a) probes are not evenly spaced and b) segments defined by larger numbers of probes are generally higher-confidence.
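
As a rough illustration of that kind of filtering, here is a minimal Python sketch that drops segments supported by few probes. The cutoff `MIN_PROBES` and the function name `filter_segments` are assumptions made up for this example; the right cutoff depends on the platform and the analysis.

```python
# Arbitrary illustrative cutoff, not a recommended value.
MIN_PROBES = 20

# A few rows from the question:
# (chromosome, start, end, num_probes, segment_mean)
segments = [
    ("1", 61735, 757469, 46, 0.5909),
    ("1", 16153497, 16155010, 10, 1.1636),
    ("1", 17247845, 25583341, 5292, -0.1525),
    ("1", 25593128, 25611452, 14, -2.5747),
]


def filter_segments(segs, min_probes=MIN_PROBES):
    """Keep only segments defined by at least min_probes probes."""
    return [s for s in segs if s[3] >= min_probes]


kept = filter_segments(segments)
for chrom, start, end, num_probes, segment_mean in kept:
    print(chrom, start, end, num_probes, segment_mean)
```

Note that with this cutoff the two most extreme segments in the example (10 and 14 probes) are the ones removed, so a hard filter like this can also discard small focal events.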

ADD REPLY • link 10.6 years ago by Chris Miller 22k

1

It's not quite that simple. If your segment_mean is 0.07 (~= 2.1 copies), it's not particularly accurate to call that an amplification. The difference from 2 is usually just a result of noise. Setting reasonable thresholds for gain and loss is a hard problem, especially when you take into account things like subclonal copy number events in cancer.

ADD REPLY • link 10.6 years ago by Chris Miller 22k

0

Thanks so much! I really appreciate your help with my PhD research!

ADD REPLY • link 10.6 years ago by Yamol ▴ 40

0

How did you calculate that there are ~=2.1 copies if the segment_mean is 0.07?

ADD REPLY • link 8.1 years ago by khagay • 0

1

2^0.07*2 = 2.099433

I assume that I rounded :)

(edit - I screwed up while typing in a meeting earlier. You raise 2 to the power of the segment mean, then multiply by two, since the assumption is that the normal sample is diploid.)
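
The same arithmetic as a small Python sketch, assuming Segment_Mean is log2(tumor copies / normal copies) and the normal sample is diploid. The function name `log2_ratio_to_copies` and the `normal_ploidy` parameter are made up for this example.

```python
def log2_ratio_to_copies(segment_mean, normal_ploidy=2):
    """Convert a log2 copy-number ratio to an estimated copy number.

    Assumes segment_mean = log2(copies / normal_ploidy).
    """
    return normal_ploidy * 2 ** segment_mean


print(log2_ratio_to_copies(0.07))   # the example discussed above
print(log2_ratio_to_copies(0.0))    # no change from the normal ploidy
print(log2_ratio_to_copies(-1.0))   # half the normal ploidy
```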

ADD REPLY • link 8.1 years ago by Chris Miller 22k

score 1 · Answer 2 · 2013-09-02

To my knowledge, no such conversion tool exists. However, I have three solutions for you. The first two require coding skills.

The first: Get the coordinates of all the RefSeq genes from UCSC (Tools -> Table Browser). Then match the coordinates between the RefSeq file and your CNV file. You can compute the degree of deletion/amplification for each gene by taking, for example, the absolute median of the segment values that overlap it.

The second: Use the R package "cgdsr", which provides a basic set of R functions for querying the Cancer Genomics Data Server (CGDS), and in particular TCGA data.

The third: The cBio Cancer Genomics Portal provides visualization, analysis and download of large-scale cancer genomics data sets, including TCGA data. http://www.cbioportal.org/public-portal/
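
A minimal Python sketch of the first solution, under some loudly stated assumptions: the gene names and coordinates below are hypothetical placeholders (in practice they would come from the RefSeq table downloaded from UCSC), and the plain median is used instead of the absolute median so that deletions stay negative, as in the desired output. The function name `gene_scores` is also made up for this example.

```python
from statistics import median

# Segments from the question: (chromosome, start, end, segment_mean)
segments = [
    ("1", 61735, 757469, 0.5909),
    ("1", 757923, 12852748, -0.1666),
    ("1", 12857863, 13776072, 0.2141),
]

# Hypothetical gene coordinates, for illustration only.
genes = {
    "GeneA": ("1", 100000, 200000),      # inside the first segment
    "GeneB": ("1", 700000, 800000),      # spans the first two segments
    "GeneC": ("1", 20000000, 20010000),  # not covered by any segment above
}


def gene_scores(genes, segments):
    """Return {gene: median Segment_Mean of overlapping segments, or None}."""
    scores = {}
    for name, (g_chrom, g_start, g_end) in genes.items():
        # Two intervals overlap when each one starts before the other ends.
        hits = [
            mean
            for chrom, start, end, mean in segments
            if chrom == g_chrom and start <= g_end and end >= g_start
        ]
        scores[name] = median(hits) if hits else None
    return scores


scores = gene_scores(genes, segments)
for name, score in scores.items():
    print(name, score)
```

Genes with no overlapping segment get `None` here rather than 0, so that "no data" is not confused with "no copy number change".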

score 0 · Answer 3 · 2013-09-02

This is a straightforward coding exercise that can be accomplished with a few lines of Perl, or by using something like BEDTools.

Essentially, you're going to take a file containing coordinates for every gene, and intersect it with the regions of copy number alteration. Watch out for edge cases - what happens when a gene spans two or more copy number regions?
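
One possible way to handle that edge case, sketched in Python: average the Segment_Mean values over the gene, weighting each segment by the number of bases it shares with the gene. This is only one of several reasonable choices (taking the most extreme segment is another), and it assumes inclusive coordinates. The function name and the gene coordinates are hypothetical.

```python
def weighted_segment_mean(gene_start, gene_end, segments):
    """Average Segment_Mean over a gene, weighted by overlapping bases.

    segments: iterable of (start, end, segment_mean) tuples on the gene's
    chromosome, with inclusive coordinates (an assumption of this sketch).
    Returns None if no segment overlaps the gene.
    """
    total_bases = 0
    weighted_sum = 0.0
    for start, end, mean in segments:
        overlap = min(gene_end, end) - max(gene_start, start) + 1
        if overlap > 0:
            total_bases += overlap
            weighted_sum += overlap * mean
    return weighted_sum / total_bases if total_bases else None


# First two chromosome 1 segments from the question.
segs = [(61735, 757469, 0.5909), (757923, 12852748, -0.1666)]

# Hypothetical gene spanning the breakpoint between the two segments.
spanning = weighted_segment_mean(700000, 800000, segs)
print(spanning)
```

Bases of the gene that fall in the gap between segments are simply ignored in this sketch; whether that is acceptable depends on how large the gaps are.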