Question

Pruning CNA Data from TCGA

5.8 years ago

jrlarsen • 0

Hello,

I downloaded the CNA data from TCGA (GDC), which comes pre-segmented by CBS using the DNAcopy library from Bioconductor (Level 3). I am currently analyzing the data but cannot find a way to eliminate noise in the form of very short segments that do not match the much longer segments around them. For example, I have three consecutive segments on chromosome 2: the first has 122511 probes with segment mean 0.0235, the second has 3 probes with segment mean -1.5194, and the third has 9606 probes with segment mean 0.0224. Short segments like this (few probes, drastically different from their much longer neighbors) are all over my TCGA data, and I do not know how to remove them properly after segmentation (since that is how the data comes). I have read up on pruning methods via dynamic programming and square mean, but they seem to take place prior to segmentation. I would appreciate any help you can give; I am lost and don't know what to do next.

Thank you

Edit: Grammar

TCGA CNV TCGAbiolinks Segmentation DNAcopy • 2.3k views

updated 5.8 years ago by pbpanigrahi ▴ 420 • written 5.8 years ago by jrlarsen • 0

score 0 · Answer 1 · 2018-07-03

Gistic2 is a popular tool people use for identifying regions of the genome that are significantly amplified or deleted across a set of samples.

It uses parameters such as -maxseg, -maxspace and -js to control the segments to use.

-maxseg: Maximum number of segments allowed for a sample in the input data. Samples with more segments than this threshold are excluded from the analysis. (DEFAULT=2500)

-js: Smallest number of markers to allow in segments from the segmented data. Segments that contain fewer than this number of markers are joined to the neighboring segment that is closest in copy number. (DEFAULT=4)

-maxspace: Maximum allowed spacing between pseudo-markers, in bases. Pseudo-markers are generated when the markers file input is omitted. (DEFAULT=10,000)
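As a rough illustration of what the -maxseg filter described above amounts to, here is a minimal Python sketch that drops samples with too many segments. It is not Gistic2's code; the function name `drop_oversegmented_samples` and the `"Sample"` key are assumptions made for this example, so adjust them to match the column names in your segment file.

```python
from collections import Counter

def drop_oversegmented_samples(rows, max_segments=2500):
    """Keep only rows from samples with at most max_segments segments.

    rows: iterable of dicts, one per segment, each with a "Sample" key
    (hypothetical column name; rename to match your file).
    """
    rows = list(rows)
    # Count how many segments each sample contributes.
    counts = Counter(r["Sample"] for r in rows)
    return [r for r in rows if counts[r["Sample"]] <= max_segments]
```

Samples with thousands of segments are often hypersegmented due to noise, so it can be worth checking the per-sample segment counts before doing any per-segment cleanup.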

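Gistic2 is a widely used tool for array-based CNA identification, and cBioPortal uses it for this purpose. If you would rather not run Gistic2 itself, the ideas behind the parameters above (for example, joining very short segments to the neighboring segment closest in copy number) can probably be applied to your segments directly.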
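If you want to try the -js idea outside of Gistic2, the rule can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not Gistic2's actual implementation: the `Segment` class and `merge_short_segments` function are made up for this example, the input is assumed to be one sample's segments sorted by chromosome and start position, and the merged segment mean is taken as a probe-weighted average (an assumption; Gistic2 may compute it differently).

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass
class Segment:
    chrom: str
    start: int
    end: int
    num_probes: int
    seg_mean: float

def merge_short_segments(segments: List[Segment], min_probes: int = 4) -> List[Segment]:
    """Join segments with fewer than min_probes probes to the adjacent
    segment on the same chromosome that is closest in segment mean."""
    segs = [replace(s) for s in segments]  # work on copies
    while True:
        # Collect short segments that have at least one same-chromosome neighbor.
        candidates = []
        for i, s in enumerate(segs):
            if s.num_probes >= min_probes:
                continue
            nbrs = [j for j in (i - 1, i + 1)
                    if 0 <= j < len(segs) and segs[j].chrom == s.chrom]
            if nbrs:
                candidates.append((s.num_probes, i, nbrs))
        if not candidates:
            return segs
        # Handle the shortest segment first.
        _, i, nbrs = min(candidates)
        j = min(nbrs, key=lambda k: abs(segs[k].seg_mean - segs[i].seg_mean))
        a, b = segs[min(i, j)], segs[max(i, j)]
        total = a.num_probes + b.num_probes
        # Assumption: merged mean is the probe-weighted average of the two.
        mean = (a.seg_mean * a.num_probes + b.seg_mean * b.num_probes) / total
        segs[min(i, j):max(i, j) + 1] = [Segment(a.chrom, a.start, b.end, total, mean)]

# Probe counts and means from the question; coordinates are made up.
example = [
    Segment("2", 1, 1000, 122511, 0.0235),
    Segment("2", 1001, 1100, 3, -1.5194),
    Segment("2", 1101, 2000, 9606, 0.0224),
]
for seg in merge_short_segments(example):
    print(seg)
```

With the default threshold of 4, the 3-probe segment is absorbed into the 9606-probe neighbor, and because of the probe weighting it barely changes that neighbor's mean. A short segment that is the only one on its chromosome is left as is, since there is nothing to merge it with.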

Hope this helps