Pruning CNA Data from TCGA
3.3 years ago
jrlarsen • 0


I downloaded the CNA data from TCGA (GDC), which comes pre-segmented by CBS using the DNAcopy library from Bioconductor (Level 3). I am currently analyzing the data but cannot find a way to eliminate noise in the form of very short segments that do not match the much longer segments surrounding them. For example, I have three consecutive segments on chromosome 2: the first has 122511 probes with segment mean 0.0235, the second has 3 probes with segment mean -1.5194, and the third has 9606 probes with segment mean 0.0224. These short segments (low number of probes) that differ drastically from their much longer neighbors are all over my TCGA data, and I do not know how to remove them properly after segmentation (since that is how the data comes). I have read up on pruning methods via dynamic programming and square mean, but they seem to take place prior to segmentation. I would appreciate any help you are willing to give; I am lost and don't know what to do next.

Thank you


TCGA CNV TCGAbiolinks Segmentation DNAcopy • 1.6k views
3.3 years ago
pbpanigrahi ▴ 410

GISTIC2 is a popular tool for identifying regions of the genome that are significantly amplified or deleted across a set of samples.

It uses parameters such as -maxseg, -maxspace and -js to control which segments are used.

-maxseg: Maximum number of segments allowed for a sample in the input data. Samples with more segments than this threshold are excluded from the analysis. (DEFAULT=2500)

-js: Smallest number of markers to allow in segments from the segmented data. Segments that contain fewer than this number of markers are joined to the neighboring segment that is closest in copy number. (DEFAULT=4)

-maxspace: Maximum allowed spacing between pseudo-markers, in bases. Pseudo-markers are generated when the markers file input is omitted. (DEFAULT=10,000)
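If you want to apply the -maxseg idea to your own table without running GISTIC2, it amounts to counting segments per sample and dropping samples above the threshold. Here is a minimal sketch in plain Python. The row layout (dicts with a `sample` key), the function name and the toy sample IDs are my assumptions for illustration, not anything GISTIC2 or the GDC files define.

```python
from collections import Counter


def drop_oversegmented_samples(rows, max_segments=2500):
    """Keep only rows from samples with at most `max_segments` segments.

    `rows` is a list of dicts, one per segment, each with a "sample" key
    (hypothetical layout; adapt to your own column names).
    """
    counts = Counter(row["sample"] for row in rows)
    kept = {sample for sample, n in counts.items() if n <= max_segments}
    return [row for row in rows if row["sample"] in kept]


# Toy input: S1 has three segments, S2 has one (made-up sample IDs).
rows = [
    {"sample": "S1", "seg_mean": 0.02},
    {"sample": "S1", "seg_mean": -1.50},
    {"sample": "S1", "seg_mean": 0.03},
    {"sample": "S2", "seg_mean": 0.10},
]
filtered = drop_oversegmented_samples(rows, max_segments=2)
```

With the toy threshold of 2, every S1 row is dropped and only the S2 row remains. The same logic is a one-line group-and-filter on an R data frame.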

GISTIC2 is a widely used tool for array-based CNA identification, and cBioPortal uses it for this purpose. If you would rather not run GISTIC2 itself, you could try applying the same rules as the parameters above to your segments directly.
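For the -js rule specifically, a post-segmentation merge can be sketched as below. This is my reading of the parameter description, not GISTIC2's actual code: any segment with fewer than `min_probes` probes is joined to whichever adjacent segment on the same chromosome has the closest segment mean, and the merged mean is taken as the probe-weighted average (that weighting is my assumption). The `Segment` class, the function names and the genomic coordinates are hypothetical; only the probe counts and means come from the question.

```python
from dataclasses import dataclass, replace


@dataclass
class Segment:
    chrom: str
    start: int
    end: int
    num_probes: int
    seg_mean: float


def _neighbors(segs, i):
    """Indices of adjacent segments on the same chromosome as segs[i]."""
    return [j for j in (i - 1, i + 1)
            if 0 <= j < len(segs) and segs[j].chrom == segs[i].chrom]


def merge_short_segments(segments, min_probes=4):
    """Join segments with fewer than `min_probes` probes to the neighbor
    closest in segment mean. Expects one sample's segments, sorted by
    chromosome and start position. Returns a new list."""
    segs = [replace(s) for s in segments]  # work on copies
    while True:
        short = [i for i, s in enumerate(segs)
                 if s.num_probes < min_probes and _neighbors(segs, i)]
        if not short:
            return segs
        # Handle the shortest segment first.
        i = min(short, key=lambda k: segs[k].num_probes)
        j = min(_neighbors(segs, i),
                key=lambda k: abs(segs[k].seg_mean - segs[i].seg_mean))
        lo, hi = min(i, j), max(i, j)
        a, b = segs[lo], segs[hi]
        total = a.num_probes + b.num_probes
        # Probe-weighted mean of the two joined segments (assumption).
        mean = (a.seg_mean * a.num_probes + b.seg_mean * b.num_probes) / total
        segs[lo:hi + 1] = [Segment(a.chrom, a.start, b.end, total, mean)]


# The three chromosome 2 segments from the question; coordinates are made up.
example = [
    Segment("2", 1, 1_000_000, 122511, 0.0235),
    Segment("2", 1_000_001, 1_000_500, 3, -1.5194),
    Segment("2", 1_000_501, 2_000_000, 9606, 0.0224),
]
pruned = merge_short_segments(example, min_probes=4)
```

On this example the 3-probe segment is absorbed into the 9606-probe segment that follows it, because its mean is marginally closer to 0.0224 than to 0.0235, and the merged mean barely moves. Run it per sample, since the function assumes all segments belong to one sample. If you also want neighbors with near-identical means collapsed into one segment, that would be a separate step.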

Hope this helps


This is extremely helpful, thank you! This is the first I have heard of GISTIC, and it looks like what I need. I cannot find an R library that runs GISTIC; it seems to run only from the terminal. Do you know of any way to run it directly from R? R is the platform I am running everything through.

Edit: Grammar

