Question: Dataset preparation for driver/passenger gene discovery
mhasa006 (United States) wrote, 3.1 years ago:

I want to work on a project to find driver/passenger genes in cancer datasets. I want to build a statistical model by analyzing various types of data, such as copy number variation, gene expression, methylation, and somatic point mutations, from TCGA datasets like BRCA. Can someone explain the preprocessing steps needed so that these datasets can be fed to the model?

For example, if I want to build a classifier whose features are the above-mentioned attributes from a cancer dataset, which datasets should I download, and how do I process them so that the model understands them as features?

Pardon my naive question; I'm new to cancer research and quite perplexed by the intricacies of the vast amount of cancer data.

Tags: cancer, snp, snv, tcga, gene
Kevin Blighe wrote, 3.1 years ago:

Your question could not have been asked at a better time, as the area to which you are alluding is more or less called integrated omics, i.e., piecing together the clues given by each technology and technique used to probe disease in order to make better sense of it. If we look at just DNA, we won't solve disease; if we look at just gene expression, we won't solve it either; if we look at just metabolomics, same story... However, by piecing everything together, we stand a really good chance of understanding disease mechanisms better.

That said, it's not an easy task, and the way of integrating all of these various data-types is by no means standardised. From my perspective, it has to be a very carefully planned and methodical process. My approach would be to analyse each type of data separately, build separate predictive (or other) models, and then try to make sense of the results in combination. I would love to comment further, but I've just been completing my own study on a type of cancer in which I have specifically been integrating multiple data-types, and the manuscript has already been submitted.
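To make the "each data-type separately, then in combination" idea a bit more concrete, here is a minimal sketch in plain Python. It is not taken from any particular pipeline: the sample IDs, gene values, and the function names `zscore_features` and `build_feature_blocks` are all made up for illustration. It assumes each data-type has already been reduced to a per-sample, per-gene table with no missing values, keeps only the samples present in every data-type, and standardises the continuous data-types while leaving binary mutation calls untouched.

```python
from statistics import mean, pstdev


def zscore_features(table):
    """Standardise every feature across samples.

    `table` maps sample ID -> {feature name: numeric value}; every sample
    is assumed to carry the same features (no missing values).
    """
    features = sorted(next(iter(table.values())))
    scaled = {sample: {} for sample in table}
    for feature in features:
        values = [row[feature] for row in table.values()]
        centre, spread = mean(values), pstdev(values)
        for sample, row in table.items():
            # A constant feature carries no information; map it to 0.
            scaled[sample][feature] = (row[feature] - centre) / spread if spread else 0.0
    return scaled


def build_feature_blocks(modalities, keep_raw=()):
    """Return (shared sample IDs, {data-type: (feature names, rows)})."""
    shared = sorted(set.intersection(*(set(t) for t in modalities.values())))
    if not shared:
        raise ValueError("no sample is present in every data-type")
    blocks = {}
    for name, table in modalities.items():
        subset = {s: table[s] for s in shared}
        if name not in keep_raw:
            subset = zscore_features(subset)
        features = sorted(subset[shared[0]])
        blocks[name] = (features, [[subset[s][f] for f in features] for s in shared])
    return shared, blocks


# Toy, invented values: log2 expression, copy-number log ratios, 0/1 mutation calls.
expression = {
    "S1": {"TP53": 8.0, "ERBB2": 12.0},
    "S2": {"TP53": 10.0, "ERBB2": 9.0},
    "S3": {"TP53": 9.0, "ERBB2": 9.0},
}
copy_number = {
    "S1": {"TP53": -0.5, "ERBB2": 1.5},
    "S2": {"TP53": 0.5, "ERBB2": 0.5},
}
mutation = {
    "S1": {"TP53": 1, "ERBB2": 0},
    "S2": {"TP53": 0, "ERBB2": 0},
    "S4": {"TP53": 1, "ERBB2": 1},
}

shared, blocks = build_feature_blocks(
    {"expression": expression, "copy_number": copy_number, "mutation": mutation},
    keep_raw=("mutation",),
)
```

Each entry in `blocks` is a samples-by-features matrix in the same sample order, so one model can be fitted per data-type and the results compared afterwards. Real TCGA data would need more care than this (matching barcodes, handling missing values, batch effects, and so on).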

There are tentative examples of analysing data-types together, most notably eQTL analysis, which looks at gene expression in combination with genotyping. It would also be possible to, for example, see how mutations in certain genes (DNA-seq) affect the level of copy number alterations (NGS or copy number array) in a tumour (indicative of genomic instability), or to check how, for example, an intergenic mutation (DNA-seq) produces a novel transcription factor binding site (ChIP-seq) and drives expression of a nearby gene (RNA-seq).
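As a toy illustration of the eQTL idea, the sketch below regresses the expression of one gene on the genotype dosage (0, 1 or 2 copies of the alternate allele) at one variant, using ordinary least squares written out by hand. The data and the function name `eqtl_association` are invented for the example, and it deliberately leaves out what a real eQTL analysis would include, such as covariates, population structure, and multiple-testing correction.

```python
from math import sqrt


def eqtl_association(dosages, expression):
    """Least-squares fit of expression on genotype dosage.

    Returns (slope, intercept, Pearson r). The slope is the estimated
    change in expression per extra copy of the alternate allele.
    """
    if len(dosages) != len(expression) or len(dosages) < 3:
        raise ValueError("need matched dosage/expression values for >= 3 samples")
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("dosage and expression must both vary across samples")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Invented data: six samples, genotype dosage and log2 expression of a nearby gene.
dosages = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]

slope, intercept, r = eqtl_association(dosages, expression)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

In this toy data, expression rises by roughly one log2 unit per alternate allele; in practice this test would be repeated over many variant-gene pairs, which is where the multiple-testing burden comes from.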

Other than what I've mentioned here, the best way to learn how we process cancer data is by reading published manuscripts. There have been many great TCGA publications. Also, if you are really new to this area, then I would love for you to read the following manuscript, which should open your eyes to the power of merging all of these data-types together in order to better understand disease:

I would also love for you to read this manuscript in order to open your eyes about the importance of 'thinking in 3D' in relation to DNA:



I would just like to add that practically 'all' of the driver and passenger mutations have already been identified in the main cancers. However, it is the underlying mechanisms that we don't yet understand in most cases.

written 3.1 years ago by Kevin Blighe

Thank you very much for your detailed reply. I'll read these links and ask further questions if I have any.

written 3.1 years ago by mhasa006

Sure thing. The question was somewhat general, so my answer was general too. With cancer data, the possibilities are pretty much endless in terms of how you analyse it.

Let's also just mention the main problems that the field of cancer research is aiming to tackle next:

  • Early detection through the analysis of circulating tumour DNA (and detecting which circulating DNA fragments are indeed from the tumour and which are not)
  • Tumour clonality and how it hampers treatment strategies

written 3.1 years ago by Kevin Blighe