News: Example of how bioinformaticians can publish in Scientific Reports (by Nature Publishing Group) using publicly available NGS data
David Langenberger wrote, 2.2 years ago:

Changes of bivalent chromatin coincide with increased expression of developmental genes in cancer



Bivalent (poised or paused) chromatin comprises activating and repressing histone modifications at the same location. This combination of epigenetic marks at promoter or enhancer regions keeps genes expressed at low levels but poised for rapid activation. Typically, DNA at bivalent promoters is only lowly methylated in normal cells, but frequently shows elevated methylation levels in cancer samples. Here, we developed a universal classifier built from chromatin data that can identify cancer samples solely from hypermethylation of bivalent chromatin. Tested on over 7,000 DNA methylation data sets from several cancer types, it reaches an AUC of 0.92. Although higher levels of DNA methylation are often associated with transcriptional silencing, counter-intuitive positive statistical dependencies between DNA methylation and expression levels have been recently reported for two cancer types. Here, we re-analyze combined expression and DNA methylation data sets, comprising over 5,000 samples, and demonstrate that the conjunction of hypermethylation of bivalent chromatin and up-regulation of the corresponding genes is a general phenomenon in cancer. This up-regulation affects many developmental genes and transcription factors, including dozens of homeobox genes and other genes implicated in cancer. Thus, we reason that the disturbance of bivalent chromatin may be intimately linked to tumorigenesis.

Read the complete publication:

news ngs publication • 1.6k views
written 2.2 years ago by David Langenberger
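As a toy illustration of the classification idea in the abstract (simulated data and a deliberately naive scorer, not the authors' actual pipeline), one could score each sample by its mean beta value over a set of bivalent-promoter probes and measure the cancer/normal separation with a ROC AUC:

```python
# Hypothetical sketch: bivalent promoters are lowly methylated in normal
# samples but hypermethylated in cancer, so the mean beta value over such
# probes can act as a one-number classifier score.
import random

random.seed(42)

def simulate_sample(is_cancer, n_probes=50):
    # Simulated beta values (clamped to [0, 1]); means are invented here.
    base = 0.6 if is_cancer else 0.15
    return [min(1.0, max(0.0, random.gauss(base, 0.15))) for _ in range(n_probes)]

def score(sample):
    return sum(sample) / len(sample)  # mean beta over "bivalent" probes

def roc_auc(pos_scores, neg_scores):
    # AUC = probability that a random cancer sample outscores a random
    # normal sample (ties count half), by exhaustive pairwise comparison.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

cancer = [score(simulate_sample(True)) for _ in range(200)]
normal = [score(simulate_sample(False)) for _ in range(200)]
print(f"AUC: {roc_auc(cancer, normal):.2f}")
```

Because the simulated hypermethylation signal is strong, this toy scorer separates the groups almost perfectly; the paper's reported AUC of 0.92 comes from real, noisier data and a properly built classifier.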

This is an article in Scientific Reports, which is NPG's equivalent to PLOS One. Although I care more about the quality of the work than where it is published, I wouldn't refer to all journals published by NPG as "Nature".

modified 2.2 years ago • written 2.2 years ago by Lars Juhl Jensen

Combining gene mutation with gene expression data improves outcome prediction in myelodysplastic syndromes

Check out this article as well. It was done completely with publicly available microarray data and published in Nature Communications. I would highly recommend checking their supplementary reproducible code: beautiful R code on regression models and plotting, compiled with knitr!

written 2.2 years ago by poisonAlien

Sorry... what is the purpose of this post? I mean, it's an interesting paper, but why post it on Biostars?

written 2.2 years ago by dariober

It nicely shows how one can use publicly available data, which were created to answer completely different questions, for a completely new analysis, and these results can be published in a Nature journal. I think this is good news for bioinformaticians, who pretty often think they can only work with wet labs and expensive sequencing runs.

written 2.2 years ago by David Langenberger

Fair enough, but I think the whole ENCODE, 1000 Genomes, Blueprint, etc. datasets were produced and made public partly with the idea of enabling other researchers to mine them. There are a lot of papers using these data, so I'm not sure this paper is special in that respect.

modified 2.2 years ago • written 2.2 years ago by dariober

I did not claim that it is special in that respect. It is just an example. I can delete the post if that makes you feel better. I am not in the mood for this discussion, sorry. It is a new year and I do not want to spam anyone.

I know the people who wrote it, and they were proud of the fact that they could publish in such a high-profile journal without expensive experiments. So I thought it might be worth sharing this experience.

written 2.2 years ago by David Langenberger

Sorry... I was just trying to understand...

written 2.2 years ago by dariober

You don't have to be sorry. I got your point and changed the title. I just don't want to make a mountain out of a molehill.

I like discussions, but sometimes it is just not worth it. ;)

written 2.2 years ago by David Langenberger

I had the same question as dariober, but your answer makes sense. Perhaps including that in the top post would clarify things quite a bit.

written 2.2 years ago by WouterDeCoster

Well, good point. I changed the title.

written 2.2 years ago by David Langenberger
John wrote, 2.2 years ago:

While data-driven science is obviously a nice prospect for people who live and breathe biological data, there are some serious issues that I think need to be addressed before it can be accepted in quite the same way that traditional hypothesis-driven research is. There can be over a million parameter tweaks in a given pipeline, each of which generates a different answer with some probability of being true/false. While it would take an individual a significant amount of time to do a million different actual experiments, p-hacking/result-hacking can be done programmatically overnight. I'm not saying that this paper or any other uses such sneaky techniques; I'm just saying that a researcher who needs to determine how reliable the findings of a paper are couldn't possibly know.

Data-driven research has yet to provide a reliable way to feel confident about the conclusions it comes to. Between very small but highly significant changes, and publications that document only 10% of the computational work done, I find myself seeing, accepting, but never really believing the conclusions of such papers. To my own loss, most likely. Hopefully pure in silico research will find itself being replicated by others, ideally with different computational tools but still arriving at the same answers. I hope that sort of thing becomes the norm.

written 2.2 years ago by John
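John's point about programmatic result-hacking can be made concrete with a small simulation. Under the null hypothesis, p-values are uniform on [0, 1], so merely re-running an analysis with enough parameter tweaks almost guarantees a "significant" hit even when nothing real is there (the tweak counts and independence assumption here are illustrative):

```python
# Toy p-hacking simulation: each pipeline "tweak" behaves like an
# independent null test, so its p-value is uniform on [0, 1].
import random

random.seed(0)

def best_p_after_tweaks(n_tweaks):
    # Report only the smallest p-value found across all tweaks.
    return min(random.random() for _ in range(n_tweaks))

trials = 1000
for n_tweaks in (1, 10, 100):
    hits = sum(best_p_after_tweaks(n_tweaks) < 0.05 for _ in range(trials))
    # Expected false-"discovery" rate is 1 - 0.95 ** n_tweaks.
    print(f"{n_tweaks:3d} tweaks -> {hits / trials:.0%} of runs report p < 0.05")
```

With 100 tweaks of a null pipeline, the chance of at least one p < 0.05 is 1 - 0.95^100, roughly 99%; doing the equivalent with a hundred bench experiments would take months, which is the wet-lab/in-silico asymmetry John describes.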

Right. Without source code this paper is certainly no example of reproducible research, even though, being both in silico and based on public data, it was well suited to be an exemplar.

written 2.2 years ago by Jeremy Leipzig

When should we expect to see a scientific report (or a real paper) from you? Is the thesis finally done? Since it is the new year, I thought I should ask :)

modified 2.2 years ago • written 2.2 years ago by genomax

If you had asked me 6 months ago if I'd have it done by New Year's, I'd have said yes, absolutely; but alas, I'm probably still a few weeks away. I've been very unwell the past two months (and I think it shows; I've been very inactive on Biostars lately), so everything has dragged out a bit longer than I'd hoped.

modified 2.2 years ago • written 2.2 years ago by John

Sorry to hear that. Hope you feel better soon.

written 2.2 years ago by genomax

This point is well taken. I completely agree. But how boring would a bioinformatician's life be without heuristics, statistics, and black boxes. :)

modified 2.2 years ago • written 2.2 years ago by David Langenberger

Heheh, well, it would certainly be less interesting :) Although sometimes I feel like I'm a researcher of black boxes rather than of biological data -_-;

written 2.2 years ago by John

I agree completely that results should be considered provisional until replicated/validated, but the same is true of hypothesis-driven research. And many of the issues that you raise (cherry-picking data, inadequate documentation of methods) are not limited to data-driven science. While p-hacking may be easier, I can assure you that bench scientists (of which I am a member) are every bit as capable of manipulating data to get the results they want. Note that I am not so cynical as to believe that this behavior is the norm (for either data or bench science) but, as always, caveat emptor.

written 2.2 years ago by harold.smith.tarheel

You're right, of course, but I still think that sort of trick is much harder to pull off on the wet-lab side of things. The reagents are expensive and the work laborious. I think wet-lab scientists often try multiple initial experiments, and a path isn't pursued further unless it comes back with promising results or there is really strong prior evidence that something interesting can be found.

Conversely, in silico, if you don't like the results DESeq gives you, there's always Cufflinks. I think that the low time/resource cost of just trying an experiment another way is the problem, and you're very right that it will probably become more of a problem for wet-lab work too, as more experiments become automated and cheaper to perform. Hm.

modified 2.2 years ago • written 2.2 years ago by John
Sinji (UT Southwestern Medical Center) wrote, 2.2 years ago:

This is interesting. I wonder if they set out to test this hypothesis, and if so, what made them interested in pursuing it? Or if they were simply mining data and happened upon this discovery.

written 2.2 years ago by Sinji

They coincidentally saw this behaviour in lymphoma and then tested it in the other cancer types.

written 2.2 years ago by David Langenberger
Lluís R. (Spain, Barcelona) wrote, 2.2 years ago:

Interesting example! Thanks for sharing!

I am not very familiar with methylation experiments. I read the methods section with interest, but I couldn't find any mention of the normalizations applied (which makes me wonder whether I need more background to understand how they reach those conclusions, or whether I just don't know how to read articles). Shouldn't each study be normalized to be comparable? Aren't there batch/study effects?

Maybe that would be a question on its own, but since you know the authors, maybe you are familiar with the analysis.

written 2.2 years ago by Lluís R.

I think this is a yes-and-no question. You are right that normally you would normalize all HM450k data together to make them comparable with each other. But in my experience the beta values are already normalized in a sense, i.e. bounded in [0,1], so normalization with other arrays does not change much, as long as each group has been normalized internally. In this study we used an intra-array normalization (i.e. methylation relative to the average beta value of the same array) for the cancer-control classification. Thus, the data were normalized. For the expression/methylation relations, we only did comparisons within single studies, so we used the published, normalized data. As we found all cancers to behave similarly (and not groups from the same sequencing center, etc.), we concluded that the findings are not batch effects. This also suggests that the cancer effect is stronger than any possible batch effect.

written 2.2 years ago by helene
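A minimal sketch of the intra-array normalization described above, assuming it simply means expressing each probe's beta value relative to the mean beta of its own array (the actual study's procedure may involve more steps):

```python
# Intra-array normalization sketch: subtract the per-array mean beta value,
# so each probe is expressed relative to its own array. This makes samples
# roughly comparable without a joint cross-study normalization.
def intra_array_normalize(betas):
    """betas: list of beta values (each in [0, 1]) from one HM450k array."""
    mean_beta = sum(betas) / len(betas)
    return [b - mean_beta for b in betas]

# Toy example: one sample's beta values at four probes (array mean = 0.5).
array = [0.10, 0.20, 0.90, 0.80]
print(intra_array_normalize(array))
```

Positive values then flag probes methylated above their array's average (candidate hypermethylation), negative values below it, independent of any array-wide shift between studies.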

Thanks for your explanation! I must have missed it in the report :\

written 2.2 years ago by Lluís R.