Question: Biggest barriers to making sense of sequence data
4.2 years ago by
samgreenberg2520 wrote:

This may be less technical than most questions here but it's something I've been wondering about for some time.

I have studied biology and, to a lesser degree, bioinformatics (one Python bioinformatics course). It seems to me the holy grail of bioinformatics would be to plug in a sequence and be able to predict all the proteins in the organism and how they would be expressed. Basically the complete inter-conversion of phenotype and genotype in silico, which is of course physically possible but technically challenging.

My question is to what extent this is possible today and what are the biggest hurdles to accomplishing this. Realistically, I'm thinking about less impressive comparisons: can we compare the sequence of a black Labrador and a white Labrador and confidently say which genes or promoters are responsible for the difference in color? Obviously it is easier when we know which metabolic pathway to check, but how difficult is it to subtract one genome from another and then assign the phenotypic effect of each of those differences?

It seems to me that machine learning and other clever techniques have been implemented to solve problems like image recognition, natural language processing and other problems where the inputs are significantly less friendly to computer processing than sequencing data is.

Is it an issue of computing power, programming, insufficient sequencing data or something else?
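To make the "subtract one genome from another" step concrete, here is a minimal Python sketch. It assumes two short sequences that are already aligned and of equal length, which real genomes are not; the function name `site_differences` and the example sequences are hypothetical, made up for illustration only.

```python
def site_differences(seq_a, seq_b):
    """List (position, base_a, base_b) for every mismatch.

    Assumes the two sequences are already aligned and of equal length.
    Real genomes need alignment or variant calling first.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


# Hypothetical toy sequences, not real dog genes.
dog_a = "ATGGCGTACCTGA"
dog_b = "ATGGCGTGCCTGA"

diffs = site_differences(dog_a, dog_b)
print(diffs)
```

Listing the differences is the mechanical part. Nothing in this output says which difference, if any, affects coat color, and that assignment is the part the question is really about.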

ADD COMMENTlink modified 4.2 years ago by harold.smith.tarheel4.6k • written 4.2 years ago by samgreenberg2520

My 2p. About machine learning and image recognition, I'd point out that these applications can rely on a high quantity and quality of data to train the underlying algorithms. For example (I don't have numbers at hand), there must be millions of handwritten postcodes for which you know the true answer, and you can use them for the training, cross-validation etc. of image recognition methods. In biology it's much more difficult to generate such good training sets.

ADD REPLYlink written 4.2 years ago by dariober11k
4.2 years ago by
Devon Ryan96k
Freiburg, Germany
Devon Ryan96k wrote:

I'll add to Giovanni's reply by appending:

  • Often unclear which proteins are interacting in which cell types or under what conditions.
  • Often unclear exactly how and when changes in one cell type affect another.
  • Phenotype isn't completely encoded by DNA. The entire developmental process and the environment play a HUGE role in this.
ADD COMMENTlink written 4.2 years ago by Devon Ryan96k

Fair enough, but what about unicellular life? There are many strains of S. cerevisiae used in making beer, wine and bread. The limitations you describe would not be applicable in that case. How difficult would it be to ascribe genomic patterns to a specific ester profile or alcohol tolerance? Would you need tens of genomes? Thousands?

I'm going to disagree with your statement about how significantly the environment affects phenotype. Obviously epigenetics plays a part in the development of many organisms, and some phenotypes, like those related to X-inactivation or temperature-dependent expression, are pseudo-random. That said, most traits are purely genetic.

ADD REPLYlink written 4.2 years ago by samgreenberg2520

Your last sentence is, I think, the part you need to readdress. Yes, the genome is very important, and yes, as an epigeneticist I'm a little biased towards the significance of epigenetics. However, most traits are not purely genetic. Most known traits are purely genetic. Most traits mentioned in your textbooks are purely genetic. Most traits, however, such as someone's tone of voice, are environmental.

ADD REPLYlink written 4.2 years ago by John12k
4.2 years ago by
John12k wrote:
  • The genome sequence is not a cassette tape with a start and an end. It's more like a box of Lego. Yes, you need 4 wheel pieces if you want to make a car, but you also need them to be in the right place at the right time, and genomic sequence alone doesn't tell you that. Genomic sequencing just tells you if there are wheels in the box. RNA-Seq and other-seqs might shed some light on how they are used, but they are not particularly robust assays at present. In many ways, the genome is not nearly as important as what you build from it, and that's highlighted quite well by how much DNA we share with a banana. It's just that you can't possibly build a car without wheels, so current biology is all about looking for no-wheels-disease, because that's simple enough for our current understanding of biology. All the programs, computing power and sequencing data in the world won't help us build a better model of how to take wheels and blocks and build a car. We're missing multiple dimensions of data from the intra/extracellular environment.

  • I'm not a bioinformatician (yet) so my opinion on the biggest challenges in bioinformatics needs a big [citation needed]. However, I would say the biggest impediment to data nirvana is that we're not dealing with complexity very well. In many instances, the bioinformatician's promise to the biologist is "don't worry, I'll do all the computer stuff, you worry about the cell stuff", and I think that's fundamentally a bad approach. It's also true the other way around, so I'm not blaming anyone specifically here; this is just the system we ended up with. If I had to guess, I'd say 80-90% of programs are written without consideration given to someone else having to read the code. 9-19% of code is written with consideration given to someone just-like-them having to read the code, and knowing, for example, that [::-1] reverses a sequence in Python, rather than using the reversed() built-in. And <1% is written for biologists who don't know all the neat tricks, and just want to know fundamentally what has happened to their data. Currently that 9-19% of code is probably the most popular and accounts for most of the software actually used; however, I hope in the future the <1% code will be the most popular, even if it's slower to run. I could talk at depth about how to make software more understandable to biologists, but like some sick joke I wouldn't recommend it. You asked what's best for humanity/science, not what's best for current working scientists, and there's a big difference. I think a lot of biologists who can't analyse data should lose their jobs, and a lot of computer scientists who call themselves bioinformaticians because they wrote an aligner should lose their jobs too, but some of those people are my friends, and I want my friends to be happy and successful. You see the problem. Not to mention I'd be the first to get sacked ;)

  • Sticking with the good-things-come-in-threes meme, my final point would be that our incentives are all screwed up. The biological sciences have a very outdated way of doing things, and are not at all prepared for the digital age. Too much time/attention is spent on publishing in special journals, and not nearly enough money is spent on maintaining software, integrating databases/information, or educating users. There's also much too much trust. There's a famous story floating around the chatrooms - which may very well be untrue - that someone hired an out-of-lab bioinformatician to get a certain result, and that result ended up contradicting the findings of their work, so they fired that bioinformatician and found another one who returned the 'correct' result. There's no evidence beyond the circumstantial for this, of course, so nothing to follow up on - but it doesn't matter if it's true or not. It's plausible, and I can't see any way around it. Unlike physical experiments, you can pick and choose your statistical test, your mapper, your differentially-expressed-gene software, your pathway analysis tool, and all the parameter switches in between. You can p-hack your way to a Nobel prize. That has to change.

Anyway, that's enough unpopular opinions for one day :)
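To illustrate the readability point from the second bullet, here is a small sketch of three ways to reverse a DNA string in Python. The variable names and the helper `reverse_explicit` are hypothetical, picked for the example; which style counts as "readable" is of course a judgment call.

```python
seq = "ATGC"

# Terse slice idiom: obvious to Python regulars, cryptic to newcomers.
by_slice = seq[::-1]

# reversed() returns an iterator, so a string must be rebuilt with join().
by_reversed = "".join(reversed(seq))


def reverse_explicit(sequence):
    """Reverse a string one base at a time, spelled out step by step."""
    result = ""
    for base in sequence:
        # Put each new base in front of what has been collected so far.
        result = base + result
    return result


by_loop = reverse_explicit(seq)
print(by_slice, by_reversed, by_loop)
```

All three give the same string. The explicit loop is the slowest, but it is probably the version a biologist could follow without looking anything up.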

ADD COMMENTlink written 4.2 years ago by John12k

Are you responding to the right question? I'm having trouble understanding how your answer relates to my question.

By the way, to your first point, DNA is like a cassette tape. It does have a start and end (5' and 3'); hell, it even has an A side and a B side.

ADD REPLYlink written 4.2 years ago by samgreenberg2520

I was responding to

My question is to what extent this is possible today and what are the biggest hurdles to accomplishing this.

Point 1 is why it's not possible with the current information, and points 2 and 3 are the biggest hurdles to accomplishing this in the future. And yes, DNA does have sides and direction; that's why I chose cassette tapes for the analogy :) But unfortunately that's where the similarities end. A cassette of songs is sufficient to recreate the album, but a genome is not enough to recreate the organism.

ADD REPLYlink written 4.2 years ago by John12k
4.2 years ago by
United States
harold.smith.tarheel4.6k wrote:

I'd argue that the largest barrier is our limited understanding of the biology. And our ignorance compounds at each stage on the progression from genotype to phenotype. I'd say that genome sequencing and assembly is, for the most part, a solved problem. The prediction of coding sequences from that genome is reasonably good but incomplete (particularly for novel genes). The prediction of gene function is limited to homology to known proteins, and only sufficient for enzymatic classification (kinase, transcription factor, etc). Despite decades of research in transcriptional regulation, we can't predict gene regulatory networks with any confidence, and instead must resort to phenomenological descriptions via RNA-Seq, ChIP-Seq, miRNA-Seq, etc. Similarly, we can't predict de novo which components in the cell interact, and what the consequences of those interactions are. And sequence variation compounds those challenges, since, with the exception of a few well-characterized motifs (such as enzyme catalytic residues), we don't understand the consequences of those variants. And that's before considering the complexity of cell-cell interactions.

Venter's minimal cell project highlights our lack of understanding. Fully one-third of the genes required for viability in the latest iteration have no known function.

Finally, the assertion of genetic determinism ("most traits are purely genetic") is either intentionally provocative or reflective of profound ignorance (apologies if that sounds harsh, but I'd respond the same way to any student of mine making this claim). Consider bacteriophage lambda, probably the best-characterized organism (from a mechanistic standpoint). It can adopt one of two life cycles, lytic or lysogenic. Independent of genetic variation, that decision is dictated solely by its environment.

ADD COMMENTlink written 4.2 years ago by harold.smith.tarheel4.6k

I'd say that genome sequencing and assembly is, for the most part, a solved problem.

People working with plants and other oddball genomes will beg to differ :-) Even sequencing is only "solved" for things we can tackle by current technologies.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by genomax90k

I said solved, not easy ;-).

Seriously, those genomes are challenging for NGS, but largely tractable by older methods (cosmid/YAC/BAC cloning, genome walking, etc.). And easier to address than our lack of knowledge at subsequent levels.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by harold.smith.tarheel4.6k

You (and I) sound old school. Not many young researchers today would want to do the (backbreaking) work of making cosmid/YAC/BAC libraries and then physical mapping. Those were not good old days but they did get us to where we are at now.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by genomax90k

Careful, we're starting to sound like our mentors ("You kids have it so easy, with your NEB catalog. In my day, if we wanted to digest DNA, we had to purify the restriction enzyme ourselves!").

Technology marches apace, which is (mostly) a good thing.

ADD REPLYlink written 4.2 years ago by harold.smith.tarheel4.6k
4.2 years ago by
London, UK
Giovanni M Dall'Olio27k wrote:

I would say:

  • not enough sequences and data available for representing all variation
  • data not clean, insufficient metadata, data not mapped to ontologies
  • difficulty in integrating data.
ADD COMMENTlink written 4.2 years ago by Giovanni M Dall'Olio27k