Question

Biggest barriers to making sense of sequence data

2

7.8 years ago

samgreenberg25 ▴ 20

This may be less technical than most questions here but it's something I've been wondering about for some time.

I have studied biology and, to a lesser degree, bioinformatics (one Python bioinformatics course). It seems to me the holy grail of bioinformatics would be to plug in a sequence and be able to predict all the proteins in the organism and how they would be expressed. Basically the complete inter-conversion of phenotype and genotype in silico, which is of course physically possible but technically challenging.

My question is to what extent this is possible today and what the biggest hurdles to accomplishing it are. Realistically I'm thinking about less impressive comparisons: can we compare the sequence of a black Labrador and a white Labrador and confidently say which genes or promoters are responsible for the difference in color? Obviously it is easier when we know which metabolic pathway to check, but how difficult is it to subtract one genome from another and then assign the phenotypic effect of each of those differences?
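To make the "subtract one genome from another" idea concrete, here is a minimal toy sketch in Python. It assumes the two sequences are already aligned and of equal length, which real genomes are not; the function name and both sequences are made up for illustration. It only lists where the sequences differ and says nothing about what each difference does, which is the hard part being asked about.

```python
# Toy illustration only: real comparisons work on aligned reads and variant
# calls, not short strings. All names and sequences here are hypothetical.

def differing_positions(seq_a: str, seq_b: str) -> list:
    """Return (position, base_in_a, base_in_b) for every mismatch.

    Assumes both sequences are already aligned and of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


black_lab = "ATGGCGTACCTGAAC"  # made-up fragment
white_lab = "ATGGCGTATCTGAGC"  # made-up fragment
diffs = differing_positions(black_lab, white_lab)
print(diffs)  # [(8, 'C', 'T'), (13, 'A', 'G')]
```

Getting this list is cheap. Deciding which of the entries, if any, explains the coat color is where the biology comes in.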

It seems to me that machine learning and other clever techniques have been implemented to solve problems like image recognition, natural language processing and other problems where the inputs are significantly less friendly to computer processing than sequencing data is.

Is it an issue of computing power, programming, insufficient sequencing data or something else?

gene • 2.0k views

ADD COMMENT • link updated 7.8 years ago by harold.smith.tarheel ★ 4.9k • written 7.8 years ago by samgreenberg25 ▴ 20

0

My 2p. About machine learning and image recognition, I'd point out that these applications can rely on a high quantity and quality of data to train the underlying algorithms. For example (I don't have numbers at hand), there must be millions of handwritten postcodes for which you know the true answer, and you can use them for training, cross-validation, etc. of image recognition methods. In biology it's much more difficult to generate such good training sets.
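For anyone unfamiliar with the cross-validation step mentioned above, this is a minimal sketch of how labelled examples get split into k folds, using only the Python standard library. The function name and the sample count are assumptions for illustration; in practice you would probably use an existing library rather than roll your own.

```python
import random


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_indices, test_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle reproducibly
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical example: 10 labelled samples, 5 folds.
splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Every labelled example is held out exactly once, which is why the approach needs examples whose true answer is known.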

ADD REPLY • link 7.8 years ago by dariober 14k

score 3 · Answer 1 · 2016-07-17

7.8 years ago

Devon Ryan 104k

I'll add to Giovanni's reply by appending:

Often unclear which proteins are interacting in which cell types or under what conditions.
Often unclear exactly how and when changes in one cell type affect another.
Phenotype isn't completely encoded by DNA. The entire developmental process and the environment play a HUGE role in this.

0

Fair enough, but what about unicellular life? There are many strains of S. cerevisiae used to make beer, wine and bread. The limitations you describe would not apply in that case. How difficult would it be to ascribe genomic patterns to a specific ester profile or alcohol tolerance? Would you need tens of genomes? Thousands?

I'm going to disagree with your statement about how significantly the environment affects phenotype. Obviously epigenetics plays a part in the development of many organisms, and some phenotypes, like those related to X-inactivation or temperature-dependent expression, are pseudo-random. That said, most traits are purely genetic.

ADD REPLY • link 7.8 years ago by samgreenberg25 ▴ 20

1

Your last sentence is, I think, the part you need to readdress. Yes, the genome is very important, and yes, as an epigeneticist I'm a little biased towards the significance of epigenetics. However, most traits are not purely genetic. Most known traits are purely genetic. Most traits mentioned in your textbooks are purely genetic. Most traits, however, such as someone's tone of voice, are environmental.

ADD REPLY • link 7.8 years ago by John 13k

score 3 · Answer 2 · 2016-07-17

The genome sequence is not a cassette tape with a start and an end. It's more like a box of Lego. Yes, you need 4 wheel pieces if you want to make a car, but you also need them to be in the right place at the right time, and genomic sequence alone doesn't tell you that. Genomic sequencing just tells you if there are wheels in the box. RNA-Seq and other -seqs might shed some light on how they are used, but they are not particularly robust assays at present. In many ways, the genome is not nearly as important as what you build from it, and that's highlighted quite well by how much DNA we share with a banana. It's just that you can't possibly build a car without wheels, so current biology is all about looking for no-wheels-disease, because that's simple enough for our current understanding of biology. All the programs, computing power and sequencing data in the world won't help us build a better model of how to take wheels and blocks and build a car. We're missing multiple dimensions of data from the intra/extracellular environment.
I'm not a bioinformatician (yet), so my opinion on the biggest challenges in bioinformatics needs a big [citation needed]. However, I would say the biggest impediment to data nirvana is that we're not dealing with complexity very well. In many instances, the bioinformatician's promise to the biologist is "don't worry, I'll do all the computer stuff, you worry about the cell stuff", and I think that's fundamentally a bad approach. It's also true the other way around, so I'm not blaming anyone specifically here; this is just the system we ended up with. If I had to guess, I'd say 80-90% of programs are written without consideration given to someone else having to read the code. 9-19% of code is written with consideration given to someone just-like-them having to read the code, and knowing, for example, that [::-1] reverses a sequence in Python, rather than using the more explicit reversed() built-in. And <1% is written for biologists who don't know all the neat tricks, and just want to know fundamentally what has happened to their data. Currently that 9% of code is probably the most popular and accounts for most of the software actually used, but I hope in the future the <1% code will be the most popular, even if it's slower to run. I could talk in depth about how to make software more understandable to biologists, but like some sick joke I wouldn't recommend it. You asked what's best for humanity/science, not what's best for current working scientists, and there's a big difference. I think a lot of biologists who can't analyse data should lose their jobs, and a lot of computer scientists who call themselves bioinformaticians because they wrote an aligner should lose their jobs too, but some of those people are my friends, and I want my friends to be happy and successful. You see the problem. Not to mention I'd be the first to get sacked ;)
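For readers who haven't met the two idioms mentioned above, a small sketch of the difference. The list of bases is just an assumed example.

```python
bases = ["A", "C", "G", "T"]  # hypothetical example data

# Terse idiom: a slice with step -1 builds a new, reversed list.
reversed_by_slice = bases[::-1]

# More explicit: reversed() returns an iterator, so wrap it in list()
# to get the same kind of result as the slice.
reversed_by_builtin = list(reversed(bases))

print(reversed_by_slice)                         # ['T', 'G', 'C', 'A']
print(reversed_by_slice == reversed_by_builtin)  # True
```

Both give the same answer and neither changes the original list. The second one just says what it does in plain words, which is the whole point about who the code is written for.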
Sticking with the good-things-come-in-threes meme, my final point would be that our incentives are all screwed up. The biological sciences have a very outdated way of doing things and are not at all prepared for the digital age. Too much time/attention is spent on publishing in special journals, and not nearly enough money is spent on maintaining software, integrating databases/information, or educating users. There's also much too much trust. There's a famous story floating around the chatrooms, which may very well be untrue, that someone hired an out-of-lab bioinformatician to get a certain result, and that result ended up contradicting the findings of their work, so they fired that bioinformatician and found another one who returned the 'correct' result. There's no evidence beyond the circumstantial for this, of course, so nothing to follow up on, but it doesn't matter if it's true or not. It's plausible, and I can't see any way around it. Unlike physical experiments, you can pick and choose your statistical test, your mapper, your differential expression software, your pathway analysis tool, and all the parameter switches in between. You can p-hack your way to a Nobel Prize. That has to change.
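A rough back-of-the-envelope sketch of why picking and choosing matters. It assumes each analysis choice behaves like an independent test at the usual 0.05 threshold, which is a simplification (different tools run on the same data are correlated), and the function name is made up. Under that assumption, the chance of at least one spurious "significant" result is 1 - (1 - alpha)^n.

```python
def chance_of_false_positive(n_analyses: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n independent tests.

    Assumes there is no real effect and that the tests are independent.
    """
    return 1 - (1 - alpha) ** n_analyses


# Hypothetical numbers of tool/parameter combinations tried.
for n in (1, 5, 20):
    print(n, round(chance_of_false_positive(n), 3))
```

With one analysis the risk is the nominal 5%. By the time you have tried 20 combinations of mapper, test and parameters, it is closer to two in three under these assumptions.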

Anyway, that's enough unpopular opinions for one day :)

score 3 · Answer 3 · 2016-07-18

I'd argue that the largest barrier is our limited understanding of the biology. And our ignorance compounds at each stage on the progression from genotype to phenotype. I'd say that genome sequencing and assembly is, for the most part, a solved problem. The prediction of coding sequences from that genome is reasonably good but incomplete (particularly for novel genes). The prediction of gene function is limited to homology to known proteins, and only sufficient for enzymatic classification (kinase, transcription factor, etc). Despite decades of research in transcriptional regulation, we can't predict gene regulatory networks with any confidence, and instead must resort to phenomenological descriptions via RNA-Seq, ChIP-Seq, miRNA-Seq, etc. Similarly, we can't predict de novo which components in the cell interact, and what the consequences of those interactions are. And sequence variation compounds those challenges, since, with the exception of a few well-characterized motifs (such as enzyme catalytic residues), we don't understand the consequences of those variants. And that's before considering the complexity of cell-cell interactions.

Venter's minimal cell project highlights our lack of understanding. Fully one-third of the genes required for viability in the latest iteration have no known function.

Finally, the assertion of genetic determinism ("most traits are purely genetic") is either intentionally provocative or reflective of profound ignorance (apologies if that sounds harsh, but I'd respond the same way to any student of mine making this claim). Consider bacteriophage lambda, probably the best-characterized organism (from a mechanistic standpoint). It can adopt one of two life cycles, lytic or lysogenic. Independent of genetic variation, that decision is dictated solely by its environment.

score 2 · Answer 4 · 2016-07-17

7.8 years ago

Giovanni M Dall'Olio 28k

I would say:

not enough sequences and data available for representing all variation
data not clean, insufficient metadata, data not mapped to ontologies
difficulty in integrating data.