Forum: DNA in pixels
3
mervsen30 wrote:

Hi, I am a newbie to bioinformatic analysis, coming from a different background.

So I would like to know whether my idea is totally dumb or whether I should keep at it.

I wonder how it would be if we stored DNA sequence data in pixels, giving different RGB values to different nucleotides. Wouldn't it be easier and more user-friendly to store data, compare sequences and find alignments using image processing tools? I imagine a standard coloring code for annotations, which could be refined for further analysis. Finding similar sequences by overlaying images would be easy. I could use machine learning algorithms to find patterns that distinguish genes from non-coding sequences.

Does it make any sense to you, or should I keep studying and not waste my time on these ideas? :)

Thank you.

ADD COMMENTlink modified 2 days ago by cmarmanjuriya0 • written 4 days ago by mervsen30
2

One-hot encoding does this (or you can rethink your problem in this way): https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

ADD REPLYlink modified 4 days ago • written 4 days ago by Alex Reynolds31k
2

There are plenty of different ways data can be encoded/represented. However, not all of them are equally useful. You need to ask what the benefits of the change of representation are. As you've found already, image representations of sequences have been tried before, but they haven't really had an impact in the field, so the purported advantages may not be that compelling. The machine learning field has been looking at applying its shiny new toys to as many fields as possible, and bioinformatics is no exception. There were quite a few papers a couple of years ago using image representations of sequences with CNNs. The problem is that the published papers I've seen don't use a state-of-the-art bioinformatics approach for comparison (if they even do a comparison), so I am unimpressed.

ADD REPLYlink written 4 days ago by Jean-Karim Heriche23k
6
Sheffield, UK
i.sudbery9.7k wrote:

In biological ML problems, DNA is often encoded using a one-hot encoding, which I guess is in reality similar to image encoding, but with 4 channels rather than 3.

For an application example, see pretty much anything by Anshul Kundaje, like this paper with Julia Zeitlinger: https://www.biorxiv.org/content/10.1101/737981v3.full.pdf

So for example, the 5 base sequence TATAC would be stored as

   1  2  3  4  5
A  0  1  0  1  0
T  1  0  1  0  0
G  0  0  0  0  0
C  0  0  0  0  1
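
For anyone who wants to reproduce the table above, here is a minimal sketch in plain Python. The function name `one_hot` and the row order `ATGC` are just choices made for this illustration; real pipelines would normally build a NumPy array or tensor instead.

```python
BASES = "ATGC"  # row order used in the table above (an arbitrary choice)

def one_hot(seq):
    """Return one row per base in BASES, one column per sequence position."""
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in BASES}

rows = one_hot("TATAC")
for base in BASES:
    # Print each row in the same layout as the table
    print(base, " ".join(str(v) for v in rows[base]))
```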

Many standard image analysis ML network structures, such as deep convolutional networks, can then be applied to it. Now, really, you only need 3 rows to encode four bases, so you could encode it:

   1  2  3  4  5
A  0  1  0  1  0
T  1  0  1  0  0
G  0  0  0  0  0

With the assumption that a base with 0 in every position is C.

Now if we relabel those rows:

   1  2  3  4  5
R  0  1  0  1  0
B  1  0  1  0  0
G  0  0  0  0  0

We have something that looks remarkably like an unrolled bitmap image, but with only 1 bit for each of the RGB channels.
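
To make the relabelling concrete, here is a rough sketch that turns a sequence into a row of (R, G, B) pixel tuples and back. The channel assignment (A to red, T to blue, G to green, C as all zeros) simply follows the relabelled table; the use of 255 for "on" and the names `PIXEL`, `seq_to_pixels` and `pixels_to_seq` are assumptions made for this example, not any standard.

```python
# Hypothetical colour code following the relabelled rows:
# A -> R channel, T -> B channel, G -> G channel, C -> all channels off.
PIXEL = {
    "A": (255, 0, 0),
    "T": (0, 0, 255),
    "G": (0, 255, 0),
    "C": (0, 0, 0),
}
BASE = {rgb: base for base, rgb in PIXEL.items()}

def seq_to_pixels(seq):
    """Encode a DNA string as one (R, G, B) tuple per base."""
    return [PIXEL[b] for b in seq.upper()]

def pixels_to_seq(pixels):
    """Decode a list of (R, G, B) tuples back into a DNA string."""
    return "".join(BASE[tuple(p)] for p in pixels)

pixels = seq_to_pixels("TATAC")
print(pixels)
print(pixels_to_seq(pixels))
```

Note that this scheme has no room for N or other ambiguity codes, which is one practical reason the plain 4-row one-hot form is more common.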

This is useful for applying DNN architectures designed for image analysis (or, for that matter, sound analysis) to DNA, but I'm not sure anyone uses this encoding for anything like alignment algorithms.

ADD COMMENTlink modified 4 days ago • written 4 days ago by i.sudbery9.7k
3
Republic of Ireland
Kevin Blighe67k wrote:

You can store DNA sequence data in anything, provided that the encoding and decoding —and indeed the format specification— are well-defined, and that the method of storage is not prone to decay or to becoming compromised / modified.

One can technically store DNA sequence data by arranging pieces of dirt in well-defined patterns on the floor. Although, when the cat or dog comes along, or if somebody opens the front door and a gust of wind comes in, then the data will be compromised.

Storing in pixels via RGB values would seem like a useful idea. I also had other ideas about how to store such data, e.g., an entire genome sequence, in a single image.

Kevin

ADD COMMENTlink modified 3 days ago • written 4 days ago by Kevin Blighe67k
2
Mexico
JC12k wrote:

On data storage: we use ASCII to encode sequences (and make them readable), but there are other formats, like the 2bit format from UCSC, that Blat can use for quick searches.
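
As a rough illustration of why a 2-bit encoding is compact, the sketch below packs four bases into each byte. The base-to-bits mapping (T=00, C=01, A=10, G=11) is the one I understand the UCSC format to use, but this is not a .2bit reader or writer: the real format also has a header, a sequence index, and blocks for N runs and soft-masking, none of which are handled here. The names `pack` and `unpack` are made up for this example.

```python
CODE = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}  # assumed mapping, see note above
DECODE = {bits: base for base, bits in CODE.items()}

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)

def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(DECODE[(byte >> shift) & 0b11])
    return "".join(bases[:length])

packed = pack("TATAC")
print(len("TATAC"), "ASCII bytes vs", len(packed), "packed bytes")
print(unpack(packed, 5))
```

The sequence length has to be stored separately, because the padding bits in the last byte are indistinguishable from T.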

Regarding using ML methods for image analysis, I seriously doubt which of them you could apply, as those methods cannot take evolution or similar concepts into account.

ADD COMMENTlink written 4 days ago by JC12k
0
mervsen30 wrote:

Thanks for sharing your thoughts. I've finally found a paper I was looking for: http://www.basic.northwestern.edu/g-buehler/genomes/g_append.HTM But it was written in 2012; I'll try to find more.

While learning ML, for example, we categorise CT images containing an intracranial hemorrhage as epidural or subdural. What I meant was: if I had DNA sequences as images, say images of bacterial and viral DNA sequences with a color value for each base, and I trained an ML model to distinguish between them, would it be successful? And could it be extended to detect other patterns?

On the other hand, by representing bases as colors, you could overlay images to detect the matched areas on a single screen.

Thank you.

ADD COMMENTlink written 4 days ago by mervsen30
1

How would the overlaying work with genomes that are not 100% colinear (100% synteny)?
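
A toy sketch of the concern: a naive overlay compares position i of one sequence with position i of the other, so a single inserted base shifts everything downstream out of register. The sequences and the function name `overlay_identity` are invented for this illustration.

```python
def overlay_identity(a, b):
    """Fraction of positions that agree when two sequences are overlaid without gaps."""
    n = min(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n

ref = "ACGTTGCAAGCTTAGGCATC"        # made-up reference sequence
shifted = ref[:5] + "G" + ref[5:]   # same sequence with one inserted base

print(overlay_identity(ref, ref))      # identical sequences overlay perfectly
print(overlay_identity(ref, shifted))  # one insertion breaks most downstream matches
```

Alignment algorithms exist largely to deal with exactly this, by allowing gaps rather than comparing fixed positions.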

ADD REPLYlink written 4 days ago by 5heikki9.0k