Question

Forum:DNA in pixels

3

3.4 years ago

mervsen ▴ 30

Hi, I am a newbie to bioinformatic analysis from a different background.

So I would like to know if my idea is totally dumb, or should I keep on it.

I wonder how it would be if we stored DNA sequence data in pixels, giving different RGB values to different nucleotides. Wouldn't it be easier and more user-friendly to store data, compare and find alignments using image processing tools? I imagine a standard color code for annotations that could be extended for further analysis. Finding similar sequences by overlaying images would be easy. I could use machine learning algorithms to find patterns for distinguishing genes from non-coding sequences.

Does it make any sense to you, or should I keep studying and not waste my time on these ideas? :)

Thank you

genome alignment sequencing • 1.2k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 3.4 years ago by mervsen ▴ 30

2

One-hot encoding does this (or you can rethink your problem in this way): https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

ADD REPLY • link 3.4 years ago by Alex Reynolds 35k

2

There are plenty of different ways data can be encoded/represented. However, not all of them are equally useful. You need to ask what the benefits of the change of representation are. As you've found already, image representations of sequences have been tried before, but they haven't really had an impact in the field, so the purported advantages may not be that compelling. The machine learning field has been looking at applying its shiny new toys to as many fields as possible, and bioinformatics is no exception. There were quite a few papers a couple of years ago using image representations of sequences as input to CNNs. The problem is that the published papers I've seen don't use a state-of-the-art bioinformatics approach for comparison (if they even do a comparison), so I am unimpressed.

ADD REPLY • link 3.4 years ago by Jean-Karim Heriche 27k

score 6 · Answer 1 · 2020-11-18

In biological ML problems, DNA is often encoded using a one-hot encoding, which I guess is in reality similar to an image encoding, but with 4 channels rather than 3.

For an application example, see pretty much anything by Anshul Kundaje, like this paper with Julia Zeitlinger: https://www.biorxiv.org/content/10.1101/737981v3.full.pdf

So for example, the 5-base sequence TATAC would be stored as:

   1  2  3  4  5
A  0  1  0  1  0
T  1  0  1  0  0
G  0  0  0  0  0
C  0  0  0  0  1
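
For anyone who wants to try this, here is a minimal sketch of that encoding in plain Python. The function name `one_hot` and the row order A, T, G, C are just illustrative choices that follow the table above; ML libraries usually expect a numeric array rather than nested lists.

```python
# Row order follows the table above: A, T, G, C (arbitrary, but it must stay fixed).
BASES = "ATGC"


def one_hot(seq):
    """Return a 4 x len(seq) matrix as a list of rows, one row per base in BASES."""
    seq = seq.upper()
    unknown = set(seq) - set(BASES)
    if unknown:
        # Ambiguity codes such as N are not handled in this sketch.
        raise ValueError(f"unsupported symbols: {sorted(unknown)}")
    return [[1 if ch == base else 0 for ch in seq] for base in BASES]


matrix = one_hot("TATAC")
for base, row in zip(BASES, matrix):
    print(base, *row)
```

Running it on TATAC prints the same four rows as the table.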

Many standard image analysis ML network architectures, such as deep convolutional networks, can then be applied to it. Now, really, you only need 3 rows to encode four bases, so you could encode it as:

   1  2  3  4  5
A  0  1  0  1  0
T  1  0  1  0  0
G  0  0  0  0  0

With the assumption that a position with 0 in every row is C.

Now if we relabel those rows:

   1  2  3  4  5
R  0  1  0  1  0
B  1  0  1  0  0
G  0  0  0  0  0

We have something that looks remarkably like an unrolled bitmap image, but with only 1 bit for each of the RGB channels.
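
To make the bitmap analogy concrete, here is a rough sketch that turns a sequence into a row of RGB pixels and back. The channel assignment (A to red, G to green, T to blue, C to black) follows the relabelled table, and scaling the single bit up to 255 is my assumption so that it fits an ordinary 8-bit image; the function names are hypothetical.

```python
# Hypothetical channel assignment, following the relabelled table:
# A -> R, G -> G, T -> B. C has no channel and is stored as all zeros (black).
CHANNEL_OF = {"A": 0, "G": 1, "T": 2}  # index into an (R, G, B) tuple
DECODE = {(255, 0, 0): "A", (0, 255, 0): "G", (0, 0, 255): "T", (0, 0, 0): "C"}


def seq_to_pixels(seq):
    """Encode each base as one (R, G, B) pixel with at most one channel set."""
    pixels = []
    for ch in seq.upper():
        if ch not in "ACGT":
            raise ValueError(f"unsupported symbol: {ch!r}")
        rgb = [0, 0, 0]
        if ch in CHANNEL_OF:
            rgb[CHANNEL_OF[ch]] = 255  # the 1 bit scaled to an 8-bit channel
        pixels.append(tuple(rgb))
    return pixels


def pixels_to_seq(pixels):
    """Decode a row of pixels produced by seq_to_pixels back into bases."""
    return "".join(DECODE[p] for p in pixels)


row = seq_to_pixels("TATAC")
print(row)
print(pixels_to_seq(row))
```

The row of tuples could be handed to an image library to write an actual file, but note that anything other than A, C, G, T (for example N) has no pixel value in this scheme.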

This is useful for applying DNN architectures designed for image analysis (or for that matter sound analysis) to DNA, but I'm not sure anyone uses this encoding for anything like alignment algorithms.

score 3 · Answer 2 · 2020-11-18

You can store DNA sequence data in anything, provided that the encoding and decoding (and indeed the format specification) are well-defined, and that the method of storage is not prone to decay or to becoming compromised or modified.

One can technically store DNA sequence data by arranging pieces of dirt in well-defined patterns on the floor. However, when the cat or dog comes along, or somebody opens the front door and a gust of wind comes in, the data will be compromised.

Storing in pixels via RGB values would seem like a useful idea. I also had other ideas about how to store such data, e.g., an entire genome sequence, in a single image.

Kevin

score 2 · Answer 3 · 2020-11-18

On data storage: we use ASCII to encode sequences (and keep them human-readable), but there are other formats, like the 2bit format from UCSC, that BLAT can use for quick searches.

Regarding ML methods for image analysis, I have serious doubts about which of those methods you could apply, as they cannot take evolution or similar concepts into account.
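
To illustrate the idea behind 2-bit storage, here is a minimal sketch that packs four bases into each byte. This is not the UCSC .2bit file format itself, which as far as I know also carries a header and separate records for runs of N and masked regions; the bit codes and the names `pack` and `unpack` are illustrative assumptions.

```python
# Illustrative 2-bit codes (an assumption for this sketch, not a file-format spec).
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}
BASE = "TCAG"  # inverse lookup: BASE[code] gives the base


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte, first base in the high bits."""
    seq = seq.upper()
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]  # KeyError on anything but A, C, G, T
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from bytes produced by pack."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 3])
    return "".join(bases[:length])


packed = pack("TATAC")
print(len(packed), unpack(packed, 5))
```

The sequence length has to be stored separately, because the padding bits in the last byte are indistinguishable from real bases. Compared with one ASCII byte per base, this takes a quarter of the space.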

score 0 · Answer 4 · 2020-11-19

Thanks for sharing your thoughts. I finally found a paper I was looking for: http://www.basic.northwestern.edu/g-buehler/genomes/g_append.HTM But it was written in 2012; I'll try to find more.

While learning ML, we classify, for example, CT images containing intracranial hemorrhage as epidural or subdural. What I meant was: if I had DNA sequences as images, say images of bacterial and viral DNA sequences with a color value for each base, and I trained an ML model to distinguish between them, would it be successful? And could it be improved to detect other patterns?

On the other hand, by representing bases as colors, you could overlay images to detect the matched areas on a single screen.
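
A rough sketch of what overlaying amounts to on the sequence level: with one pixel per base, two stacked images match exactly where the bases at the same position are equal. The function name `overlay` and the example sequences are made up for illustration.

```python
def overlay(seq_a, seq_b):
    """Mark same-position matches with '|' and mismatches with '.'.

    Only the overlapping prefix is compared; no gaps are introduced.
    """
    n = min(len(seq_a), len(seq_b))
    return "".join("|" if seq_a[i] == seq_b[i] else "." for i in range(n))


a = "TATACGGA"
b = "TATCCGGA"  # a single substitution relative to a
c = "GTATACGG"  # a with one extra base at the start, cut to the same length

print(overlay(a, b))
print(overlay(a, c))
```

The substitution shows up as a single mismatch, but the one-base insertion shifts everything after it, so almost nothing lines up even though the sequences are nearly identical. A plain overlay therefore only works for sequences that are already in register; handling insertions and deletions is what alignment algorithms add.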

Thank you.