I was looking into training a machine learning / deep learning model on bytes. Recently I was working on a way to decrease the size of a .fasta
file using _bit shifting_ (i.e., packing each nucleotide, which normally takes 8 bits as an ASCII character, into 4 bits).
Now that machine learning and artificial intelligence are dominating the industry, or at least trending that way, it got me thinking: what if we could use the bytes to develop a model? The problem I can think of is that it might not be biologically relevant. I am not sure; this is where I started getting confused and wanted to reach out here.
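For readers unfamiliar with the idea, packing nucleotides into 4 bits each (two bases per byte) with bit shifting could be sketched roughly as follows. This is only an illustration: the `CODES` table and the `pack`/`unpack` helpers are hypothetical names and an assumed encoding, not the actual format from the post.

```python
# Hypothetical 4-bit code table (an assumption, not the poster's actual mapping).
# One bit per base means an ambiguity code like N can be the union A|C|G|T.
CODES = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000, "N": 0b1111}
DECODE = {value: base for base, value in CODES.items()}


def pack(seq: str) -> bytes:
    """Pack a nucleotide string into bytes, two bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 2):
        hi = CODES[seq[i]] << 4  # first base goes in the high nibble
        # Second base goes in the low nibble; 0 pads an odd-length sequence.
        lo = CODES[seq[i + 1]] if i + 1 < len(seq) else 0
        out.append(hi | lo)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the string; `length` is needed to drop the padding nibble."""
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)    # high nibble
        nibbles.append(byte & 0x0F)  # low nibble
    return "".join(DECODE[n] for n in nibbles[:length])


seq = "ACGTN"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(seq), "bases ->", len(packed), "bytes")
```

Note that the sequence length has to be stored separately, since an odd number of bases leaves one unused nibble in the last byte.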
How about fastq.gz?
So essentially you are working on a compression algorithm, is that right? If so, be sure to benchmark your idea against the hundreds of existing, fast compression methods, be it standard tools such as gzip, bzip2, or zstd, or genomics-centered methods such as CRAM.
What does compression have to do with biology? Please explain better if I miss the point.
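A quick baseline for the benchmarking suggested above might look like the sketch below. It uses only compressors from the Python standard library (`gzip`, `bz2`, `lzma`); zstd and CRAM need external tools and are left out. The random test sequence and the `compressed_sizes` helper are assumptions made for illustration, and real genomic data is more repetitive than uniform random bases, so numbers on actual FASTA/FASTQ files will differ.

```python
import bz2
import gzip
import lzma
import random

# Synthetic stand-in for sequence data (an assumption; use real reads in practice).
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(100_000)).encode("ascii")


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of `data` under several stdlib compressors."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
        "lzma": len(lzma.compress(data)),
    }


sizes = compressed_sizes(seq)
for name, size in sizes.items():
    # Bits per base makes the comparison with a fixed 4-bit packing direct.
    print(f"{name:>5}: {size} bytes ({8 * size / len(seq):.2f} bits per base)")
```

A fixed 4-bit packing always costs exactly 4 bits per base, so any general-purpose compressor that gets below that on the same input is already doing better in terms of size.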
How is this different from "Tried building a compact sequence format with 4-bit storage"?
My understanding is that the OP is considering building a model using the sequences in fastq files, so the compression is just an intermediate step. However, it is not clear to me what model s/he has in mind... Some sort of LLM using fastq data...?