Question: one hot encoding for human genome
4.5 years ago
eneskemalergin wrote:

Hello all, I am working on a project that requires a one-hot encoded human genome as input data. I came up with a Python script, but it's not memory-efficient. It works, but for a 3 GB FASTA file it uses almost 20 GB of memory. Can you help me improve the memory efficiency (and, if possible, the time efficiency) of this algorithm?

import pandas as pd
import numpy as np
from Bio import SeqIO
import time 
import h5py

def vectorizeSequence(seq):
    # the order of the letters is not arbitrary.
    # Flip the matrix up-down and left-right for the reverse complement
    ltrdict = {'a':[1,0,0,0],'c':[0,1,0,0],'g':[0,0,1,0],'t':[0,0,0,1], 'n':[0,0,0,0]}
    return np.array([ltrdict[x] for x in seq])

starttime = time.time()
fasta_sequences = SeqIO.parse(open("hg38_new.fa"), 'fasta')
with h5py.File('genomeEncoded.h5', 'w') as hf:
    for fasta in fasta_sequences:
        # get the record name and sequence
        name, sequence = fasta.id, fasta.seq.tostring().lower()
        # one hot encode...
        data = vectorizeSequence(sequence)
        print name + " is one hot encoded!"
        # write to hdf5, keyed by chromosome name
        hf.create_dataset(name, data=data)
        print name + " is written to dataset"

endtime = time.time()
print "Encoding is done in " + str(endtime - starttime) + " seconds"

Thanks in advance...

machine learning genome • 3.6k views
modified 4.5 years ago by Steven Lakin • written 4.5 years ago by eneskemalergin

What is a one-hot encoded genome, for the benefit of non-programmers? Is it this?

modified 4.5 years ago • written 4.5 years ago by GenoMax

Yes, in fact I am also curious to know what one-hot encoding is. Is it jargon, or some memory-efficiency method?

written 4.5 years ago by ivivek_ngs

For problems that are categorical, such as representing a nucleotide in a set {A, C, G, T}, we can encode each nucleotide in a separate column of a matrix where column 1 corresponds to A being present, column 2 to C being present, and so on. This kind of representation is called one-hot encoding, because only one bit is set to true. There are other ways to encode categorical data, but this is a common one for creating a numeric matrix that can be fed into a machine learning algorithm.
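A minimal sketch of that description (my own illustration, not code from the thread), assuming a lowercase DNA string and numpy:

```python
import numpy as np

# One column per nucleotide: a=0, c=1, g=2, t=3.
BASE_INDEX = {'a': 0, 'c': 1, 'g': 2, 't': 3}

def one_hot(seq):
    """One-hot encode a lowercase DNA string into a (len(seq), 4) uint8 matrix."""
    mat = np.zeros((len(seq), 4), dtype=np.uint8)
    for i, base in enumerate(seq):
        col = BASE_INDEX.get(base)
        if col is not None:        # unknown bases such as 'n' stay all-zero
            mat[i, col] = 1
    return mat

print(one_hot("acg"))
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]]
```

Each row has at most one bit set, which is where the name comes from.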

modified 4.5 years ago • written 4.5 years ago by Steven Lakin
4.5 years ago
Steven Lakin (Fort Collins, CO, USA) wrote:

Does your downstream machine learning application require this format? It is common to encode the standard DNA bases bitwise:

A = 0b00
C = 0b01
G = 0b10
T = 0b11

This compresses your encoding to two bits per nucleotide. Right now you're multiplying the size by 4x and incurring the overhead of a numpy array, though numpy is fairly well optimized behind the scenes. However, if you do 2-bit encoding, you need to know how to compute distances on the encoded DNA strings, which usually means you'd have to implement the machine learning algorithm yourself.
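A minimal sketch of such 2-bit packing (my own illustration, assuming uppercase A/C/G/T only, no N, and most-significant-bits-first within each byte):

```python
# Two bits per base: four bases fit in one byte.
CODE = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}

def pack_2bit(seq):
    """Pack a DNA string into bytes, 4 bases per byte, left-aligned."""
    out = bytearray()
    byte = 0
    for i, base in enumerate(seq):
        byte = (byte << 2) | CODE[base]
        if i % 4 == 3:             # a full byte is ready every 4 bases
            out.append(byte)
            byte = 0
    rem = len(seq) % 4
    if rem:                        # left-align a trailing partial byte
        out.append(byte << (2 * (4 - rem)))
    return bytes(out)

print(pack_2bit("ACGT").hex())  # '1b', i.e. 0b00011011
```

Note that a real implementation would also need a policy for N and other ambiguity codes, which don't fit in two bits.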

If your downstream application requires vectorized one-hot encoding as you've shown us here, I don't think you can do better than your numpy code, though you may gain performance by using tuples in your dictionary instead of lists. Perhaps you could look into SciPy's implementation of sparse matrices for better memory management, but I'm not sure what the results will be.

ltrdict = {'a':(1,0,0,0),'c':(0,1,0,0),'g':(0,0,1,0),'t':(0,0,0,1), 'n':(0,0,0,0)}
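For instance, a sketch of building the same one-hot matrix as a SciPy CSR matrix (illustrative only; how much memory this saves depends on the fraction of 'n' bases, since every A/C/G/T row still stores one nonzero plus its indices):

```python
import numpy as np
from scipy import sparse

def one_hot_sparse(seq):
    """One-hot encode a lowercase DNA string as a (len(seq), 4) CSR matrix."""
    cols = {'a': 0, 'c': 1, 'g': 2, 't': 3}
    rows = [i for i, b in enumerate(seq) if b in cols]
    col_idx = [cols[seq[i]] for i in rows]
    data = np.ones(len(rows), dtype=np.uint8)
    return sparse.csr_matrix((data, (rows, col_idx)), shape=(len(seq), 4))

m = one_hot_sparse("acgtn")
print(m.nnz)   # 4 nonzeros; the 'n' row stores nothing
```

One caveat: h5py datasets take dense arrays, so you'd have to store the CSR components (`data`, `indices`, `indptr`) separately if you go this route.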

As an aside, it has been my experience that machine learning with large genomic data in Python incurs a very high overhead cost and is slow due to the nature of the language. C-based languages are likely best for these kinds of applications.

written 4.5 years ago by Steven Lakin

I actually tried bitwise encoding, but it did not work as I expected. Maybe I didn't manage to do it correctly, I don't know.

written 4.5 years ago by eneskemalergin
4.5 years ago
Brian Bushnell (Walnut Creek, USA) wrote:

It looks like you want to convert to a 4-bit encoding. I don't really understand your code - it looks like it's translating bases to arrays of numbers, which would be really inefficient... normally you would translate: A -> 1, C -> 2, G -> 4, T -> 8.

You can simply read 1 base at a time, convert, and write 1 base at a time. There's no reason to use any memory beyond the minimum for efficient buffering (a few KB which will probably be handled automatically).

P.S. If the output needs to be packed at 2 bases per byte, you would need to process 2 bases at a time, and know the endian-ness required... there's actually not enough information given to answer the question, in any case.
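To illustrate the streaming idea (my own sketch, using the 1/2/4/8 code above and skipping FASTA headers; memory use is bounded by the longest line rather than the genome size):

```python
import io

# 256-entry translation table: each input byte maps to its 4-bit code, else 0.
CODES = {'A': 1, 'C': 2, 'G': 4, 'T': 8}
TABLE = bytes(CODES.get(chr(b).upper(), 0) for b in range(256))

def encode_stream(src, dst):
    """Read FASTA text line by line and write one code byte per base."""
    for line in src:
        if line.startswith(">"):   # skip record headers
            continue
        dst.write(line.rstrip("\n").encode("ascii").translate(TABLE))

src = io.StringIO(">chr1\nACGT\nacgt\n")
dst = io.BytesIO()
encode_stream(src, dst)
print(dst.getvalue())   # b'\x01\x02\x04\x08\x01\x02\x04\x08'
```

The same loop works on real files opened with `open(...)`, since only one line is held in memory at a time.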

modified 4.5 years ago • written 4.5 years ago by Brian Bushnell

The reason I want encoded input is that the deep learning model needs to treat all nucleotides with the same weight. If I give them 1, 2, 4, 8, there will be bias between the weights and the output will be compromised.

modified 4.5 years ago • written 4.5 years ago by eneskemalergin

Oh - I guess I don't really understand what kind of format the tool needs. 1, 2, 4, and 8 are all the same weight in binary; each has three 0s and a single 1.

written 4.5 years ago by Brian Bushnell

