Hi, I have a model that takes a one-hot encoded sequence for a location on the human genome. However, I'm having trouble finding a way to avoid being limited by either time or space constraints when reading in the FASTA file with the sequences for the model. The file covers the whole human genome, so it's quite large. So far, I've used bedtools getfasta to make a FASTA file with the sequences grouped the way I need them (in 100 bp bins). After that, I tried loading that file and pre-generating all the one-hot encodings needed for my model, but this makes me run out of memory. Conversely, when I try to access the FASTA file and generate each one-hot encoding on the fly (as it's needed as model input), performance is quite slow (I suspect due to all the file I/O).
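For concreteness, the on-the-fly approach I mean is essentially equivalent to this minimal sketch (simplified: a plain dict stands in for the real per-bin FASTA lookup, and all names here are illustrative, not my actual code):

```python
import numpy as np

# Column index for each base; anything else (e.g. N) encodes as all zeros.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) float32 array."""
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            arr[i, j] = 1.0
    return arr

# Stand-in for fetching one 100 bp bin from the bedtools getfasta output,
# keyed the way getfasta names records ("chrom:start-end").
bins = {"chr1:0-100": "ACGT" * 25}

# Encoding happens per bin, right before the sequence goes into the model.
encoded = one_hot(bins["chr1:0-100"])
print(encoded.shape)  # (100, 4)
```

In my real code the dict lookup is replaced by reading the record out of the FASTA file each time, which is where I think the I/O cost comes from.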
Does anyone have suggestions for how to organize this sequence data or parse this FASTA file quickly, i.e., avoiding both constant file I/O and loading the entire file into memory? Any help is very appreciated. Thanks!