Hello to all, this is my first question. I am trying to create training data for a neural network that will predict DNA binding sites of proteins. I have Ensembl data that I use to fill in the feature vector I build for each nucleotide. Three sample rows from the data are below.
chrom txStart txEnd name score strand cdsStart cdsEnd score exonCount exonLengths exonPos
chr4 11051044 11076204 ENSMUST00000058183 0 - 11051096 11076176 0 9 181,57,102,134,103,57,123,100,225, 0,2387,7967,9816,11005,15606,19155,22099,24935,
chr3 121723536 121735052 ENSMUST00000029771 0 + 121723714 121734239 0 6 266,106,218,179,160,947, 0,1440,5800,7987,8844,10569,
chr5 44472132 44799707 ENSMUST00000070748 0 - 44473281 44799493 0 8 1380,152,124,84,123,173,103,346, 0,6960,8093,60528,62905,69554,197224,327229,
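For reference, this is roughly how I read one row into absolute exon coordinates. It is only a sketch: `parse_row` is a name I made up, and I am assuming the exonPos values are offsets relative to txStart with half-open intervals, which the sample rows seem to support (the last offset plus the last length equals txEnd minus txStart).

```python
def parse_row(line):
    """Split one whitespace-delimited annotation row and compute absolute exon intervals."""
    f = line.split()
    tx_start, tx_end = int(f[1]), int(f[2])
    # exonLengths and exonPos are comma-separated with a trailing comma
    lengths = [int(x) for x in f[10].rstrip(",").split(",")]
    offsets = [int(x) for x in f[11].rstrip(",").split(",")]
    # Assumption: offsets are relative to txStart, intervals are [start, end)
    exons = [(tx_start + off, tx_start + off + length)
             for off, length in zip(offsets, lengths)]
    return {
        "chrom": f[0],
        "tx_start": tx_start,
        "tx_end": tx_end,
        "name": f[3],
        "strand": f[5],
        "exons": exons,
    }


row = ("chr3 121723536 121735052 ENSMUST00000029771 0 + 121723714 121734239 0 6 "
       "266,106,218,179,160,947, 0,1440,5800,7987,8844,10569,")
rec = parse_row(row)
```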
From the Ensembl data I can derive features such as which strand a nucleotide lies on and whether it falls inside an exon. I also want to encode the type of nucleotide (A, T, C, G, N) in its vector, and this is where I run into a problem. The Ensembl data covers only parts of the sequence, whereas the FASTA data contains all of the sequenced DNA. For nucleotides that are not covered by the Ensembl data I will store only the A/T/C/G/N information in their vectors, and for those that are covered I will add the extra features. How can I combine the Ensembl and FASTA data to achieve this?
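To make the goal concrete, here is a minimal sketch of the vector layout I have in mind, run on a toy sequence rather than a real FASTA file. The function names, the dict keys, and the 8-element layout (5 one-hot base slots, then annotated flag, strand, in-exon flag) are my own assumptions, and I assume 0-based half-open coordinates. For a whole chromosome a list of lists would probably be too heavy and something like a NumPy array would be needed.

```python
BASES = "ATCGN"  # one-hot order; anything unrecognised is treated as N


def one_hot(base):
    """Return a 5-element one-hot list for a single nucleotide."""
    vec = [0] * len(BASES)
    idx = BASES.find(base.upper())
    vec[idx if idx >= 0 else BASES.index("N")] = 1
    return vec


def build_vectors(seq, transcripts):
    """Build one vector per position of seq.

    Layout: [A, T, C, G, N, annotated, strand, in_exon].
    Positions outside every transcript keep zeros in the last three slots.
    """
    vectors = [one_hot(b) + [0, 0, 0] for b in seq]
    for t in transcripts:
        strand = 1 if t["strand"] == "+" else -1
        # Mark every position spanned by the transcript
        for pos in range(t["tx_start"], min(t["tx_end"], len(seq))):
            vectors[pos][5] = 1
            vectors[pos][6] = strand  # a later overlapping transcript overwrites this
        # Mark exon positions
        for start, end in t["exons"]:
            for pos in range(start, min(end, len(seq))):
                vectors[pos][7] = 1
    return vectors


# Toy data standing in for one chromosome and one annotation row
toy_seq = "ACGTNNACGTAC"
toy_tx = [{"tx_start": 2, "tx_end": 10, "strand": "-", "exons": [(2, 4), (7, 10)]}]
vectors = build_vectors(toy_seq, toy_tx)
```

The part I do not know how to do well is the step this sketch skips: getting `seq` for each chromosome out of the FASTA file and matching it to the chrom column of the Ensembl rows.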