Question

Incorporating Ensembl and Fasta Data

6.7 years ago

sacamanobob800 • 0

Hello all, this is my first question. I am trying to create training data for a neural network that will predict the DNA binding sites of proteins. I use Ensembl annotation data to fill in a feature vector for each nucleotide. Three sample rows from the data are below.

chrom  txStart    txEnd      name                score  strand  cdsStart   cdsEnd     score  exonCount  exonLengths                          exonPos
chr4   11051044   11076204   ENSMUST00000058183  0      -       11051096   11076176   0      9          181,57,102,134,103,57,123,100,225,   0,2387,7967,9816,11005,15606,19155,22099,24935,
chr3   121723536  121735052  ENSMUST00000029771  0      +       121723714  121734239  0      6          266,106,218,179,160,947,             0,1440,5800,7987,8844,10569,
chr5   44472132   44799707   ENSMUST00000070748  0      -       44473281   44799493   0      8          1380,152,124,84,123,173,103,346,     0,6960,8093,60528,62905,69554,197224,327229,
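
To make the layout concrete, here is a minimal Python sketch of how one row could be turned into absolute exon intervals. It assumes the columns follow BED12-style conventions, i.e. 0-based half-open coordinates with exonPos given relative to txStart. That assumption is consistent with the sample rows (the last exonPos plus the last exonLength equals txEnd - txStart), but it is not confirmed by the data source. The function name `exon_intervals` is only illustrative.

```python
def exon_intervals(line):
    """Parse one whitespace-separated row into (chrom, strand, exons).

    Assumption: exonPos values are offsets from txStart and intervals are
    0-based, half-open [start, end), as in BED12.
    """
    fields = line.split()
    chrom, tx_start, strand = fields[0], int(fields[1]), fields[5]
    # Both list columns end with a trailing comma, so skip empty pieces.
    lengths = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    exons = [(tx_start + off, tx_start + off + length)
             for off, length in zip(offsets, lengths)]
    return chrom, strand, exons


row = ("chr3 121723536 121735052 ENSMUST00000029771 0 + 121723714 121734239 "
       "0 6 266,106,218,179,160,947, 0,1440,5800,7987,8844,10569,")
chrom, strand, exons = exon_intervals(row)
print(chrom, strand, exons[0], exons[-1])
```

Under that assumption, the end of the last exon should coincide with txEnd, which is a cheap sanity check to run on every row.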

From the Ensembl data I can derive some features, such as which strand a nucleotide's transcript is on and whether the nucleotide lies in an exon. I also want to record the nucleotide type (A, T, C, G, N) in its vector, and this is where I run into a problem. The Ensembl data covers only parts of the genome, whereas the FASTA data contains the whole DNA sequence. For nucleotides not covered by the Ensembl data I will store only the A/T/C/G/N information, and for those that are covered I will add the extra features. How can I combine the Ensembl and FASTA data to this end?
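
To illustrate what I am after, here is a rough sketch of the vector layout on a toy sequence. It is not a working pipeline: the helper names (`one_hot`, `featurize`), the 9-element layout, and the toy coordinates are all made up for illustration. Reading the real chromosome sequence from the FASTA file is left out, and the annotation intervals are assumed to be 0-based and half-open, in the same coordinate system as the sequence.

```python
BASES = "ACGTN"


def one_hot(base):
    """One-hot encode a base over A, C, G, T, N; anything else counts as N."""
    b = base.upper()
    if b not in BASES:
        b = "N"
    return [1 if b == x else 0 for x in BASES]


def featurize(seq, seq_start, transcripts):
    """Build one vector per base of seq.

    seq covers genomic positions [seq_start, seq_start + len(seq)).
    transcripts is a list of (tx_start, tx_end, strand, exons), where exons
    is a list of (start, end) intervals (hypothetical structure).
    Vector layout: 5 one-hot base slots, then
    [annotated, plus_strand, minus_strand, in_exon].
    """
    seq_end = seq_start + len(seq)
    # Every base gets its one-hot code; annotation slots default to 0.
    vectors = [one_hot(b) + [0, 0, 0, 0] for b in seq]
    for tx_start, tx_end, strand, exons in transcripts:
        strand_slot = 6 if strand == "+" else 7
        for pos in range(max(tx_start, seq_start), min(tx_end, seq_end)):
            vec = vectors[pos - seq_start]
            vec[5] = 1            # covered by an annotated transcript
            vec[strand_slot] = 1  # strand of that transcript
        for ex_start, ex_end in exons:
            for pos in range(max(ex_start, seq_start), min(ex_end, seq_end)):
                vectors[pos - seq_start][8] = 1  # inside an exon
    return vectors


# Toy data: 10 bases starting at position 100, one minus-strand transcript.
seq = "ACGTNACGTA"
transcripts = [(102, 108, "-", [(102, 104), (106, 108)])]
vectors = featurize(seq, 100, transcripts)
for pos, vec in enumerate(vectors, start=100):
    print(pos, vec)
```

Bases outside every transcript keep only their one-hot slots, which matches the "ATCGN only" case described above. For a whole chromosome, a per-position Python loop like this would be slow, so an array-based version would probably be needed.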

fasta genome sequence ensembl • 1.1k views

updated 6.7 years ago by Emily 23k • written 6.7 years ago by sacamanobob800 • 0