Hi everyone! I am a university student working on my Master's thesis. I have been working on a paper called Xpresso, which aims to predict gene expression levels from DNA sequences using deep learning techniques. Now my lecturers have asked me to create my own dataset made up of sequences and gene expression values. Usually, works that tackle this problem find the locations of the transcription start sites (TSSs) of the various genes and cut a region of k bp upstream and downstream of each TSS, associating each such DNA sequence with a target value, a real number called the gene expression level. I tried to cut the DNA from the reference genome FASTA file using the GTF "gene" annotations, but I realized that this is not enough to cut the right regions, because the performance of my predictive model falls dramatically compared with the performance obtained on Xpresso's dataset. So, I would like to ask you:
- How would you create such a dataset?
- I see that some people use BED files and BigWig files. Can you explain why they could be useful?
My knowledge of the biology domain is very limited, so any advice is valuable to me. Thanks to all!
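
For reference, here is a minimal sketch of the windowing step described above, so it is clear what "cutting around the TSS" means here. It is not Xpresso's actual pipeline. It assumes the TSS is taken from the GTF "gene" record in a strand-aware way (the start coordinate on the + strand, the end coordinate on the - strand), that GTF coordinates are 1-based and inclusive, and that minus-strand windows are reverse-complemented. The function names and the toy genome are hypothetical, and a plain dict stands in for the real FASTA file.

```python
# Hypothetical helpers; a dict of chromosome -> sequence stands in for a FASTA file.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def tss_from_gtf_line(line):
    """Return (chrom, tss, strand) for a GTF 'gene' record, or None otherwise.

    GTF coordinates are 1-based and inclusive. Assumption: the gene-level TSS
    is the start on the + strand and the end on the - strand.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "gene":
        return None
    chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
    tss = start if strand == "+" else end
    return chrom, tss, strand


def window_around_tss(genome, chrom, tss, strand, upstream, downstream):
    """Cut `upstream` bp before the TSS and `downstream` bp starting at the TSS.

    The result is in transcript orientation and always has length
    upstream + downstream; positions outside the chromosome are padded with N.
    """
    seq = genome[chrom]
    if strand == "+":
        lo = tss - 1 - upstream      # 0-based inclusive start
        hi = tss - 1 + downstream    # 0-based exclusive end
    else:
        # On the - strand, "upstream" lies at higher genomic coordinates.
        lo = tss - downstream
        hi = tss + upstream
    left_pad = "N" * max(0, -lo)
    right_pad = "N" * max(0, hi - len(seq))
    piece = left_pad + seq[max(lo, 0):min(hi, len(seq))] + right_pad
    return reverse_complement(piece) if strand == "-" else piece


if __name__ == "__main__":
    toy_genome = {"chr1": "AAAACCCCGGGGTTTT"}
    gtf_line = 'chr1\ttoy\tgene\t5\t12\t.\t-\t.\tgene_id "g1";'
    chrom, tss, strand = tss_from_gtf_line(gtf_line)
    print(window_around_tss(toy_genome, chrom, tss, strand, upstream=2, downstream=3))
```

If the strand is ignored, every minus-strand gene ends up with a window centred on the wrong end of the gene and in the wrong orientation, so this is one thing I am checking in my own extraction.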