3.7 years ago
Igor Filippov
I'd like to train a model on two sets of sequences:
- -249..+50 around the TSS of a set of genes
- Random 300bp sequences from non-coding regions
I have trouble sampling the latter. My idea was to generate random positions in the genome and, for each position, check whether it falls in a "gene" region by examining the GFF3 annotation. If the position is not within a gene, I can use the 300 bp window starting there as a random non-coding sequence.
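The idea above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the function names (`load_gene_intervals`, `overlaps`, `sample_windows`), the `chrom_sizes` dictionary and the `max_tries` cap are made up for illustration, only features of type `gene` in column 3 of the GFF3 are treated as genic, and the whole 300 bp window (not just its start) is required to lie outside genes. It does not check for runs of `N` or for overlap with the TSS windows.

```python
import bisect
import random
from collections import defaultdict


def load_gene_intervals(gff3_lines, feature_type="gene"):
    """Return {seqid: merged, sorted 0-based half-open gene intervals}."""
    raw = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5 or cols[2] != feature_type:
            continue
        # GFF3 coordinates are 1-based and inclusive; convert to half-open.
        raw[cols[0]].append((int(cols[3]) - 1, int(cols[4])))
    merged = {}
    for seqid, ivs in raw.items():
        ivs.sort()
        out = [list(ivs[0])]
        for s, e in ivs[1:]:
            if s <= out[-1][1]:  # overlapping or adjacent: extend
                out[-1][1] = max(out[-1][1], e)
            else:
                out.append([s, e])
        merged[seqid] = [tuple(iv) for iv in out]
    return merged


def overlaps(intervals, start, end):
    """True if [start, end) overlaps any interval of a merged, sorted list."""
    # Index of the first interval whose start is >= end.
    i = bisect.bisect_left(intervals, (end,))
    # Only the interval just before it can still reach into the window.
    return i > 0 and intervals[i - 1][1] > start


def sample_windows(chrom_sizes, genes, n, width=300, rng=random,
                   max_tries=100000):
    """Rejection-sample up to n windows that overlap no gene interval."""
    names = [c for c in chrom_sizes if chrom_sizes[c] >= width]
    # Weight each sequence by its number of possible window starts.
    weights = [chrom_sizes[c] - width + 1 for c in names]
    found = []
    tries = 0
    while len(found) < n and tries < max_tries:
        tries += 1
        chrom = rng.choices(names, weights=weights)[0]
        start = rng.randrange(chrom_sizes[chrom] - width + 1)
        if not overlaps(genes.get(chrom, []), start, start + width):
            found.append((chrom, start, start + width))
    return found
```

With the annotation opened as a text file, `load_gene_intervals(fh)` would give the gene intervals, and `sample_windows(chrom_sizes, genes, 1000)` would return up to 1000 `(seqid, start, end)` tuples in 0-based half-open coordinates. Merging the intervals first lets each candidate be checked with a single binary search instead of a scan over all genes.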
However, I was wondering whether this problem has already been solved and whether there are existing tools for it.
Thanks in advance.