I would like to identify upstream ORFs (uORFs) in the 5' UTRs of genes in the genomes of several bat species (Pteropus alecto, for example). These organisms are partially annotated in NCBI, but 99% of the genes are 'predicted'. The genomes are not available in Ensembl or the UCSC Genome Browser. The draft state of the data is also apparent in the download directory: the genome files are listed under CHR_UN rather than in directories labelled with specific chromosomes, i.e. the sequences have not been placed on chromosomes.
Based on this information, is it still possible to perform the analysis using the listed gene locations, even though they are only predictions? My idea was to make a list of 5' UTRs for each predicted gene, then search each UTR for start codons (ideally within a Kozak sequence) that are followed by an in-frame stop codon. If this sounds feasible, how do I create a list of 5' UTRs from this type of data set?
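To make the search step concrete, here is a minimal sketch of what I have in mind, assuming the 5' UTR sequences have already been extracted as plain strings in transcript (5' to 3') orientation. The function name `find_uorfs` and the `min_codons` parameter are my own placeholders, not part of any library. It only reports ATG codons that have an in-frame stop inside the UTR, so it does not score the Kozak context and it will miss uORFs that run into the main coding sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr_seq, min_codons=2):
    """Find ATG...stop ORFs fully contained in a 5' UTR sequence.

    Hypothetical helper: returns a list of (start, end, sequence) tuples,
    where start is 0-based, end is exclusive, and the stop codon is included.
    min_codons counts the start and stop codons as well.
    """
    # Accept lowercase and RNA input by normalising to uppercase DNA.
    seq = utr_seq.upper().replace("U", "T")
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if (end - start) // 3 >= min_codons:
                    uorfs.append((start, end, seq[start:end]))
                break  # first in-frame stop terminates this ORF
    return uorfs


if __name__ == "__main__":
    # Toy sequence, not real bat data.
    for start, end, orf in find_uorfs("GGATGAAATGACC"):
        print(start, end, orf)
```

Every ATG is treated as a separate candidate, so nested or overlapping uORFs in different frames are all reported. Filtering by Kozak strength would be a separate step on the bases flanking each reported start position.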
Thanks for your suggestions.