I was just doing something similar about a week ago.
You may be able to accomplish this using the GenomicFeatures R package.
First load up the following in R:
library(GenomicFeatures)
Then you will need a chromosome sizes file, which you can generate by following the directions in this post: Get chromosome sizes from fasta file (basically, you need the FASTA file of the genome, and then you use samtools to produce the chrominfo/chrom sizes file).
Then read that file into R with:
chrominfo <- read.table(file = 'your/file/path/sizes.genome', sep = '\t')
colnames(chrominfo) <- c("chrom", "length")
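If you already have the .fai index that samtools faidx writes next to the FASTA file, the same two-column table can be built directly from it, since the first two columns of that index are the sequence name and its length. A minimal sketch in base R; the helper name read_chrom_sizes and the toy index written to a temporary file are made up for illustration:

```r
# Hypothetical helper: read a samtools .fai index and keep name + length.
read_chrom_sizes <- function(fai_path) {
  fai <- read.table(fai_path, sep = "\t", header = FALSE,
                    stringsAsFactors = FALSE, quote = "", comment.char = "")
  chrominfo <- fai[, 1:2]
  colnames(chrominfo) <- c("chrom", "length")
  chrominfo
}

# Toy index with made-up sequence names and lengths, just to show the shape.
toy_fai <- tempfile(fileext = ".fai")
writeLines(c("seqA\t1000\t6\t60\t61",
             "seqB\t250\t1030\t60\t61"), toy_fai)
chrominfo_demo <- read_chrom_sizes(toy_fai)
print(chrominfo_demo)
```

With a real index you would call the helper on your own .fai path instead of the temporary file.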
then you should be able to plug chrominfo into GenomicFeatures using the call below (you may want to download the GTF file instead from the NCBI link you provided; makeTxDbFromGFF also accepts GFF3 through its format argument, but this call assumes GTF):
purple.urchin.txdb <- GenomicFeatures::makeTxDbFromGFF(
    organism = "Strongylocentrotus purpuratus",
    format = "gtf",
    file = "~/your/path/here/GCF_000002235.5_Spur_5.0_genomic.gtf",
    chrominfo = chrominfo)
and then you can get the exons as genomic ranges, which can then be written out in BED format (I am unsure whether this meets your criterion (1), one record for each unique, non-overlapping exon):
exons <- exonsBy(purple.urchin.txdb, by = c("gene"))
exons <- unlist(exons)
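The exons object is a set of genomic ranges, not yet a BED file. GRanges coordinates are 1-based and inclusive, while BED is 0-based and half-open, so the start has to be shifted by one when writing the table by hand (rtracklayer::export() can also write a GRanges to BED for you). A rough sketch in base R, assuming a data frame with seqnames, start, end and strand columns such as as.data.frame() produces from a GRanges; the helper name ranges_to_bed and the toy coordinates are made up:

```r
# Hypothetical helper: turn a data frame with seqnames/start/end/strand
# columns (1-based, inclusive coordinates) into BED6 columns.
ranges_to_bed <- function(df, name = ".") {
  data.frame(chrom  = as.character(df$seqnames),
             start  = as.integer(df$start) - 1L,  # BED starts are 0-based
             end    = as.integer(df$end),         # BED ends are exclusive
             name   = name,
             score  = 0L,
             strand = as.character(df$strand),
             stringsAsFactors = FALSE)
}

# Toy input with made-up coordinates; in the real workflow this could be
# something like as.data.frame(unname(exons)).
toy_exons <- data.frame(seqnames = c("seqA", "seqA"),
                        start = c(101L, 501L),
                        end = c(200L, 650L),
                        strand = c("+", "-"))
bed <- ranges_to_bed(toy_exons)
print(bed)
# write.table(bed, "exons.bed", sep = "\t", quote = FALSE,
#             row.names = FALSE, col.names = FALSE)
```

For the "unique, non-overlapping" requirement, merging overlapping exons with GenomicRanges::reduce() before the conversion may get closer to what you want, but I have not checked that against your criteria.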
As for (2), one record for the longest transcript of each protein-coding gene, this gets all transcripts grouped by gene (it does not yet pick the longest one):
transcripts <- transcriptsBy(purple.urchin.txdb, by = "gene")
transcripts <- unlist(transcripts)
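To narrow this down to the longest transcript per gene, one option is to work from a table of transcript lengths. GenomicFeatures::transcriptLengths(txdb) returns a data frame that includes gene_id, tx_name and tx_len columns, and the longest entry per gene can then be picked in base R. A minimal sketch; the helper name longest_tx_per_gene and the toy table are made up, and it does not filter for protein-coding genes:

```r
# Hypothetical helper: keep the longest transcript for each gene.
# Expects columns gene_id, tx_name and tx_len.
longest_tx_per_gene <- function(tx) {
  # Drop transcripts that are not assigned to any gene.
  tx <- tx[!is.na(tx$gene_id), ]
  # Sort by gene, then by decreasing length, and keep the first row per gene.
  tx <- tx[order(tx$gene_id, -tx$tx_len), ]
  tx[!duplicated(tx$gene_id), ]
}

# Toy table with made-up identifiers and lengths.
toy_tx <- data.frame(gene_id = c("g1", "g1", "g2", NA),
                     tx_name = c("t1", "t2", "t3", "t4"),
                     tx_len = c(900L, 1500L, 700L, 300L),
                     stringsAsFactors = FALSE)
longest <- longest_tx_per_gene(toy_tx)
print(longest)
```

The tx_name values that survive could then be used to subset the transcripts object. Ties are broken by whichever transcript sorts first.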
Maybe someone else can answer or comment with details on how to meet your exact criteria, but this is a start that you could play around with.
I do have to note that I tried to make a txdb object for mouse using the Gencode vM27 GTF file and I don't think I obtained all the elements when compared to just obtaining the txdb object from ensembl via
EDIT (Sept. 10, 2021, 17:42 EST): Never mind the information below. I checked organisms <- GenomeInfoDb::listOrganisms(), and I don't see Strongylocentrotus purpuratus on the list. Therefore, I think the information below will not work...
With that said, there may be a way to make a TxDb from Ensembl directly. It may be something like this:
purple.urchin.txdb <- makeTxDbFromEnsembl(organism = "Strongylocentrotus purpuratus", server = "ensembldb.ensembl.org", username = "anonymous", port = "3337")
and then you could continue with
I do see that ensembl does have the information for it, just not sure exactly how to input it into
This might help you find the correct server address? https://useast.ensembl.org/info/data/mysql.html