Extracting exons and transcripts from gff3/gtf
4
0
Entering edit mode
6 weeks ago
mglasena ▴ 30

I'm hoping to create the following bed files from the purple urchin genome assembly:

(1) One record for each unique, non-overlapping exon

(2) One record for the longest transcript of each protein-coding gene.

I'm also hoping to eventually create sequence alignments for each transcript with the introns removed.

Are there software programs that help with this? The UCSC genome table browser does not allow you to select only one transcript per gene.

The NCBI Annotation report says that there are 245,575 exons in coding transcripts. I am not able to hit the number using simple UNIX commands. For example:

wget -O - -o /dev/null https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/235/GCF_000002235.5_Spur_5.0/GCF_000002235.5_Spur_5.0_genomic.gff.gz | gunzip --stdout | awk '$3 == "exon"' | grep -i -e "XM_" -e "NM_" | cut -f 1,4,5 | bedtools sort -i - | uniq | wc -l 247301  assembly gtf annotation gff3 • 491 views ADD COMMENT 0 Entering edit mode You may want to check the following files from : https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/235/GCF_000002235.5_Spur_5.0/  *_cds_from_genomic.fna.gz FASTA format of the nucleotide sequences corresponding to all CDS features annotated on the assembly, based on the genome sequence. See the "Description of files" section below for details of the file format. *_rna.fna.gz file FASTA format of accessioned RNA products annotated on the genome assembly; Provided for RefSeq assemblies as relevant (Note, RNA and mRNA products are not instantiated as a separate accessioned record in GenBank but are provided for some RefSeq genomes, most notably the eukaryotes.) The FASTA title is provided as sequence accession.version plus description. *_rna_from_genomic.fna.gz FASTA format of the nucleotide sequences corresponding to all RNA features annotated on the assembly, based on the genome sequence. See the "Description of files" section below for details of the file format.  ADD REPLY 2 Entering edit mode 6 weeks ago Juke34 ★ 6.3k ADD COMMENT 0 Entering edit mode Thanks so much for sharing this! This program (AGAT) is brilliant. It can do everything I asked about in the question posting and avoids the need to write custom code. ADD REPLY 0 Entering edit mode 6 weeks ago joe ▴ 230 Depending on exactly what you want to extract, just import into R/python, pull out whatever you want, and save as bed...just make sure your questions are biological 'legit' because excluding isoforms and/or overlapping genes like you ask seems odd. For example in R; library(rtracklayer) gff.path <- "~/GCF_000002235.5_Spur_5.0_genomic.gff" gff.data <- rtracklayer::readGFF(gff.path) gff.by.gene <- split(gff.data, gff.data$gene)
gene.len.long <- lapply(gff.by.gene, function(x) x[order(x$end - x$start),][1,])

#blah...
#blah...
#blah...

0
Entering edit mode
6 weeks ago
Pratik Mehta ▴ 600

I was just doing something similar about a week ago. You may be able to accomplish this using the GenomicFeatures R package.

First load up the following in R:

library(GenomicFeatures)
library(GenomicRanges)
library(rtracklayer)


Then you will need to get the chromosome sizes file, which you can generate with directions from this post: Get chromosome sizes from fasta file (basically you need the fasta file of the genome, and then you use sam tools to get the chrominfo/chrom sizes file)

then read in that file into R with:

chrominfo <- read.table(file = 'your/file/path/sizes.genome', sep = '\t')
colnames(chrominfo) <- c("chrom", "length")


then you should be able to plug it into GenomicFeatures using (you might have to download the gtf file instead from the NCBI link you provided, because I think GenomicFeatures only supports gff3 and gtf file formats):

purple.urchin.txdb <- GenomicFeatures::makeTxDbFromGFF(organism = "Strongylocentrotus purpuratus",
format = "gtf",
file =  "~/your/path/here/GCF_000002235.5_Spur_5.0_genomic.gtf",
chrominfo = chrominfo)


and then you can get exons in bed format using (I am unsure if this follows your criteria for: (1) One record for each unique, non-overlapping exon):

exons <- exonsBy(purple.urchin.txdb, by = c("gene"))
exons <- unlist(exons)
rtracklayer::export(exons,'/your/file/path/here/exons.bed')


as for (2) One record for the longest transcript of each protein-coding gene:

transcripts <- transcriptsBy(purple.urchin.txdb, by = "gene")
transcripts <- unlist(transcripts)
rtracklayer::export(transcripts,'your/file/path/here/transcripts.bed')


Maybe someone could give an answer/comment with details on how to obtain the required criteria you need. But this is a start that maybe you could play around with.

I do have to note that I tried to make a txdb object for mouse using the Gencode vM27 GTF file and I don't think I obtained all the elements when compared to just obtaining the txdb object from ensembl via makeTxDbFromEnsembl

EDIT: Sept. 10 2021 - 17:42EST - Nevermind on the information below: I checked organisms <- GenomeInfoDb::listOrganisms(), and I don't see the Strongylocentrotus purpuratus on the list. Therefore, I think the information below will not work...

With the above being said, there may be a way to make a txdb from ensembl directly...:

It may be something like this:

purple.urchin.txdb <- makeTxDbFromEnsembl(organism = "Strongylocentrotus purpuratus", server = "ensembldb.ensembl.org", username = "anonymous", port = "3337")


and then you could continue with exons <-...

I do see that ensembl does have the information for it, just not sure exactly how to input it into makeTxDbFromEnsembl: https://metazoa.ensembl.org/Strongylocentrotus_purpuratus/Info/Index?db=core

0
Entering edit mode
6 weeks ago
vkkodali ★ 2.7k

For a Unix command solution, see: Intron and exon coordinates of all genes from Hg19