Question

Extracting exons using GenomicFeatures is different from manual extraction

1

Entering edit mode

10 months ago

gernophil ▴ 80

If I try to extract the length of all exons (also those overlapping) using the GenomicFeatures R package with this code and this gencode file

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("tables/gencode.v43.basic.annotation.gtf.gz",
                        format = "gtf")
exons.list.per.gene <- exonsBy(txdb, by = "gene")
sort(width(exons.list.per.gene)[["ENSG00000000003.15"]])

The result is

[1]   75   84   99  108  135  189  189  199  312 2878 2879

(11 exons).

If I do it manually (with cleaning the file first using gzcat gencode.v43.basic.annotation.gff3.gz | grep -v # > gencode.v43.basic.annotation.clean.gff)

library(data.table)
gencode_file_gff3 <- fread("tables/gencode.v43.basic.annotation.clean.gff",
                           header = FALSE)
gencode_file_gff3 <- gencode_file_gff3[V3 == "exon"] # filtering for exons
gencode_file_gff3[, gene_id := gsub(".*gene_id=(.*);transcript_id.*", "\\1", V9)] # extractin ENSG number
gencode_file_gff3 <- gencode_file_gff3[, c("V4", "V5", "gene_id")] # reducing to start, end and gene_id
gencode_file_gff3[, length := V5 - (V4 - 1)] # calculating length
gencode_file_gff3 <- unique(gencode_file_gff3) # removing identical exons from HAVANA and ENSEMBL
sort(gencode_file_gff3[gene_id == "ENSG00000000003.15"]$length)

The result is

[1]   75   84   99  108  135  189  199  312 2878 2879

(10 exons). Without the unique command I retain 15 exons. So both is different from the Package. I can't figure out where this additional exon comes from. The unique command is necessary if two exons from HAVANA and ENSEMBL are exactly identical. I am not able to reproduce these results manually.

This is just exemplary, but I get slightly different numbers for most genes.

R GenomicFeatures • 1.0k views

ADD COMMENT • link updated 5 months ago by alejandrogzi ▴ 120 • written 10 months ago by gernophil ▴ 80

1

Entering edit mode

In case anyone is interested, I also asked on Github an their answer was: https://github.com/Bioconductor/GenomicFeatures/issues/60 Short: GenomicFeatures uses Ensembl exon ids for filtering (ENSE number). Some Exons are identical based on their genomic location, but not on their ENSE number. So if we take all exons using GenomicFeatures, we get more then with my code. If we would only take the non overlapping parts (i.e. the unique combined genomic regions), this effect should be gone.

ADD REPLY • link 10 months ago by gernophil ▴ 80

score 0 · Answer 1 · 2023-06-26

I made a script, that completely resembles the results from GenomicFeatures:

library(data.table)

gencode_file_gff3 <- fread("gzcat tables/gencode.v43.basic.annotation.gff3.gz | grep -v '^#'")
gencode_file_gff3 <- gencode_file_gff3[V3 == "exon"]

gencode_file_gff3_id <- copy(gencode_file_gff3)
gencode_file_gff3_symbol <- copy(gencode_file_gff3)

# per gene_id
gencode_file_gff3_id[, gene_id := gsub(".*gene_id=(.*);transcript_id.*", "\\1", V9)]
gencode_file_gff3_id[, gene_id := ifelse((V1 == "chrY") &
                                           is.element(gene_id,
                                                      gencode_file_gff3_id[V1 != "chrY"]$gene_id),
                                         paste0(gene_id, "_PAR_Y"), gene_id)]
gencode_file_gff3_id[, bases := Map(seq, V4, V5)]
gencode_file_gff3_id[, count_unique_bases := uniqueN(unlist(bases)), by = gene_id] # alternatively use: gencode_file_gff3_id[, count_unique_bases := length(unique(unlist(bases))), by = gene_id]
gencode_file_gff3_id <- gencode_file_gff3_id[, c("gene_id", "count_unique_bases")]
gencode_file_gff3_id <- unique(gencode_file_gff3_id)


# per gene_symbol
gencode_file_gff3_symbol[, gene_symbol := gsub(".*gene_name=(.*);transcript_type.*", "\\1", V9)]
gencode_file_gff3_symbol[, gene_symbol := ifelse((V1 == "chrY") &
                                           is.element(gene_symbol,
                                                      gencode_file_gff3_symbol[V1 != "chrY"]$gene_symbol),
                                         paste0(gene_symbol, "_PAR_Y"), gene_symbol)]
gencode_file_gff3_symbol[, bases := Map(seq, V4, V5)]
gencode_file_gff3_symbol[, count_unique_bases := uniqueN(unlist(bases)), by = gene_symbol] # alternatively use: gencode_file_gff3_symbol[, count_unique_bases := length(unique(unlist(bases))), by = gene_symbol]
gencode_file_gff3_symbol <- gencode_file_gff3_symbol[, c("gene_symbol", "count_unique_bases")]
gencode_file_gff3_symbol <- unique(gencode_file_gff3_symbol)

if you use a different features then exon from the GENCODE file, you might need to edit the parts with gsub to get the correct info.

score 0 · Answer 2 · 2023-11-01

Hi! in case anyone ends here looking to calculate non-overlapping exon lengths, I recently develop this: https://github.com/alejandrogzi/noel, an extremely fast non-overlapping exon length calculator written in Rust. Outperforms all the script/tools, check the tool post here: NOEL: An extremely fast Non-Overlapping Exon Length calculator written in Rust