Question: Different results from biomaRt query and search in GTF
lizaveta wrote (7 weeks ago):

I need all the exon coordinates for a specific gene transcript (say, "ENST00000269305"). I tried two different methods:

  1. using getBM from biomaRt

    library("biomaRt")
    human = useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl",host="grch37.ensembl.org", path="/biomart/martservice",ensemblRedirect = FALSE)
    transcript_info = data.frame(unique(getBM(attributes = c("chromosome_name", "genomic_coding_start","genomic_coding_end"), 
                               filters="ensembl_transcript_id", values="ENST00000269305",mart = human)))
    transcript_info[order(transcript_info$genomic_coding_start),]

       chromosome_name genomic_coding_start genomic_coding_end
    11              17              7572927            7573008
    10              17              7573927            7574033
    7               17              7576853            7576926
    6               17              7577019            7577155
    5               17              7577499            7577608
    4               17              7578177            7578289
    3               17              7578371            7578554
    2               17              7579312            7579590
    1               17              7579700            7579721
    9               17              7579839            7579912
    8               17                   NA                 NA

  2. my own code, which uses a GTF file (downloaded from Ensembl and filtered to keep only protein-coding genes)

    library(rtracklayer)
    library(GenomicRanges)
    gtf = import.gff("~/Documents/data/grch37/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.ProteinCodingGenes_filtered.gtf")
    gtf_exons <- gtf[gtf$type == "exon"]
    transcript_info = gtf_exons[gtf_exons$transcript_id == "ENST00000269305"]
    transcript_info = data.frame(transcript_info[,c()])
    transcript_info = transcript_info[,c(1:3)]
    colnames(transcript_info) = c("chromosome_name", "genomic_coding_start","genomic_coding_end")
    transcript_info[order(transcript_info$genomic_coding_start),]

       chromosome_name genomic_coding_start genomic_coding_end
    11              17              7571720            7573008
    10              17              7573927            7574033
    9               17              7576853            7576926
    8               17              7577019            7577155
    7               17              7577499            7577608
    6               17              7578177            7578289
    5               17              7578371            7578554
    4               17              7579312            7579590
    3               17              7579700            7579721
    2               17              7579839            7579940
    1               17              7590695            7590856

Surprisingly, the two methods give different results, although some exons are identical in both. A manual check in the UCSC Genome Browser (hg19) supports the second script. Why does biomaRt fail to return some of the exons?

Emily_Ensembl (EMBL-EBI) wrote (7 weeks ago):

The two results differ in three exons:

  • ENSE00001146308, Exon 1, NA (BioMart) / 7590695 - 7590856 (GTF)
  • ENSE00002667911, Exon 2, 7579839 - 7579912 (BioMart) / 7579839 - 7579940 (GTF)
  • ENSE00003605891, Exon 11, 7572927 - 7573008 (BioMart) / 7571720 - 7573008 (GTF)

With BioMart, you're getting the coding regions of the exons, whereas you're just getting the coordinates of the whole exons from the GTF. If you look at the transcript you'll see that the 5' UTR spans all of exon 1 (so no coding region) and the first third of exon 2 (so the coding region is different to the region of the whole exon). Most of exon 11 is 3' UTR (again, different coordinates for the coding region compared to the whole exon).
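
The relationship between the two result sets can be checked with a few lines of base R. This is only a sketch: the exon coordinates are copied from the GTF output in the question, and the CDS boundaries (`cds_lo`, `cds_hi`) are assumed to be the smallest coding start and largest coding end in the BioMart output. Clipping each whole exon to that span is an illustration of the explanation above, not a description of how BioMart computes its attributes.

```r
# Whole-exon coordinates for ENST00000269305, copied from the GTF output
# in the question (exon number = row name in that output).
exons <- data.frame(
  exon  = 1:11,
  start = c(7590695, 7579839, 7579700, 7579312, 7578371, 7578177,
            7577499, 7577019, 7576853, 7573927, 7571720),
  end   = c(7590856, 7579940, 7579721, 7579590, 7578554, 7578289,
            7577608, 7577155, 7576926, 7574033, 7573008)
)

# Assumed CDS span: min coding start and max coding end from the BioMart output.
cds_lo <- 7572927
cds_hi <- 7579912

# Clip every exon to the CDS span.
coding_start <- pmax(exons$start, cds_lo)
coding_end   <- pmin(exons$end, cds_hi)

# Exons lying entirely in a UTR have no coding part: report NA, as BioMart does.
noncoding <- coding_start > coding_end
coding_start[noncoding] <- NA
coding_end[noncoding]   <- NA

coding <- data.frame(exon = exons$exon, coding_start, coding_end)
print(coding)
```

Only exons 1, 2 and 11 change; the other eight exons are fully coding, so their coordinates are the same in both outputs.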


Thank you for the explanation! I don't want to depend on external databases (I've had a lot of trouble fetching coordinates for many genes via getBM because of frequent Ensembl mirror shutdowns), so I think I should use the CDS + stop_codon features from the GTF file to get the same set of coordinates that Ensembl provides.

Reply by lizaveta, 7 weeks ago
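
A minimal base-R sketch of the CDS + stop_codon idea follows. It assumes the usual Ensembl GTF convention in which the stop codon is a separate `stop_codon` feature directly adjacent to the last `CDS` feature, so the two have to be merged into one interval. The helper name `merge_adjacent` and the toy coordinates are made up for illustration; they are not taken from the real annotation.

```r
# Merge overlapping or directly adjacent intervals (hypothetical helper).
merge_adjacent <- function(start, end) {
  if (length(start) == 0) {
    return(data.frame(start = integer(0), end = integer(0)))
  }
  o <- order(start)
  start <- start[o]
  end <- end[o]
  # Largest end seen so far; a new group begins when a gap of >= 1 bp appears.
  running_end <- cummax(end)
  new_group <- c(TRUE, start[-1] > head(running_end, -1) + 1)
  grp <- cumsum(new_group)
  data.frame(
    start = as.vector(tapply(start, grp, min)),
    end   = as.vector(tapply(end, grp, max))
  )
}

# Toy features for one minus-strand transcript (made-up coordinates):
# two CDS segments plus a stop codon just below the lower CDS segment.
features <- data.frame(
  type  = c("CDS", "CDS", "stop_codon", "exon"),
  start = c(100, 300, 97, 50),
  end   = c(200, 400, 99, 200)
)

keep <- features$type %in% c("CDS", "stop_codon")
coding <- merge_adjacent(features$start[keep], features$end[keep])
print(coding)
```

With the GRanges object from the question, the same merge can be done by subsetting on `gtf$type %in% c("CDS", "stop_codon")` and the transcript ID, then calling `reduce()` from GenomicRanges, which joins adjacent ranges by default.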

You can also use the ensembldb package (https://bioconductor.org/packages/release/bioc/html/ensembldb.html), which lets you work with an offline copy of the Ensembl data (so no network issues) in a database format that you can run all sorts of queries against.

Reply by Mike Smith, 7 weeks ago
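
As a rough sketch of what that could look like for the transcript in the question: the code below assumes the Bioconductor annotation package EnsDb.Hsapiens.v75 is installed. That package is an assumption here; it holds Ensembl release 75 on GRCh37, which may differ in detail from the release 87 GTF used above. The code is guarded so that it does nothing when the package is missing.

```r
tx_id <- "ENST00000269305"

# Run the query only if the (assumed) offline annotation package is available.
if (requireNamespace("EnsDb.Hsapiens.v75", quietly = TRUE)) {
  library(ensembldb)
  edb <- EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75

  # Whole exons of the transcript (comparable to the GTF "exon" features).
  exons <- exonsBy(edb, by = "tx", filter = TxIdFilter(tx_id))
  print(exons[[tx_id]])

  # Coding parts of the exons (comparable to the BioMart coding coordinates).
  cds <- cdsBy(edb, by = "tx", filter = TxIdFilter(tx_id))
  print(cds[[tx_id]])
}
```

Whether the ranges returned by `cdsBy()` include the stop codon is worth checking against the BioMart output before relying on them.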