Question: Bulk download introns, exons, and UTR regions from Ensembl for gene prediction training set
2.0 years ago by
Katherine Huang0 wrote:

Hi, I would like to download labeled FASTA sequences of introns, exons, 5' UTR regions, and 3' UTR regions from a nonredundant set of human genes.

Ensembl allows me to do this for an individual gene by going to a page on an individual transcript variant (;g=ENSG00000139618;r=13:32889611-32973805;t=ENST00000544455) and clicking Download Sequence > FASTA.

Is there a way to automatically download a file like that for several thousand genes? I would like them all to be human (or at least mammalian) and protein-coding. BioMart seems to be down right now. I'm willing to try the Perl, REST, or SQL APIs, but I have no experience with any of them, so some direction would be appreciated.

Ultimately I want a database of DNA sequences labeled as intron, exon, 5' UTR, or 3' UTR. If other databases (e.g. RefSeq) can provide it, that would be great too. Thanks!

2.0 years ago by
United States
Brian Gudenas90 wrote:

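Check out the biomaRt R package, specifically the getSequence function, which takes a list of gene identifiers (Ensembl or Entrez Gene) and retrieves the sequences of interest. You pick the region with the seqType parameter (cdna, 3utr, 5utr, gene_exon, gene_exon_intron, etc.) and name the kind of identifier you are passing with the type parameter. As far as I can tell from the documentation there is no intron-only seqType; gene_exon_intron returns the full unspliced gene, so intron sequences would need to be derived from it using exon coordinates.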

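library(biomaRt)

# Connect to the human gene dataset in Ensembl BioMart
mart = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Gene IDs must be quoted character strings
Ensembl_IDs = c("ENSG00000139618", "ENSG00000128731")

# type tells getSequence what kind of identifier is supplied in id
seqs = biomaRt::getSequence(id = Ensembl_IDs,
           type = "ensembl_gene_id",
           seqType = "gene_exon",
           mart = mart)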
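
To get from a couple of IDs to a labeled training set, something along these lines might work. It is only a sketch under a few assumptions: the helper names (label_sequences, to_fasta, build_training_set) and the "gene_id|region" header format are made up for illustration, and I am assuming getSequence returns one column named after the seqType plus an ensembl_gene_id column, with "Sequence unavailable" where a transcript has no annotated UTR. The gene list comes from getBM with the biotype filter set to protein_coding. Introns are left out because I am not certain biomaRt offers an intron-only seqType; they could be derived from the unspliced gene sequence and exon coordinates.

```r
# Hypothetical helper: tag retrieved sequences with a region label.
# `seqs` is a data frame shaped like getSequence() output, where the
# sequence column is named after the requested seqType (assumption).
label_sequences <- function(seqs, seq_col, label) {
  data.frame(
    gene_id  = seqs$ensembl_gene_id,
    region   = rep(label, nrow(seqs)),
    sequence = seqs[[seq_col]],
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: turn labeled rows into FASTA lines,
# alternating a ">gene_id|region" header with its sequence.
to_fasta <- function(labeled) {
  headers <- paste0(">", labeled$gene_id, "|", labeled$region)
  as.vector(rbind(headers, labeled$sequence))
}

# Sketch of the full workflow. Needs the biomaRt package and network
# access to Ensembl BioMart, so it is defined here but not called.
build_training_set <- function(n_genes = 1000) {
  mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")

  # Restrict to protein-coding genes
  genes <- biomaRt::getBM(attributes = "ensembl_gene_id",
                          filters = "biotype",
                          values = "protein_coding",
                          mart = mart)
  ids <- head(genes$ensembl_gene_id, n_genes)

  # seqType value -> label used in the output (labels are my own choice)
  region_labels <- c("5utr" = "5_prime_UTR",
                     "3utr" = "3_prime_UTR",
                     "gene_exon" = "exon")

  parts <- lapply(names(region_labels), function(st) {
    seqs <- biomaRt::getSequence(id = ids, type = "ensembl_gene_id",
                                 seqType = st, mart = mart)
    # Drop placeholder rows for transcripts lacking this region
    seqs <- seqs[seqs[[st]] != "Sequence unavailable", , drop = FALSE]
    label_sequences(seqs, st, region_labels[[st]])
  })

  # Several transcripts of one gene can share the same region sequence
  unique(do.call(rbind, parts))
}

# Usage (not run):
# training <- build_training_set(n_genes = 2000)
# writeLines(to_fasta(training), "labeled_regions.fa")
```

For several thousand genes it is probably safer to send the IDs in smaller batches, since large BioMart queries can time out.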