How to extract the amino acid sequence for each exon using biomaRt
3 months ago
kng ▴ 10

I am trying to extract peptide sequences for each exon for my gene using biomaRt. I managed to extract DNA sequences for each exon but struggled to extract amino acid sequences for each exon separately. Below is the code I used in R. Please advise!

library(biomaRt)
ensembl <- useMart("ensembl")
human_ensembl <- useDataset("hsapiens_gene_ensembl", ensembl)
ensemble_data <- getBM(
  attributes = c("ensembl_transcript_id", "gene_exon", "ensembl_exon_id",
                 "exon_chrom_start", "exon_chrom_end", "rank", "strand", "peptide"),
  filters = "ensembl_transcript_id",
  values = "ENST00000380152",
  mart = human_ensembl)


Hi! What are you trying to achieve, and how are you planning to handle the peptide sequences that span the junction between two exons of the same protein? I'm not sure of your goal, but it might make more sense to extract the full protein sequence for ENST00000380152 and run a sliding window of your desired peptide length over it.
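For reference, the sliding-window idea could look roughly like the sketch below. It uses base R only; the function name sliding_peptides and the toy sequence are made up for illustration, and in practice the input would be the full peptide retrieved for the transcript.

```r
# Return every window of length k along a protein sequence (base R only).
# `sliding_peptides` is a hypothetical helper name, not part of biomaRt.
sliding_peptides <- function(protein, k) {
  n <- nchar(protein)
  if (k > n) return(character(0))      # no window fits
  starts <- seq_len(n - k + 1)         # window start positions, 1-based
  substring(protein, starts, starts + k - 1)
}

# Toy sequence, not a real protein
windows <- sliding_peptides("MKTAYIAK", 3)
```

This yields fixed-length peptides only, so it does not respect exon boundaries on its own.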


Hi iraun, thank you for your reply. If a residue spans the junction between two exons, it can be counted as part of both exons. But I need the peptide sequences for each exon separately. If I remove "peptide" from the attribute list in the code above, I get the DNA sequences of the 27 exons of ENST00000380152, and each has a different length, so the sliding window approach you suggested on the full-length protein sequence would not help. Is there a smart way to extract these using biomaRt or some other tool?


Maybe this helps a bit:

transcriptID <- "ENST00000380152"

cdsAnnot <- getBM(
  attributes = c("ensembl_transcript_id", "ensembl_peptide_id", "strand",
                 "gene_biotype", "cds_start", "cds_end", "ensembl_exon_id",
                 "exon_chrom_start", "exon_chrom_end"),
  filters = "ensembl_transcript_id",
  values = transcriptID,
  mart = human_ensembl)

peptide_seqs <- getSequence(
  id = cdsAnnot[, "ensembl_exon_id"],
  type = "ensembl_exon_id",
  seqType = "peptide",
  mart = human_ensembl)


However, it does not give you the sequence for each exon separately, only the combined sequence for those exons that are part of an actual translated transcript.

If you want to translate any exon no matter what, then Biostrings::translate() will probably be of use, if you can somehow keep the reading frame in sync with the help of the exon coordinates in cdsAnnot.
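A different route that avoids translating each exon on its own is to slice the full-length peptide using the CDS coordinates, which keeps the reading frame in sync automatically. The sketch below is base R only and rests on an assumption: that cds_start and cds_end are 1-based positions of each exon's coding part within the CDS, with NA for non-coding exons. The function name split_peptide_by_exon and the toy data are hypothetical. A codon split across a junction is assigned to both neighbouring exons.

```r
# Slice a full-length peptide into per-exon pieces from CDS coordinates.
# Assumption: cds_start / cds_end are 1-based positions within the CDS,
# NA for exons with no coding sequence (e.g. pure UTR exons).
split_peptide_by_exon <- function(peptide, exon_id, cds_start, cds_end) {
  coding <- !is.na(cds_start) & !is.na(cds_end)
  # Codon k covers CDS bases 3k-2 .. 3k, so base b lies in codon ceiling(b / 3)
  aa_start <- ceiling(cds_start / 3)
  aa_end   <- ceiling(cds_end / 3)
  pieces <- ifelse(coding, substring(peptide, aa_start, aa_end), NA_character_)
  data.frame(exon_id, aa_start, aa_end, peptide = pieces,
             stringsAsFactors = FALSE)
}

# Toy data (not real Ensembl values): a 24-base CDS over three coding exons
# plus one non-coding exon
res <- split_peptide_by_exon(
  peptide   = "MKTAYIAK",
  exon_id   = c("E1", "E2", "E3", "E4"),
  cds_start = c(1, 8, NA, 16),
  cds_end   = c(7, 15, NA, 24))
```

With real data, the peptide would come from a separate sequence query for the transcript and the coordinates from the cds_start and cds_end columns of cdsAnnot. If a shared junction codon should belong to only one exon, the rounding of aa_start or aa_end would need to change.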


Hi kng,

I would advise using the Ensembl REST API to get the exon sequence data: http://rest.ensembl.org/

You can use the Lookup endpoint with the 'expand' optional parameter to get the IDs of the exons (ENSE#) given a transcript ID (ENST#). Then use the Sequence endpoint to retrieve the protein sequences for each of the exons.
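As a rough sketch of those two steps in R: the helper below only builds the request URLs, and the two fetch functions are illustrative wrappers that need the jsonlite package and network access. The helper names (rest_url, fetch_exon_ids, fetch_exon_protein) are hypothetical, and the response field names Exon, id and seq are assumptions about the JSON layout. Whether the Sequence endpoint returns a protein sequence for an exon ID is taken from the advice above and has not been verified here.

```r
# Build an Ensembl REST URL; `rest_url` is a hypothetical helper name.
rest_url <- function(endpoint, id, params = list(),
                     server = "https://rest.ensembl.org") {
  params[["content-type"]] <- "application/json"   # ask for JSON output
  query <- paste0(names(params), "=", unlist(params), collapse = ";")
  paste0(server, "/", endpoint, "/", id, "?", query)
}

# Step 1: exon IDs of a transcript via the Lookup endpoint with expand=1.
# Assumes the parsed response has an `Exon` table with an `id` column.
fetch_exon_ids <- function(transcript_id) {
  tx <- jsonlite::fromJSON(rest_url("lookup/id", transcript_id, list(expand = 1)))
  tx$Exon$id
}

# Step 2: sequence for one exon via the Sequence endpoint.
# Assumes the response has a `seq` field and that type=protein is accepted here.
fetch_exon_protein <- function(exon_id) {
  res <- jsonlite::fromJSON(rest_url("sequence/id", exon_id, list(type = "protein")))
  res$seq
}

lookup_url <- rest_url("lookup/id", "ENST00000380152", list(expand = 1))
```

Calling fetch_exon_ids("ENST00000380152") and then applying fetch_exon_protein to each returned ID would chain the two steps together.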


Are there more options for customizing the response when using the API directly instead of BioMart?


Hi Matthias, the data returned by each REST API endpoint can be customised using its optional parameters. The parameters available for each endpoint are listed on its documentation page, for example: http://rest.ensembl.org/documentation/info/lookup

When using the REST API, the idea is that you can write scripts (in any language) around the REST API endpoints to pull out specific bits of the output, process it in custom ways and feed it into other platforms.

You can find out more in our online course: https://www.ebi.ac.uk/training/online/courses/ensembl-rest-api/