Question: Problem translating ensembl DNA sequence to Protein based on Start location
10 weeks ago, niradsp wrote:

I downloaded a dataset from Ensembl BioMart, using the following page: https://www.ensembl.org/biomart/martview/

The dataset used was Ensembl Human Genes 97. I selected the following attributes and downloaded the file: Gene stable ID, Transcript stable ID, Protein stable ID, Gene name, Chromosome/scaffold name, Gene description, Strand, Exon region start (bp), Exon region end (bp), Genomic coding start, Genomic coding end, Exon rank in transcript.

I also downloaded the Ensembl genome from the link below: ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/

I am going to focus on chromosome 19. The protein ID that I am interested in is ENSP00000496230 (transcript ID ENST00000591545). The start location is 10984278 (the location given is 10984279, but I subtracted one from all start locations because my code needs to match RefSeq indexing).
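
To make the effect of subtracting one concrete, here is a minimal sketch on a toy sequence. It relies on Ensembl coordinates being 1-based and inclusive, which is also how R indexes strings and vectors. The sequence and positions are hypothetical values chosen for illustration, not taken from chromosome 19.

```r
# Toy "chromosome" (hypothetical); the bases ATGGCC sit at 1-based positions 4..9.
chrom <- "GGGATGGCCTAAGG"
cds_start <- 4
cds_end <- 9

# substr() is 1-based and inclusive, so the coordinates can be used as given.
as_given <- substr(chrom, cds_start, cds_end)

# Subtracting one (as for a 0-based index) moves the window one base upstream.
shifted <- substr(chrom, cds_start - 1, cds_end - 1)

print(as_given)
print(shifted)
```

In this toy case the unshifted window returns the intended bases, while the shifted window starts one base early and is therefore in a different reading frame.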

I used the following code to translate the retrieved DNA sequence into protein.

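library(Biostrings)
library(biomaRt)
# Read the chromosome 19 FASTA file as a DNAStringSet
dat <- readDNAStringSet('NiradData/Testing/chr19.fa.cmp1')
# Split the chromosome into single bases (R vectors are 1-indexed)
splitup <- unlist(strsplit(as.data.frame(dat)$x, ''))

# Start location after subtracting one
start <- 10984278
# start:(start + 95) selects 96 bases, i.e. 32 codons
seq <- splitup[start:(start + 95)]
seq <- paste(seq, collapse = '')
translate(DNAStringSet(seq))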
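
As an aside, the same extraction can probably be written without splitting the whole chromosome into single characters. The sketch below uses `subseq()` from Biostrings, which takes 1-based, inclusive `start` and `end` values. The toy sequence and coordinates are assumptions for illustration; with the real data the first argument would be something like `dat[[1]]`.

```r
library(Biostrings)

# Toy stand-in for the chromosome (hypothetical sequence).
chrom <- DNAString("GGGATGGCCTAAGG")

# subseq() uses 1-based, inclusive coordinates; positions 4..12 hold ATGGCCTAA.
cds <- subseq(chrom, start = 4, end = 12)

# Translate the extracted region with the standard genetic code.
protein <- translate(cds)
print(as.character(protein))
```

Working on the `DNAString` directly avoids building a character vector with one element per base, which is large for a full chromosome.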

I ran BLASTP on the translated sequence above and got a good match: BLAST found the correct protein.

Now, let's look at a different sequence.
The IDs for this sequence are ENST00000595661 and ENSP00000472808. The start location is 50031082 (again, after subtracting one). The following code does not work:

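# Start location after subtracting one
start <- 50031082
# start:(start + 9) selects 10 bases (3 codons plus one extra base)
seq <- splitup[start:(start + 9)]
seq <- paste(seq, collapse = '')
translate(DNAStringSet(seq))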

Although this is only a short stretch (the range covers 10 bases), the translation was gibberish. However, if I add one to the start index, the following code produces the correct sequence:

start <- 50031082
# (start + 1):(start + 9) selects 9 bases, i.e. 3 codons
seq <- splitup[(start + 1):(start + 9)]
seq <- paste(seq, collapse = '')
translate(DNAStringSet(seq))
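
Note that the two index ranges used for this transcript differ in length as well as in offset, because `a:b` in R includes both endpoints. A quick base R check of the arithmetic (the variable names `n_first` and `n_second` are just for illustration):

```r
start <- 50031082

# An inclusive range a:b contains b - a + 1 elements.
n_first <- length(start:(start + 9))         # range used in the failing attempt
n_second <- length((start + 1):(start + 9))  # range used in the working attempt

print(c(n_first, n_second))
print(c(n_first %% 3, n_second %% 3))  # remainder after dividing into codons
```

Only the second range is a whole number of codons, so a window meant to cover `k` codons from position `p` would be `p:(p + 3 * k - 1)`.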

As you can see above, the transcript starting at location 10984278 did not require adding one, but the one starting at 50031082 did. Is there something I am missing here? Why does one sequence require adding one while the other does not?
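
One thing that may be worth checking when comparing the two transcripts is the Strand column from the BioMart export, since a minus-strand coding region has to be reverse-complemented before translation. The sketch below is only an illustration under that assumption, not a confirmed explanation of the discrepancy. `extract_cds()` is a hypothetical helper (not a Biostrings function), and the toy sequence and coordinates are made up.

```r
library(Biostrings)

# Hypothetical helper: pull a 1-based, inclusive genomic range and
# reverse-complement it when the feature lies on the minus strand (-1).
extract_cds <- function(chrom, coding_start, coding_end, strand) {
  region <- subseq(chrom, start = coding_start, end = coding_end)
  if (strand == -1) {
    region <- reverseComplement(region)
  }
  region
}

# Toy chromosome (hypothetical): positions 4..12 hold TTAGGCCAT,
# which is the reverse complement of ATGGCCTAA.
chrom <- DNAString("GGGTTAGGCCATGG")

minus_cds <- extract_cds(chrom, 4, 12, strand = -1)
minus_protein <- translate(minus_cds)
print(as.character(minus_cds))
print(as.character(minus_protein))
```

For a minus-strand transcript, translation begins at the genomic coding end rather than the genomic coding start, so the two transcripts would need different handling if their Strand values differ.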

Edit: I just downloaded data from NCBI for chromosome 19 from the link below: ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/

I see the same issue: the first sequence matches the protein at the given start location, while the second needs 1 added to the start index to produce the correct peptide sequence.
