I downloaded a dataset from Ensembl BioMart, from the following webpage: https://www.ensembl.org/biomart/martview/
The dataset used was Ensembl Human Genes 97. I selected the following attributes and downloaded the file: Gene stable ID, Transcript stable ID, Protein stable ID, Gene name, Chromosome/scaffold name, Gene description, Strand, Exon region start (bp), Exon region end (bp), Genomic coding start, Genomic coding end, Exon rank in transcript.
I also downloaded the Ensembl genome from the link below: ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
I am going to focus on chromosome 19. The protein ID that I am interested in is ENSP00000496230 (transcript ID ENST00000591545). The start location is 10984278 (the location given is 10984279, but I subtracted one from all start locations, because my code requires that I match the RefSeq index).
I used the following code to translate the retrieved DNA sequence into a protein sequence.
library(Biostrings)
library(biomaRt)
dat<-readDNAStringSet('NiradData/Testing/chr19.fa.cmp1')
# split the chromosome sequence into single bases, then set the start position
splitup<-unlist(strsplit(as.data.frame(dat)$x,''))
start=10984278
seq=splitup[(start):(start+95)]
seq=paste(seq,collapse='')
translate(DNAStringSet(seq))
I ran BLASTP on the translated sequence above and got a good match: BLAST found the correct protein.
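As a side note, the extraction step can be sketched without splitting the whole chromosome into single characters. The snippet below is only an illustration under stated assumptions: `extract_window` is a hypothetical helper (not part of Biostrings), the toy string stands in for the real chr19 sequence, and positions are treated as 1-based, which is how R indexes strings.

```r
# Hypothetical helper: return `width` bases starting at 1-based position
# `start` from a single character string.
extract_window <- function(chr_seq, start, width) {
  stopifnot(start >= 1, width >= 1, start + width - 1 <= nchar(chr_seq))
  substr(chr_seq, start, start + width - 1)
}

# Toy sequence standing in for the chromosome (not real chr19 data).
toy <- "CCATGGCCAAATAGGG"

# Take 12 bases (4 codons) starting at position 3.
win <- extract_window(toy, start = 3, width = 12)
print(win)

# TRUE when the window holds a whole number of codons.
print(nchar(win) %% 3 == 0)
```

With the real data, `subseq(dat[[1]], start = ..., width = ...)` from Biostrings should give the same kind of window directly as a DNAString, which can then be passed to `translate()`.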
Now, let's look at a different sequence.
The IDs for this sequence are ENST00000595661 and ENSP00000472808. The start location is 50031082 (again, after subtracting one). The following code does not give the expected result:
start=50031082
seq=splitup[(start):(start+9)]
seq=paste(seq,collapse='')
translate(DNAStringSet(seq))
Although this is only a short stretch of bases, the translation is gibberish. However, if I add one to the start location, the following code produces the correct sequence:
start=50031082
seq=splitup[(start+1):(start+9)]
seq=paste(seq,collapse='')
translate(DNAStringSet(seq))
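One thing that can be checked without the genome file is how many bases each index range selects, since `a:b` in R includes both endpoints. The sketch below only counts positions for the three ranges used in this post. It does not by itself explain the shift, but it shows that the first attempt for the second transcript picks up 10 bases rather than 9.

```r
# Ranges used for the first transcript (start = 10984278).
start1 <- 10984278
idx_first_transcript <- start1:(start1 + 95)

# Ranges used for the second transcript (start = 50031082).
start2 <- 50031082
idx_attempt1 <- start2:(start2 + 9)        # first attempt
idx_attempt2 <- (start2 + 1):(start2 + 9)  # attempt with one added

# Number of bases selected by each range.
print(length(idx_first_transcript))  # 96
print(length(idx_attempt1))          # 10
print(length(idx_attempt2))          # 9

# Remainder after dividing by the codon length; 0 means whole codons.
print(length(idx_first_transcript) %% 3)  # 0
print(length(idx_attempt1) %% 3)          # 1
print(length(idx_attempt2) %% 3)          # 0
```

Keeping the window width a multiple of three makes the two start positions easier to compare, because any difference in the translation then comes from the start coordinate alone.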
As you can see above, the transcript starting at location 10984278 did not require adding one, but the one starting at 50031082 did. Is there something I am missing here? Why does one sequence require adding one to the start, while the other does not?
Edit: I just downloaded data from NCBI for chromosome 19 from the link below: ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
I see the same issue: the first sequence matches the protein at the given start location, while the second sequence needs 1 added to the start index in order to produce the correct peptide sequence.