Question: Match peptides to protein sequences and report their positions in R
Marie (Germany) wrote, 3.5 years ago:

Hello,

I was wondering if there is a convenient way to figure out the sequence coverage of a given protein with a list of peptides.

For example I have this protein:

>sp|O00330|ODPX_HUMAN Pyruvate dehydrogenase protein X component, mitochondrial OS=Homo sapiens GN=PDHX PE=1 SV=3
MAASWRLGCDPRLLRYLVGFPGRRSVGLVKGALGWSVSRGANWRWFHSTQWLRGDPIKIL
MPSLSPTMEEGNIVKWLKKEGEAVSAGDALCEIETDKAVVTLDASDDGILAKIVVEEGSK
NIRLGSLIGLIVEEGEDWKHVEIPKDVGPPPPVSKPSEPRPSPEPQISIPVKKEHIPGTL
RFRLSPAARNILEKHSLDASQGTATGPRGIFTKEDALKLVQLKQTGKITESRPTPAPTAT
PTAPSPLQATAGPSYPRPVIPPVSTPGQPNAVGTFTEIPASNIRRVIAKRLTESKSTVPH
AYATADCDLGAVLKVRQDLVKDDIKVSVNDFIIKAAAVTLKQMPDVNVSWDGEGPKQLPF
IDISVAVATDKGLLTPIIKDAAAKGIQEIADSVKALSKKARDGKLLPEEYQGGSFSISNL
GMFGIDEFTAVINPPQACILAVGRFRPVLKLTEDEEGNAKLQQRQLITVTMSSDSRVVDD
ELATRFLKSFKANLENPIRLA

And the following peptides:

HSLDASQGTATGPR
STVPHAYATADCDLGAVLK
VVDDELATR

Can R report the start and end position of each peptide within the protein of interest and write the results to a new file?

Is it also possible to get the FASTA sequence directly from UniProt?

I need to do this for many proteins and peptides, so I can't do it manually.

Thanks a lot!

sequence R • 2.3k views
Steve Lianoglou (US) wrote, 3.5 years ago:

The Biostrings package has many facilities for pattern/string matching. The Multiple Alignments and Pairwise Sequence Alignments vignettes would be useful to read through as a starting point.
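For exact peptide matches, a minimal sketch in base R (deliberately not Biostrings, so it runs without Bioconductor) might look like the following. The helper name `find_peptides` and the output file name `peptide_positions.tsv` are made up for illustration, and the sketch assumes each peptide matches the protein sequence character for character; `gregexpr` also reports only non-overlapping occurrences of the same peptide.

```r
# The protein sequence from the question, pasted into a single string
protein <- paste0(
  "MAASWRLGCDPRLLRYLVGFPGRRSVGLVKGALGWSVSRGANWRWFHSTQWLRGDPIKIL",
  "MPSLSPTMEEGNIVKWLKKEGEAVSAGDALCEIETDKAVVTLDASDDGILAKIVVEEGSK",
  "NIRLGSLIGLIVEEGEDWKHVEIPKDVGPPPPVSKPSEPRPSPEPQISIPVKKEHIPGTL",
  "RFRLSPAARNILEKHSLDASQGTATGPRGIFTKEDALKLVQLKQTGKITESRPTPAPTAT",
  "PTAPSPLQATAGPSYPRPVIPPVSTPGQPNAVGTFTEIPASNIRRVIAKRLTESKSTVPH",
  "AYATADCDLGAVLKVRQDLVKDDIKVSVNDFIIKAAAVTLKQMPDVNVSWDGEGPKQLPF",
  "IDISVAVATDKGLLTPIIKDAAAKGIQEIADSVKALSKKARDGKLLPEEYQGGSFSISNL",
  "GMFGIDEFTAVINPPQACILAVGRFRPVLKLTEDEEGNAKLQQRQLITVTMSSDSRVVDD",
  "ELATRFLKSFKANLENPIRLA"
)
peptides <- c("HSLDASQGTATGPR", "STVPHAYATADCDLGAVLK", "VVDDELATR")

# Hypothetical helper: one row per exact occurrence of each peptide,
# or a single NA row when the peptide is not found
find_peptides <- function(peptides, protein) {
  hits <- lapply(peptides, function(pep) {
    pos <- gregexpr(pep, protein, fixed = TRUE)[[1]]
    if (pos[1] == -1) {
      return(data.frame(peptide = pep, start = NA_integer_,
                        end = NA_integer_, stringsAsFactors = FALSE))
    }
    starts <- as.integer(pos)
    data.frame(peptide = pep, start = starts,
               end = starts + nchar(pep) - 1L, stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

coverage <- find_peptides(peptides, protein)

# Write the 1-based start/end positions to a tab-separated file
write.table(coverage, "peptide_positions.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)

# Sequence coverage: fraction of residues covered by at least one peptide
covered <- logical(nchar(protein))
for (i in which(!is.na(coverage$start))) {
  covered[coverage$start[i]:coverage$end[i]] <- TRUE
}
coverage_fraction <- mean(covered)
```

For many proteins, the same helper could be applied per protein with `lapply` and the results combined with `rbind`. If peptides may contain ambiguity codes or mismatches, the Biostrings matching functions are the better fit.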

As for getting sequences straight from UniProt, the answer to that is also yes. The UniProt website has a section on accessing its resources programmatically, which you should read through.
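As a rough sketch of what fetching a FASTA record could look like in base R: the URL pattern below (`https://rest.uniprot.org/uniprotkb/<accession>.fasta`) is an assumption about UniProt's current REST layout and should be checked against their programmatic access documentation. The function names `uniprot_fasta_url`, `parse_fasta`, and `fetch_uniprot_fasta` are made up for illustration, and the parser assumes well-formed FASTA in which every header is followed by at least one sequence line.

```r
# Assumed URL pattern for a single UniProtKB entry in FASTA format
uniprot_fasta_url <- function(accession) {
  paste0("https://rest.uniprot.org/uniprotkb/", accession, ".fasta")
}

# Turn FASTA lines into a named character vector (header -> sequence)
parse_fasta <- function(lines) {
  lines <- lines[nzchar(lines)]            # drop empty lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)              # record index for every line
  body <- split(lines[!is_header], record[!is_header])
  seqs <- vapply(body, paste, character(1), collapse = "")
  headers <- sub("^>", "", lines[is_header])
  names(seqs) <- headers[as.integer(names(body))]
  seqs
}

# Download and parse one entry (needs network access, so it is only
# defined here, not called)
fetch_uniprot_fasta <- function(accession) {
  parse_fasta(readLines(uniprot_fasta_url(accession), warn = FALSE))
}

# Usage (not run):
# odpx <- fetch_uniprot_fasta("O00330")
# nchar(odpx)
```

Looping `fetch_uniprot_fasta` over a vector of accessions with `lapply` would cover the many-proteins case; adding a short pause between requests is a reasonable courtesy to the server.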

There are many ways to interact with a web service in R. I recently used UniProt's web service for ID cross-referencing, using the httr package.

Since you want to fetch FASTA/peptide sequences, your queries won't look exactly like this, but the snippet below, which maps a set of Entrez Gene IDs to UniProt accession numbers, should help you get started:

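library(httr)

# Entrez Gene IDs to translate into UniProt accessions
entrez <- c('51692', '1478', '26986')

# Form fields for the mapping service: source ID type, target ID type,
# tab-separated output, and the IDs as one space-separated string
params <- list(
  from='P_ENTREZGENEID',
  to='ACC',
  format='tab',
  query=paste(entrez, collapse=' '))

# Submit the request as a multipart form POST
# (this is the mapping URL as it existed when this answer was written)
response <- POST('http://www.uniprot.org/mapping/',
                 body=params,
                 encode='multipart')

# Parse the tab-separated response body into a data frame
result <- read.table(textConnection(content(response, 'text')),
                     stringsAsFactors=FALSE, header=TRUE)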

and result is

   From         To
1 51692     G5E9W3
2 51692     Q9UKF6
3  1478     P33240
4 26986 A0A024R9C1
5 26986     P11940
