Question

Fasta Files - How to search for peptide sequences?

3.8 years ago

janainamace ▴ 10

Hello,

I have a large file with 105 protein sequences. To obtain this file I used the 'seqinr' package:

library(seqinr)

# Read all protein sequences as single strings
myfasta <- read.fasta(file = "mydata.fasta", seqtype = "AA", as.string = TRUE, set.attributes = FALSE)

# Table with the IDs of the sequences to keep (column "ID")
subsetlist <- read.table("mylist.txt", header = TRUE)

# Keep only the sequences whose names appear in the ID column
my_fasta_sub <- myfasta[names(myfasta) %in% subsetlist$ID]

write.fasta(sequences = my_fasta_sub, names = names(my_fasta_sub), nbchar = 80, file.out = "myresylt.fasta")

Next, I would like to search for some peptide sequences within these 105 protein sequences and, for each match, find out which two amino acids come immediately before the first amino acid of the peptide. Does anyone have any idea how I can do this, please? I'm new to the R environment.

Example of peptide sequences:

DAIPAVEVFEGEPGNK

AVFQLLDSMGPSLPIAEYIASLDRPR

GFCFITFKEEEPVKK

HAFSGGRDTIEEHR

I would like to know which two amino acids come immediately before each peptide, for example:

_ _ DAIPAVEVFEGEPGNK

_ _ AVFQLLDSMGPSLPIAEYIASLDRPR

_ _ GFCFITFKEEEPVKK

_ _ HAFSGGRDTIEEHR

Thanks in advance!

R • 1.6k views

ADD COMMENT • link updated 3.8 years ago by Alex Nesmelov ▴ 200 • written 3.8 years ago by janainamace ▴ 10

You can use zero-length assertions (lookarounds) in R. An approach without zero-length assertions:

# Peptide sequences to search for
> pepseq=c("AVFQLLDSMGPSLPIAEYIASLDRPR","DAIPAVEVFEGEPGNK")
# Protein sequences to be searched against
> pepseq1=c("OAAAADDAIPAVEVFEGEPGNK","CDDDDDDPDAVFQLLDSMGPSLPIAEYIASLDRPR")
> library(stringr)
# For each peptide: keep the proteins containing it, remove the peptide,
# then take the last two characters of what is left
> for (i in seq_along(pepseq)){
+     print (str_sub(str_remove(grep (pepseq[i],pepseq1, value = TRUE), pepseq[i]),-2,-1))
+ }
[1] "PD"
[1] "AD"

If both vectors are in the same order, i.e. the protein sequences (pepseq2 below) are listed in exactly the same order as the query peptides (pepseq), the call can be vectorised:

pepseq=c("AVFQLLDSMGPSLPIAEYIASLDRPR","DAIPAVEVFEGEPGNK")
pepseq2=c("PDAVFQLLDSMGPSLPIAEYIASLDRPR","ADDAIPAVEVFEGEPGNK")
print(str_sub(str_remove(pepseq2,pepseq),-2,-1))
[1] "PD" "AD"
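
The reply mentions zero-length assertions without showing one. Below is a minimal sketch of what a lookahead version might look like in base R with `perl = TRUE`. The helper name `upstream_two` is hypothetical and not from the thread. The pattern `..(?=PEPTIDE)` matches the two characters immediately before the peptide without consuming the peptide itself, so, unlike removing the peptide and taking the last two characters, it should not depend on the peptide sitting at the very end of the protein. It assumes the peptides contain only plain letters (no regex metacharacters).

```r
# Hypothetical helper (not from the thread): return the two residues
# immediately upstream of the first occurrence of `peptide` in each protein,
# or NA where the peptide is absent or has fewer than two residues before it.
upstream_two <- function(peptide, proteins) {
  # ".." = any two characters; "(?=...)" = zero-length lookahead for the peptide
  pattern <- paste0("..(?=", peptide, ")")
  hits <- regexpr(pattern, proteins, perl = TRUE)
  matched <- as.vector(hits) != -1   # regexpr() returns -1 where nothing matched
  out <- rep(NA_character_, length(proteins))
  out[matched] <- regmatches(proteins, hits)
  out
}

peptides <- c("AVFQLLDSMGPSLPIAEYIASLDRPR", "DAIPAVEVFEGEPGNK")
proteins <- c("OAAAADDAIPAVEVFEGEPGNK", "CDDDDDDPDAVFQLLDSMGPSLPIAEYIASLDRPR")

# One result vector per peptide, aligned with `proteins`
result <- lapply(peptides, upstream_two, proteins = proteins)
names(result) <- peptides
print(result)
```

Each element of `result` has one entry per protein, so the position of a non-NA value tells you which protein contained the peptide.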

ADD REPLY • link 3.8 years ago by cpad0112 21k

score 2 · Answer 1 · 2020-07-07

Hi! If you put sequences of peptides of interest into a vector like peptides_vector = c("DAIPAVEVFEGEPGNK", "AVFQLLDSMGPSLPIAEYIASLDRPR") you can do the following:

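library(tidyverse)

upstream_peptides <-
  # Outer loop: one pass per peptide; results are row-bound into one tibble
  map_dfr(peptides_vector, function(peptide) {
    # Inner loop: one pass per protein sequence and its name
    map2_dfr(my_fasta_sub, names(my_fasta_sub),
             function(current_sequence, current_sequence_name) {
               # Collapse to a single string in case the sequence is a character vector
               current_sequence = paste0(current_sequence, collapse = "")
               # Start and end positions of every occurrence of the peptide
               coordinates = str_locate_all(current_sequence, peptide)[[1]]

               # Zero rows are returned when the peptide is absent from this protein
               tibble(protein_name = current_sequence_name,
                      peptide_start = coordinates[, 1],
                      peptide_end = coordinates[, 2],
                      # pmax() clamps at position 1 so that a peptide near the start
                      # of a protein does not index before the first residue
                      full_match = str_sub(current_sequence,
                                           pmax(peptide_start - 2, 1),
                                           peptide_end),
                      upstream = str_sub(current_sequence,
                                         pmax(peptide_start - 2, 1),
                                         peptide_start - 1))
             })
  }) %>%
  filter(!is.na(upstream))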
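
For readers who prefer to avoid the tidyverse dependency, the same idea (locate every occurrence, then slice out the residues just before it) can be sketched in base R. This is only an illustration under stated assumptions: the function name `find_upstream`, the `n_upstream` parameter, and the toy proteins `protA` to `protC` are all hypothetical and not from the thread. It expects a named list of sequences stored as single strings, which is what `read.fasta(..., as.string = TRUE)` is used for in the question.

```r
# Hypothetical helper (not from the thread): locate every occurrence of each
# peptide in each protein and report up to `n_upstream` residues before it.
find_upstream <- function(peptides, proteins, n_upstream = 2) {
  rows <- list()
  for (peptide in peptides) {
    for (protein_name in names(proteins)) {
      sequence <- proteins[[protein_name]]
      # fixed = TRUE treats the peptide as a literal string, not a regex
      starts <- gregexpr(peptide, sequence, fixed = TRUE)[[1]]
      starts <- as.integer(starts[starts > 0])  # -1 means "no match"
      if (length(starts) == 0) next
      rows[[length(rows) + 1]] <- data.frame(
        peptide = peptide,
        protein_name = protein_name,
        peptide_start = starts,
        peptide_end = starts + nchar(peptide) - 1L,
        # pmax() clamps at 1, so a peptide at the very start yields ""
        upstream = substring(sequence, pmax(starts - n_upstream, 1L), starts - 1L),
        stringsAsFactors = FALSE
      )
    }
  }
  # NULL is returned if no peptide was found in any protein
  do.call(rbind, rows)
}

# Toy data for illustration only
toy_proteins <- list(
  protA = "OAAAADDAIPAVEVFEGEPGNK",
  protB = "CDDDDDDPDAVFQLLDSMGPSLPIAEYIASLDRPR",
  protC = "GFCFITFKEEEPVKKMM"
)
toy_peptides <- c("DAIPAVEVFEGEPGNK", "AVFQLLDSMGPSLPIAEYIASLDRPR", "GFCFITFKEEEPVKK")

res <- find_upstream(toy_peptides, toy_proteins)
print(res)
```

In this toy input the third peptide starts at the first residue of `protC`, so its `upstream` value is an empty string rather than an error. With the data from the question, the call would presumably be `find_upstream(peptides_vector, my_fasta_sub)`.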