Fasta Files - How to search for peptide sequences?
1
0
Entering edit mode
3.8 years ago
janainamace ▴ 10

Hello,

I have a large file with 105 protein sequences. To obtain this file I used the 'seqinr' function:

library (seqinr)

myfasta <- read.fasta (file = "mydata.fasta", seqtype = "AA", as.string = TRUE, set.attributes = FALSE)

subsetlist <-read.table ("mylist.txt", header = TRUE)

my_fasta_sub <- myfasta [names (myfasta)% in% subsetlist $ ID]

write.fasta (sequences = my_fasta_sub, names = names (my_fasta_sub), nbchar = 80, file.out = "myresylt.fasta")

Next, I would like to look for some sequences of peptides within this list with the 105 sequences, and furthermore, to know which are the two amino acids before the first amino acid of the peptide. Does anyone have any idea how I can do this, please? I'm new to the R environment.

Example of peptide sequences:

DAIPAVEVFEGEPGNK

AVFQLLDSMGPSLPIAEYIASLDRPR

GFCFITFKEEEPVKK

HAFSGGRDTIEEHR

I would like to know what the two amino acids are before, for example:

_ _ DAIPAVEVFEGEPGNK

_ _ AVFQLLDSMGPSLPIAEYIASLDRPR

_ _ GFCFITFKEEEPVKK

_ _ HAFSGGRDTIEEHR

Thanks in advance!

R • 1.6k views
ADD COMMENT
1
Entering edit mode

Use zero length assertions in R. Without zero length assertions:

# Protein sequences used in searching
> pepseq=c("AVFQLLDSMGPSLPIAEYIASLDRPR","DAIPAVEVFEGEPGNK")
# Protein sequences to be searched against
> pepseq1=c("OAAAADDAIPAVEVFEGEPGNK","CDDDDDDPDAVFQLLDSMGPSLPIAEYIASLDRPR")
> library(stringr)
> for (i in seq_along(1:length(pepseq))){
+     print (str_sub(str_remove(grep (pepseq[i],pepseq1, value = T), pepseq[i]),-2,-1))
+ }
[1] "PD"
[1] "AD"

if Both the sequences are in the same order i.e order of protein sequences (pepseq1 above) is exactly as query sequences (pepseq above):

pepseq=c("AVFQLLDSMGPSLPIAEYIASLDRPR","DAIPAVEVFEGEPGNK")
pepseq2=c("PDAVFQLLDSMGPSLPIAEYIASLDRPR","ADDAIPAVEVFEGEPGNK")
print(str_sub(str_remove(pepseq2,pepseq),-2,-1))
[1] "PD" "AD"
ADD REPLY
2
Entering edit mode
3.8 years ago
Alex Nesmelov ▴ 200

Hi! If you put sequences of peptides of interest into a vector like peptides_vector = c("DAIPAVEVFEGEPGNK", "AVFQLLDSMGPSLPIAEYIASLDRPR") you can do the following:

library(tidyverse)

 upstream_peptides <-
 map(peptides_vector,

    function(peptide) {

     map2(my_fasta_sub,
          names(my_fasta_sub), 

          function(current_sequence,
                   current_sequence_name) {

                   current_sequence = paste0(current_sequence, collapse ="")
                   coordinates = str_locate_all(current_sequence,  peptide)[[1]]

                   tibble(protein_name =current_sequence_name,
                          peptide_start = coordinates[,1],
                          peptide_end = coordinates[,2],
                          full_match = str_sub(current_sequence,
                                               peptide_start-2,
                                               peptide_end),
                          upstream = str_sub(current_sequence,
                                               peptide_start-2,
                                               peptide_start-1))
                                          })
                    }) %>% 
 reduce(bind_rows) %>% 
 reduce(bind_rows) %>% 
 filter(!is.na(upstream))
ADD COMMENT
0
Entering edit mode

Really thank you! It worked perfectly and helped me a lot!

ADD REPLY

Login before adding your answer.

Traffic: 2870 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6