Question: Fasta Files - How to search for peptide sequences?
0
gravatar for janainamace
29 days ago by
janainamace10
Brazil
janainamace10 wrote:

Hello,

I have a large file with 105 protein sequences. To obtain this file I used the 'seqinr' function:

library (seqinr)

myfasta <- read.fasta (file = "mydata.fasta", seqtype = "AA", as.string = TRUE, set.attributes = FALSE)

subsetlist <-read.table ("mylist.txt", header = TRUE)

my_fasta_sub <- myfasta [names (myfasta)% in% subsetlist $ ID]

write.fasta (sequences = my_fasta_sub, names = names (my_fasta_sub), nbchar = 80, file.out = "myresylt.fasta")

Next, I would like to look for some sequences of peptides within this list with the 105 sequences, and furthermore, to know which are the two amino acids before the first amino acid of the peptide. Does anyone have any idea how I can do this, please? I'm new to the R environment.

Example of peptide sequences:

DAIPAVEVFEGEPGNK

AVFQLLDSMGPSLPIAEYIASLDRPR

GFCFITFKEEEPVKK

HAFSGGRDTIEEHR

I would like to know what the two amino acids are before, for example:

_ _ DAIPAVEVFEGEPGNK

_ _ AVFQLLDSMGPSLPIAEYIASLDRPR

_ _ GFCFITFKEEEPVKK

_ _ HAFSGGRDTIEEHR

Thanks in advance!

R • 115 views
ADD COMMENTlink modified 29 days ago by Alex Nesmelov110 • written 29 days ago by janainamace10
1

Use zero length assertions in R. Without zero length assertions:

# Protein sequences used in searching
> pepseq=c("AVFQLLDSMGPSLPIAEYIASLDRPR","DAIPAVEVFEGEPGNK")
# Protein sequences to be searched against
> pepseq1=c("OAAAADDAIPAVEVFEGEPGNK","CDDDDDDPDAVFQLLDSMGPSLPIAEYIASLDRPR")
> library(stringr)
> for (i in seq_along(1:length(pepseq))){
+     print (str_sub(str_remove(grep (pepseq[i],pepseq1, value = T), pepseq[i]),-2,-1))
+ }
[1] "PD"
[1] "AD"

if Both the sequences are in the same order i.e order of protein sequences (pepseq1 above) is exactly as query sequences (pepseq above):

pepseq=c("AVFQLLDSMGPSLPIAEYIASLDRPR","DAIPAVEVFEGEPGNK")
pepseq2=c("PDAVFQLLDSMGPSLPIAEYIASLDRPR","ADDAIPAVEVFEGEPGNK")
print(str_sub(str_remove(pepseq2,pepseq),-2,-1))
[1] "PD" "AD"
ADD REPLYlink modified 29 days ago • written 29 days ago by cpad011213k
2
gravatar for Alex Nesmelov
29 days ago by
Alex Nesmelov110
Alex Nesmelov110 wrote:

Hi! If you put sequences of peptides of interest into a vector like peptides_vector = c("DAIPAVEVFEGEPGNK", "AVFQLLDSMGPSLPIAEYIASLDRPR") you can do the following:

library(tidyverse)

 upstream_peptides <-
 map(peptides_vector,

    function(peptide) {

     map2(my_fasta_sub,
          names(my_fasta_sub), 

          function(current_sequence,
                   current_sequence_name) {

                   current_sequence = paste0(current_sequence, collapse ="")
                   coordinates = str_locate_all(current_sequence,  peptide)[[1]]

                   tibble(protein_name =current_sequence_name,
                          peptide_start = coordinates[,1],
                          peptide_end = coordinates[,2],
                          full_match = str_sub(current_sequence,
                                               peptide_start-2,
                                               peptide_end),
                          upstream = str_sub(current_sequence,
                                               peptide_start-2,
                                               peptide_start-1))
                                          })
                    }) %>% 
 reduce(bind_rows) %>% 
 reduce(bind_rows) %>% 
 filter(!is.na(upstream))
ADD COMMENTlink modified 29 days ago • written 29 days ago by Alex Nesmelov110

Really thank you! It worked perfectly and helped me a lot!

ADD REPLYlink written 28 days ago by janainamace10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1606 users visited in the last hour