Identify small strings in a larger sequence in R
3.4 years ago
Peter ▴ 20

Hi,

I have a vector containing small strings of interest:

seq_vector <- c("NET | NST | NVT | NIT | NCT | NYT | NHT | NRT | NNT | NDT | NTT ")

And I would like to find these small strings in larger strings, which are in my .txt file:

A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR
A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK
A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK

The file has ~3600 entries.

I am able to perform this procedure on a .fasta sequence using the seqinr, tidyverse and Biostrings packages, but I am having trouble doing the same with the data above.

Does anyone have any ideas, and could you help me? I'm only interested in the sequences that contain a match.

I would like to get something like:

A0A0D9S786.....STDQNHSTETPNLAAAVPSSVSVPR...... NHS
A0A0D9R8B0.....STEVQGMKVNGTKTDNNEGPK ............ NGT

Thank you in advance!


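Read in your data with scan() and keep the matching lines with grep(), for example ('your_file.txt' is a placeholder, and the spaces around each | are stripped from the pattern first):

lines <- scan("your_file.txt", what = "", sep = "\n")
hits <- grep(gsub(" ", "", seq_vector), lines, value = TRUE)
cat(hits, sep = "\n", file = "output.txt", append = TRUE)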

Have you looked at grep() or the stringr package?
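
As a starting point for the grep() suggestion, here is a minimal base R sketch. It assumes the alternation pattern is written without spaces, and it adds NHS to the pattern purely so that the example lines from the question produce a hit; the variable names and that extra motif are illustrative and do not come from the original post.

```r
# Example lines copied from the question; in practice read them with readLines().
seqs <- c(
  "A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR",
  "A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK",
  "A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK"
)

# Alternation pattern without spaces; NHS is added only so the example has a hit.
pattern <- "NET|NST|NVT|NIT|NCT|NYT|NHT|NRT|NNT|NDT|NTT|NHS"

# grepl() flags the lines that contain at least one motif.
has_hit <- grepl(pattern, seqs)

# regmatches() with gregexpr() pulls out every non-overlapping match per line.
hits <- regmatches(seqs, gregexpr(pattern, seqs))

# Keep the matching lines only and append the motifs found in each.
result <- paste(seqs[has_hit], sapply(hits[has_hit], paste, collapse = ","))
print(result)
```

Because the whole line is searched, an accession that happened to contain a motif would also count as a hit; searching only the peptide part avoids that.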

3.4 years ago

Example data

df <- structure(list(seq = c("A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR", 
"A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK", "A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK"
)), row.names = c(NA, -3L), class = "data.frame")

> df
                                              seq
1   A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR
2        A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK
3 A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK

A tidyverse solution.

library("tidyverse")

# I properly formatted your regex and also added a few more seqs since there were no matches with your original example.
seq_vector <- "NET|NST|NVT|NIT|NCT|NYT|NHT|NRT|NNT|NDT|NTT|NHS|LAA|QVA"

matches <- df %>%
  mutate(
    n_matches=str_count(seq, seq_vector),
    matches=str_extract_all(seq, seq_vector)
  ) %>%
  filter(n_matches > 0) %>%
  # Recent tidyr releases may require names_sep here, e.g. unnest_wider(matches, names_sep = "_").
  unnest_wider(matches)

> matches
# A tibble: 2 x 4
  seq                                             n_matches ...1  ...2 
  <chr>                                               <int> <chr> <chr>
1 A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR           2 NHS   LAA  
2 A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK         1 QVA   NA
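
If the data start out as a plain .txt file rather than a data frame, a base R variant of the same idea might look like the sketch below. It assumes each line is an accession followed by a run of dots and/or spaces and then the peptide; the temporary files, the column names, and the `;` separator are illustrative choices, not part of the answer above.

```r
# Write the question's three example lines to a temporary file so the sketch is
# self-contained; with real data, point `path` at your own .txt file instead.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR",
  "A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK",
  "A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK"
), path)

lines <- readLines(path)

# Assumed layout: accession, a run of dots and/or spaces, then the peptide.
id  <- sub("^([A-Za-z0-9]+)[. ]+.*$", "\\1", lines)
pep <- sub("^[A-Za-z0-9]+[. ]+", "", lines)

# Same pattern as in the tidyverse answer.
seq_vector <- "NET|NST|NVT|NIT|NCT|NYT|NHT|NRT|NNT|NDT|NTT|NHS|LAA|QVA"

# Search only the peptide, so the accession can never produce a hit.
found <- regmatches(pep, gregexpr(seq_vector, pep))

out <- data.frame(
  id = id,
  peptide = pep,
  matches = sapply(found, paste, collapse = ";"),
  stringsAsFactors = FALSE
)

# Keep only the peptides with at least one match.
out <- out[lengths(found) > 0, ]

# Tab-separated output; out_path is a temporary file used for illustration.
out_path <- tempfile(fileext = ".tsv")
write.table(out, out_path, sep = "\t", quote = FALSE, row.names = FALSE)
print(out)
```

Collapsing the matches into one column keeps a single row per peptide regardless of how many motifs it contains, which is close to the layout requested in the question.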

Thank you very much for the help, rpolicastro! It worked perfectly!


You can accept the answer (green checkmark) to provide closure to this thread then.
