Question: Identify small strings in a larger sequence in R
Peter wrote:

Hi,

I have a vector containing small strings of interest:

seq_vector <-c ("NET | NST | NVT | NIT | NCT | NYT | NHT | NRT | NNT | NDT | NTT ")

And I would like to find these small strings in larger strings, which are in my .txt file:

A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR
A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK
A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK

This file has ~3600 entries.

I am able to perform this procedure on a .fasta file, using the seqinr, tidyverse and Biostrings packages, but I am having trouble doing the same with the data above.

Does anyone have any ideas and could help me? I'm only interested in sequences that contain a match.

I would like to get something like:

A0A0D9S786.....STDQNHSTETPNLAAAVPSSVSVPR...... NHS
A0A0D9R8B0.....STEVQGMKVNGTKTDNNEGPK ............ NGT

Thank you in advance!

modified 7 weeks ago by rpolicastro • written 7 weeks ago by Peter

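Read in your data using scan(), then grep() the lines against your pattern and append the hits to a file:

lines <- scan("your_file.txt", what = "", sep = "\n")   # placeholder file name; one element per line
pattern <- gsub(" ", "", seq_vector)                     # remove the spaces around "|" so the alternation can match
cat(grep(pattern, lines, value = TRUE), sep = "\n", file = "output.txt", append = TRUE)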
modified 7 weeks ago • written 7 weeks ago by english.server

Have you looked at grep() or the stringr package?

written 7 weeks ago by Jean-Karim Heriche
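
To illustrate the grep() suggestion, here is a minimal base R sketch. It assumes the lines are already in a character vector (called `lines` here purely for illustration) and that the pattern is written without spaces around `|`. `NHS` and `NGT` are not in the original motif list; they are taken from the desired output in the question so that the example data produces hits.

```r
# Example lines copied from the question; in practice they would come from
# readLines() or scan() on the .txt file.
lines <- c(
  "A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR",
  "A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK",
  "A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK"
)

# Alternation without spaces. NHS and NGT are assumptions added for this
# example (they appear in the desired output, not in the original list).
pattern <- "NET|NST|NVT|NIT|NCT|NYT|NHT|NRT|NNT|NDT|NTT|NHS|NGT"

# grepl() flags the lines that contain at least one motif.
has_hit <- grepl(pattern, lines)

# gregexpr() + regmatches() extract every non-overlapping match per line.
hits <- regmatches(lines, gregexpr(pattern, lines))

# Keep only the matching lines and append the motifs found in each one.
result <- paste(lines[has_hit], sapply(hits[has_hit], paste, collapse = ","))
print(result)
```

Because `regmatches()` returns a list with one character vector per line, lines with several motifs keep all of them, joined by commas in `result`.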
rpolicastro wrote:

Example data

df <- structure(list(seq = c("A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR", 
"A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK", "A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK"
)), row.names = c(NA, -3L), class = "data.frame")

> df
                                              seq
1   A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR
2        A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK
3 A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK

A tidyverse solution.

library("tidyverse")

# Regex reformatted (no spaces around "|"). NHS, LAA and QVA were added because the original motifs have no matches in the example data.
seq_vector <- "NET|NST|NVT|NIT|NCT|NYT|NHT|NRT|NNT|NDT|NTT|NHS|LAA|QVA"

matches <- df %>%
  mutate(
    n_matches=str_count(seq, seq_vector),
    matches=str_extract_all(seq, seq_vector)
  ) %>%
  filter(n_matches > 0) %>%
  unnest_wider(matches)

> matches
# A tibble: 2 x 4
  seq                                             n_matches ...1  ...2 
  <chr>                                               <int> <chr> <chr>
1 A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR           2 NHS   LAA  
2 A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK         1 QVA   NA
modified 7 weeks ago • written 7 weeks ago by rpolicastro
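
As a variation on the answer above, the sketch below keeps the motifs in a character vector, searches only the peptide part of each line, and returns one row per match instead of automatically named `...1`, `...2` columns. The column names `id`, `peptide` and `motif` are chosen for illustration, and the two regular expressions used to split each line are assumptions based on the three example lines (ID before the first dot, peptide as the trailing run of capital letters); they may need adjusting for the full file.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(seq = c(
  "A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR",
  "A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK",
  "A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK"
))

# Motifs as a plain character vector, collapsed into a single alternation.
# Same list as in the answer above, including the added NHS, LAA and QVA.
motifs <- c("NET", "NST", "NVT", "NIT", "NCT", "NYT", "NHT", "NRT",
            "NNT", "NDT", "NTT", "NHS", "LAA", "QVA")
pattern <- str_c(motifs, collapse = "|")

long_matches <- df %>%
  mutate(
    # Assumption: the ID is everything before the first "."
    id = str_extract(seq, "^[^.]+"),
    # Assumption: the peptide is the trailing run of capital letters
    peptide = str_extract(seq, "[A-Z]+$"),
    # List-column with all non-overlapping matches in the peptide
    motif = str_extract_all(peptide, pattern)
  ) %>%
  unnest(motif) %>%   # one row per match; lines without a match are dropped
  select(id, peptide, motif)

print(long_matches)
```

The long layout can be written out with `write.table()` or filtered further without having to know in advance how many motifs a line contains.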

Thank you very much for the help rpolicastro! Worked perfectly!!

written 7 weeks ago by Peter

You can accept the answer (green checkmark) to provide closure to this thread then.

written 7 weeks ago by GenoMax