Pattern match using R Biostrings
18 months ago
asumani ▴ 70

Hi,

Aim: I am trying to get the positions and types of all stop codons in a DNAString object or a character string.

stops <- c("TAG", "TAA", "TGA")
vmatchPattern(stops, stringObj)  # fails: vmatchPattern() takes a single pattern

I also tried defining stops as "TAA|TAG|TGA", but I know that neither form is supported by the vmatchPattern function. Then I tried:

library(Biostrings)
library(dplyr)  # provides %>%

# Match each stop codon separately, then stack the results
stop1 <- matchPattern("TAG", as(trx, "character")) %>%
  as.data.frame()
stop2 <- matchPattern("TAA", as(trx, "character")) %>%
  as.data.frame()
stop3 <- matchPattern("TGA", as(trx, "character")) %>%
  as.data.frame()
stops <- rbind(stop1, stop2, stop3)

The outcome below is what I want, but I wish I could find a cleverer solution.

   start  end width seq
1    178  180     3 TAG
2    400  402     3 TAG
3    427  429     3 TAG
4    574  576     3 TAG
5    344  346     3 TAA
6    443  445     3 TAA
7    692  694     3 TAA
8     48   50     3 TGA
9     88   90     3 TGA
10   437  439     3 TGA
11   455  457     3 TGA
12   496  498     3 TGA
13   509  511     3 TGA
14   538  540     3 TGA
15   649  651     3 TGA
16   746  748     3 TGA

Is there a more concise way to solve this?
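One way to avoid repeating the call for each codon is to loop over the stop codons and bind the results, as in the minimal sketch below. The toy sequence `toy_seq` and the names `hits` and `stop_codons` are made up for illustration and are not the transcript from this post; only `matchPattern()`, `start()`, `end()` and `width()` come from Biostrings.

```r
library(Biostrings)

# Toy sequence, invented for illustration (not the poster's data)
toy_seq <- DNAString("ATGTAGCCTAAGGTGATTT")
stop_codons <- c("TAG", "TAA", "TGA")

# Run matchPattern() once per codon and collect the hits in one data frame
hits <- do.call(rbind, lapply(stop_codons, function(codon) {
  m <- matchPattern(codon, toy_seq)
  data.frame(start = start(m),
             end   = end(m),
             width = width(m),
             seq   = rep(codon, length(m)))
}))

# Sort by position so the codons appear in sequence order
hits <- hits[order(hits$start), ]
rownames(hits) <- NULL
print(hits)
```

A codon with no matches contributes a zero-row data frame, so `rbind` still works if one of the three is absent.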

sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_IE.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_IE.UTF-8        LC_COLLATE=en_IE.UTF-8    
 [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_IE.UTF-8   
 [7] LC_PAPER=en_IE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biostrings_2.64.1   GenomeInfoDb_1.32.4 XVector_0.36.0      IRanges_2.30.1     
 [5] S4Vectors_0.34.0    BiocGenerics_0.42.0 gridExtra_2.3       forcats_0.5.2      
 [9] stringr_1.4.1       dplyr_1.0.10        purrr_0.3.4         readr_2.1.3        
[13] tidyr_1.2.1         tibble_3.1.8        ggplot2_3.3.6       tidyverse_1.3.2    

loaded via a namespace (and not attached):
 [1] lubridate_1.8.0        assertthat_0.2.1       digest_0.6.29         
 [4] utf8_1.2.2             R6_2.5.1               cellranger_1.1.0      
 [7] backports_1.4.1        reprex_2.0.2           evaluate_0.16         
[10] httr_1.4.4             pillar_1.8.1           zlibbioc_1.42.0       
[13] rlang_1.0.6            googlesheets4_1.0.1    readxl_1.4.1          
[16] rstudioapi_0.14        rmarkdown_2.16         labeling_0.4.2        
[19] googledrive_2.0.0      bit_4.0.4              RCurl_1.98-1.8        
[22] munsell_0.5.0          broom_1.0.1            compiler_4.2.1        
[25] modelr_0.1.9           xfun_0.33              pkgconfig_2.0.3       
[28] htmltools_0.5.3        tidyselect_1.1.2       GenomeInfoDbData_1.2.8
[31] fansi_1.0.3            crayon_1.5.2           tzdb_0.3.0            
[34] dbplyr_2.2.1           withr_2.5.0            bitops_1.0-7          
[37] grid_4.2.1             jsonlite_1.8.2         gtable_0.3.1          
[40] lifecycle_1.0.2        DBI_1.1.3              magrittr_2.0.3        
[43] scales_1.2.1           vroom_1.6.0            cli_3.4.1             
[46] stringi_1.7.8          farver_2.1.1           fs_1.5.2              
[49] xml2_1.3.3             ellipsis_0.3.2         generics_0.1.3        
[52] vctrs_0.4.2            tools_4.2.1            bit64_4.0.5           
[55] glue_1.6.2             hms_1.1.2              parallel_4.2.1        
[58] fastmap_1.1.0          yaml_2.3.5             colorspace_2.0-3      
[61] gargle_1.2.1           rvest_1.0.3            knitr_1.40            
[64] haven_2.5.1

Could you give an example of your original dataset? For instance, the output of dput(stringObj).


Here is an example. I want to do the search on individual transcripts, not the entire object.

>dput(trx)
new("DNAStringSet", pool = new("SharedRaw_Pool", xp_list = list(
    <pointer: (nil)>), .link_to_cached_object_list = list(<environment>)), 
    ranges = new("GroupedIRanges", group = 1L, start = 108951763L, 
        width = 2252L, NAMES = "GABPXX", 
        elementType = "ANY", elementMetadata = NULL, metadata = list()), 
    elementType = "DNAString", elementMetadata = NULL, metadata = list())
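For searching transcript by transcript rather than across the whole object, a rough sketch is to apply a small helper to each element of the DNAStringSet. Everything named here (`trx_set`, `find_stops`, `all_stops`, the transcript names `tx1` and `tx2`) is hypothetical and built on toy sequences, since the dput output above does not carry the actual sequence.

```r
library(Biostrings)

# Toy DNAStringSet with hypothetical transcript names
trx_set <- DNAStringSet(c(tx1 = "ATGTAGCCTAAGG", tx2 = "CCTGACCTGA"))
stop_codons <- c("TAG", "TAA", "TGA")

# Hypothetical helper: all stop codon hits in a single DNAString
find_stops <- function(dna) {
  do.call(rbind, lapply(stop_codons, function(codon) {
    m <- matchPattern(codon, dna)
    data.frame(start = start(m),
               end   = end(m),
               seq   = rep(codon, length(m)))
  }))
}

# trx_set[[i]] extracts one transcript as a DNAString
per_tx <- lapply(seq_along(trx_set), function(i) {
  res <- find_stops(trx_set[[i]])
  data.frame(transcript = rep(names(trx_set)[i], nrow(res)), res)
})
all_stops <- do.call(rbind, per_tx)
print(all_stops)
```

Each row keeps the transcript name, so the combined table can be filtered or split per transcript afterwards.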

Is something wrong with this? Since it is a Bioconductor-specific question, I thought I could reach a wider range of people. I didn't intend to spam.


Generally, it's good etiquette to post in only one place at a time, since cross-posting uses the time of multiple scientists for a single question.


Thank you! I will be careful next time.


Thank you, my friend.
