Question: Extracting a specific string pattern from a list of objects
dodausp (Denmark/Copenhagen/BRIC) wrote 8 months ago:

Hi,

Here is a recurring problem I face from time to time, especially because I would rather have object names that resemble my file names than create a list with different names for those files. It is just easier for me to track down a file when troubleshooting. So, my question is: how can I extract a pattern from my object names rather than a specific string of characters? For example:

vcfs <- list.files()

vcfs
[1] "OV-TCGA-05-1456-01.vcf"   "OV-TCGA-05-4578-01.vcf"   "OV-TCGA-08-5666-01.vcf"   "LUSC-TCGA-10-5684-01.vcf" "LUAD-TCGA-02-6574-01.vcf"

So, as you can see, the first part of each file name defines the cohort type (OV, LUSC, LUAD) and the part after "TCGA" is unique to each file. I would like to (1) remove the hyphens, (2) keep the cohort name, and (3) keep the 6 digits that come after "TCGA". So it should look like this:

"OV051456", "OV054578", "OV085666", "LUSC105684", "LUAD026574"

Now, I always struggle with those symbols (*, ., ?, \, "") when extracting a string from a character object. So, if any of you could also recommend a good tutorial on them, I would truly appreciate it. And sorry for the simple question. I am not a hardcore bioinformatician. And I love how this community is always so engaging and helpful.

So, thanks a lot in advance!

Cheers,

Douglas

tcga snp subsetting data string R • 290 views

I also find regex confusing, and often refer to this site to help me:

https://regexr.com/

It has good information, cheatsheets, guides, and a live editor in which you can play around with your expressions.
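As a rough illustration of what the symbols from the question do in R, here is a small sketch using grepl(). The test strings in x and the variable names are made up for this example and are not from the thread. R's own help page, ?regex, covers the same ground. Note that a backslash has to be doubled inside an R string, so a literal dot is written "\\.".

```r
# Made-up strings, only for illustrating the metacharacters
x <- c("a.vcf", "abvcf", "ac", "abbc")

# "." matches any single character, so both "a.vcf" and "abvcf" match
dot_any     <- grepl("a.vcf", x)      # TRUE  TRUE  FALSE FALSE

# "\\." matches a literal dot only
dot_literal <- grepl("a\\.vcf", x)    # TRUE  FALSE FALSE FALSE

# "*" means zero or more of the preceding item; ^ and $ anchor start and end
star        <- grepl("^ab*c$", x)     # FALSE FALSE TRUE  TRUE

# "?" means zero or one of the preceding item
question    <- grepl("^ab?c$", x)     # FALSE FALSE TRUE  FALSE

# "+" means one or more of the preceding item
plus        <- grepl("^ab+c$", x)     # FALSE FALSE FALSE TRUE
```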

Reply written 8 months ago by Russ
Russ (Ontario Veterinary College, University of Guelph, Guelph, Ontario, Canada) wrote 8 months ago:

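There's probably a nifty regex one-liner that will accomplish the task more efficiently, but my strategy is simple and works:

# vcfs is the vector of file names from the question
vcf1 <- gsub("-TCGA-", "", vcfs)               # drop the "-TCGA-" separator
vcf2 <- gsub("-[0-9][0-9]\\.vcf$", "", vcf1)   # drop the trailing "-01.vcf" (dot escaped, anchored at the end)
vcf3 <- gsub("-", "", vcf2)                    # drop the remaining hyphens

> vcf3
[1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"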
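
For comparison, here is one way a single-expression version could look, using sub() with capture groups and backreferences. This is only a sketch: it assumes every file name follows the exact layout shown in the question (uppercase cohort, "-TCGA-", two digits, four digits, a two-digit suffix, ".vcf"), and the name ids is made up for the example. Names that do not fit the pattern are returned unchanged by sub().

```r
# File names copied from the question
vcfs <- c("OV-TCGA-05-1456-01.vcf", "OV-TCGA-05-4578-01.vcf",
          "OV-TCGA-08-5666-01.vcf", "LUSC-TCGA-10-5684-01.vcf",
          "LUAD-TCGA-02-6574-01.vcf")

# ^([A-Z]+)          cohort name at the start      (group 1)
# -TCGA-             literal text
# ([0-9]{2})         two digits                    (group 2)
# -([0-9]{4})        hyphen, then four digits      (group 3)
# -[0-9]{2}\\.vcf$   trailing suffix and extension (not kept)
ids <- sub("^([A-Z]+)-TCGA-([0-9]{2})-([0-9]{4})-[0-9]{2}\\.vcf$",
           "\\1\\2\\3", vcfs)
ids
# [1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"
```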

That was really helpful! It worked just fine for all the files I had, with no errors. Also, I liked the way you laid it out. It made it very easy for me to understand each step of the routine.

Thanks a lot, Russ!

Reply written 8 months ago by dodausp
jweile wrote 8 months ago:

Having run into this kind of problem several times as well, I have since written a little helper function for these kinds of situations:

#' Extract regex groups (local)
#' 
#' Locally excise regular expression groups from string vectors.
#' I.e. only extract the first occurrence of each group within each string.
#' 
#' @param x A vector of strings from which to extract the groups.
#' @param re The regular expression defining the groups
#' @return A \code{matrix} containing the group contents, 
#'      with one row for each element of x and one column for each group.
#' @keywords regular expression groups
#' @export
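extract.groups <- function(x, re) {
    # perl=TRUE is needed for regexpr() to report capture-group positions
    matches <- regexpr(re, x, perl=TRUE)
    start <- attr(matches, "capture.start")
    end <- start + attr(matches, "capture.length") - 1
    # Build one column per capture group, with one row per element of x
    do.call(cbind, lapply(seq_len(ncol(start)), function(i) {
        sapply(seq_len(nrow(start)), function(j) {
            # capture.start is -1 when the pattern did not match this string
            if (start[j,i] > -1) substr(x[[j]], start[j,i], end[j,i]) else NA
        })
    }))
}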

For your specific problem, it can be used as follows:

> groups <- extract.groups(vcfs,"^(\\w+)-TCGA-(\\d{2})-(\\d{4})-01\\.vcf$")
> groups
     [,1]   [,2] [,3]  
[1,] "OV"   "05" "1456"
[2,] "OV"   "05" "4578"
[3,] "OV"   "08" "5666"
[4,] "LUSC" "10" "5684"
[5,] "LUAD" "02" "6574"
> output <- apply(groups,1,paste,collapse="")
> output
[1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"
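
If you would rather not carry a helper function around, base R's regexec() and regmatches() can produce a similar group matrix. The following is a minimal sketch under the same assumption about the file-name layout. The names re, m, groups, and output are just chosen for the example, and the pattern uses POSIX classes so that it does not depend on perl=TRUE.

```r
# File names copied from the question
vcfs <- c("OV-TCGA-05-1456-01.vcf", "OV-TCGA-05-4578-01.vcf",
          "OV-TCGA-08-5666-01.vcf", "LUSC-TCGA-10-5684-01.vcf",
          "LUAD-TCGA-02-6574-01.vcf")

# Three capture groups: cohort, two digits, four digits
re <- "^([[:alnum:]]+)-TCGA-([0-9]{2})-([0-9]{4})-01\\.vcf$"

# regexec() records the positions of the full match and of each group;
# regmatches() turns them into one character vector per file name
m <- regmatches(vcfs, regexec(re, vcfs))

# Drop the first element (the full match) and stack the groups row-wise
groups <- do.call(rbind, lapply(m, `[`, -1))

# Paste the groups of each row together
output <- apply(groups, 1, paste, collapse = "")
output
# [1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"
```

One difference to keep in mind: a file name that does not match yields an empty vector in m rather than a row of NA, so it is worth checking that nrow(groups) equals length(vcfs) before relying on the result.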

I tried your routine and it worked nicely as well. And it is really nice that you also went all the way to explain what the function does. I was glad to know that I am not alone with this "challenging" issue.

This community is ace! Thanks a lot, jweile!

Reply written 8 months ago by dodausp