Question: Any way to identify list of bed file with complicated pattern?
0
gravatar for Jurat Shahidin
2.3 years ago by
Italy
Jurat Shahidin40 wrote:

I have list of bed files that needed to used in my devel package as an external data. Because these bed files has rather complicated pattern, and I intend to get pair of bed files with same pattern for my packages use (i.e., pair of bed file with same pattern could be two or three in my data set, please look at my example). However, I gave my shot to make this happen, but my solution is not efficient and I got error instead. Can anyone point me out how to make this happen easily? How to efficiently manipulate file names with complicated pattern in R ? How to achieve my desired function output ? Any quick solution to solve this issue? Thanks a lot :)

Here is my external data set in my packages, list of bed files in extdata :

myPkg
    - inst
         - extdata
             - wgEncodeOpenChromChipK562CmycAlnRep1.bed
             - wgEncodeOpenChromChipK562CmycAlnRep2.bed
             - wgEncodeOpenChromChipK562CmycAlnRep3.bed
             - wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep1.bed
             - wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep2.bed
             - wgEncodeSydhTfbsK562CmycIfna30StdAlnRep1.bed
             - wgEncodeSydhTfbsK562CmycIfna30StdAlnRep2.bed
             - wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep1.bed
             - wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep2.bed
             - wgEncodeSydhTfbsK562CmycIggrabAlnRep1.bed
             - wgEncodeSydhTfbsK562CmycIggrabAlnRep2.bed
             - wgEncodeSydhTfbsK562CmycStdAlnRep1.bed
             - wgEncodeSydhTfbsK562CmycStdAlnRep2.bed
    - R

Here is part of my solution :

getExtDat <- function() {
  dir <- system.file("extdata", package="myPkg")
  files <- list.files(dir)
  # FIXME : I got error, make it more dynamic
  #patt <- gsub(pattern='wgE\\d+_(\\w+_\\w+)_.*', replacement='\\1',files)
  res <- paste(dir, files, sep="/")
  res <- as.list(res)
  return(res)
}

print(getExtDat)

And getExtdat object passed to this function :

myFunc <- function(getExtDat) {
  files <- list.files(getExtDat, full.names = TRUE, "\\.bed$")
  readMe <- lapply(files, import.bed)
  return(readMe)
}

Edit example :

let getExtDat find pair of bed files with same pattern (

for example, I need these bed files as an input for myFunc :

wgEncodeOpenChromChipK562CmycAlnRep1.bed, 
wgEncodeOpenChromChipK562CmycAlnRep2.bed, wgEncodeOpenChromChipK562CmycAlnRep3.bed

or I need these bed files as an input for myFunc :

wgEncodeSydhTfbsK562CmycIggrabAlnRep1.bed
wgEncodeSydhTfbsK562CmycIggrabAlnRep2.bed

or I need these bed files as an input for myFunc:

wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep1.bed
wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep2.bed

are my objective to find by getExtDat, instead of loading all bed files ), and pass getExtDat to myFunc.

How to make getExtDat function to find desired paired bed files available for myFunc as input ? How to correctly identify desired pattern of bed files more dynamically ? Any way to more programmatic solution for finding desired bed files with complicated pattern ? Any idea please ?

sequencing bed chip-seq R • 700 views
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Jurat Shahidin40
2

I do not know much R, but what I would do is something like,

li <- c("wgEncodeOpenChromChipK562CmycAlnRep1.bed", "wgEncodeOpenChromChipK562CmycAlnRep2.bed", "wgEncodeOpenChromChipK562CmycAlnRep3.bed", "wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep1.bed", "wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep2.bed")

replicates <- unique(gsub("wgEncode|Rep[0-9].bed", "", li))

replicates
[1] "OpenChromChipK562CmycAln"     "SydhTfbsK562CmycIfna6hStdAln"

#Then a for loop for each replicate,

grep(replicates[1], li, value=TRUE)
[1] "wgEncodeOpenChromChipK562CmycAlnRep1.bed"
[2] "wgEncodeOpenChromChipK562CmycAlnRep2.bed"
[3] "wgEncodeOpenChromChipK562CmycAlnRep3.bed"

grep(replicates[2], li, value=TRUE)
[1] "wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep1.bed"
[2] "wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep2.bed"
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by geek_y9.3k
1

@Goutham Atla: That's what I was also thinking... (I think there shouldn't be spaces around '|' in the gsub pattern)

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by dariober9.9k

Thanks, edited it.

ADD REPLYlink written 2.3 years ago by geek_y9.3k

@Goutham Atla : I edited my example as well, Is that possible to make your solution more programmatic ? I am wondering any dynamic solution is possibly happen or not. Thanks a lot :)

ADD REPLYlink written 2.3 years ago by Jurat Shahidin40

Dear Goutham:

Thanks for your quick respond here. Yes, name of bed files is rather long, I designed function for accepting list of bed files (for example, myFunc accept at most tow or three peak files with same pattern as an input, that I showed in my example). Could you have more general solution? what if I choose other bed files with same pattern in my external data ? Is that possible to make your solution more programmatic ? Plus, here is my devel package hosted on github. Thanks a lot :)

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Jurat Shahidin40

I don't think there can be a general solution. It depends on what defines the pattern that captures replicates, in your example stripping "Rep.*" is enough, but can you be sure that it is sufficient for other files? If the answer is no, than you could check that the number of files capture by each pattern is, say, between 2 and 4, but it may not be very robust.

I guess very much in theory one could have some probabilistic, machine learning strategy to find patterns discriminating groups but I'm pretty sure you don't want to go there for the job you describe! For example one could compute the edit distance between pairs of file names and cluster names on the base of that similarity distance.

ADD REPLYlink written 2.3 years ago by dariober9.9k

Dear dariober:

Thanks for your concern on my question. If you look at external data in my devel package , I must choose any three or two bed files for my package use. I think Goutham's solution so close to generate more robust solution. Maybe you could give me more interesting approach to solve this problem. Thank you :)

Best regards :

Jurat

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Jurat Shahidin40

Your package seems very interesting by the way, keep it up! Combining weak peaks is something I have been wondering about myself...

ADD REPLYlink written 2.3 years ago by dariober9.9k

Dear dariober :

Thanks for you are being supportive, appreciated. Yes, it could be interesting packages will be hosted in Bioconductor. Any possible contribution from Biostar community is very appreciated. Thanks again.

Best regards :

Jurat

ADD REPLYlink written 2.3 years ago by Jurat Shahidin40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1721 users visited in the last hour