Question: Create snakefile based on R script
gravatar for r.tor
14 months ago by
r.tor40 wrote:

I have extracted the Transcription Start Sites (TSS) from hsapiens gene ensembl database and TSS flnaking regions through biomaRt and GenomicRanges. Then all targeted data has been stored in a bed format file. The code looks like this:


mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)

transcripts <- getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end","strand",        "ensembl_gene_id","gene_biotype", "ensembl_transcript_id"),
  filters ="biotype",
  values  =c("protein_coding"),
  mart    =mart)

transcripts[transcripts$strand == 1,]$strand <- "+"
transcripts[transcripts$strand == -1,]$strand <- "-"

trans.range <- makeGRangesFromDataFrame(transcripts, keep.extra.columns = TRUE, ignore.strand =  FALSE, start.field = "transcript_start", end.field = "transcript_end", strand.field = "strand")

 #flank regions
 flank.regions <- promoters(trans.range, upstream=1500, downstream=500, use.names=TRUE)

 #to just keeping the interval ranges (IR) of standard chromosomes:
 stand.flank <-  flank.regions[seqnames(flank.regions) ==     c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22"),]

#change the levels to the only autosomal chr!
seqlevels(stand.flank) <- as.character(seq(1:22))

#split granges based on metadata elements (ensembl_gene_id)

gen.splitted <- split(stand.flank, stand.flank$ensembl_gene_id)

#reduce gen.splitted to get reduced intervals regions for each ensembl_gene_id in one list seperately
gen.reduced <- reduce(gen.splitted)

#to create bed.format file;
gen.bed <- unlist(gen.reduced)

#sort GRanges based on seqnames
gen.bed <- sortSeqlevels(gen.bed)
gen.bed <- sort(gen.bed)

#i want the "ensembl_gene_id" values to be stored in the "names" column of the BED file

 #To write data in a Granges object to a bed File.
 gen.bedfile <- data.frame(seqnames=seqnames(gen.bed),
 scores=c(rep(".", length(gen.bed))), 

 #To put a "." instead of score value in each score column!
 write.table(df, file="foo.bed", quote=F, sep="\t", row.names=F, col.names=F)

#split bed file based on seqnames(chromosome number)
gen.bed.splitted <- split(gen.bedfile, gen.bedfile$seqnames)

#Remove empty zero length rows in a list
gen.bedfile <- gen.bed.splitted[sapply(gen.bed.splitted, nrow)>0]

Now, I'd like to create a snakefile following snakemake mannual to create a small pipeline. Since, I am new in snakemake, I'd appreciate if someone could guide me how would be the input of the first rule where I think it needs to call the hsapiens gene ensembl database. And what about the shell section! because I heard it is not a good idea to transform R script into Snakemake directly, and the Python code would be a better choice. Thanks a lot.

snakefile snakemake python R • 1.4k views
ADD COMMENTlink modified 14 months ago by WouterDeCoster43k • written 14 months ago by r.tor40

your script didn't take any input paramater (correct ?). So I don't understand the point of doing a snakemake file of that ? Is this R script one step in your pipeline ? FYI a simple call Rscript your_script.R is ok to call the script

ADD REPLYlink written 14 months ago by Nicolas Rosewick8.7k

Thank you for the reply. I want to make a workflow as Snakefile looking at my R scripts and yes it would be a begining part of workflow. I have just extracted the target regions considering the applied attributes and filters by getBM function in Biomart package. The source of all, is "hsapiens gene ensembl database". I actually don't know what could be supposed as input in the Snakefile's first rule in this case!! Also, I have tried to put my R scripts directly into the "shell" section of the rule using:

rule tss_ranges:

    mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
    att <- listAttributes(mart)

    {output} = getBM(attributes=c("chromosome_name", "transcript_start",
    "transcript_end","strand", "ensembl_gene_id","gene_biotype",
    filters ="biotype",
    values  =c("protein_coding"),
    mart    =mart)


but it does not work. The corresponding error is about "Unexpected keyword Rscript in rule definition" and even "SyntaxError".

ADD REPLYlink modified 14 months ago • written 14 months ago by r.tor40

What about something like

rule tss_ranges:
        "Rscript your_script.R"
ADD REPLYlink written 14 months ago by Jean-Karim Heriche21k
gravatar for WouterDeCoster
14 months ago by
WouterDeCoster43k wrote:

The easiest would probably be to just save your code as an Rscript (mycode.R), and in the shell part run Rscript mycode.R, optionally with arguments.

ADD COMMENTlink written 14 months ago by WouterDeCoster43k

Hi, you mean is to put all R scripts in one rule? and what about the input section of the rule?

ADD REPLYlink modified 14 months ago • written 14 months ago by r.tor40

I think you serisouly need to read & understand snakemake tutorial

ADD REPLYlink modified 14 months ago • written 14 months ago by H.Hasani880


the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]].

Input would be the file you would supply to R for loading and output is expected output.

ADD REPLYlink modified 14 months ago • written 14 months ago by cpad011212k

Which input does that code take?

ADD REPLYlink written 14 months ago by WouterDeCoster43k

just downloading the ordered regions from database!

ADD REPLYlink written 14 months ago by r.tor40
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1095 users visited in the last hour