Question: Create snakefile based on R script
gravatar for r.tor
5 months ago by
r.tor40 wrote:

I have extracted the Transcription Start Sites (TSS) from hsapiens gene ensembl database and TSS flnaking regions through biomaRt and GenomicRanges. Then all targeted data has been stored in a bed format file. The code looks like this:


mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)

transcripts <- getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end","strand",        "ensembl_gene_id","gene_biotype", "ensembl_transcript_id"),
  filters ="biotype",
  values  =c("protein_coding"),
  mart    =mart)

transcripts[transcripts$strand == 1,]$strand <- "+"
transcripts[transcripts$strand == -1,]$strand <- "-"

trans.range <- makeGRangesFromDataFrame(transcripts, keep.extra.columns = TRUE, ignore.strand =  FALSE, start.field = "transcript_start", end.field = "transcript_end", strand.field = "strand")

 #flank regions
 flank.regions <- promoters(trans.range, upstream=1500, downstream=500, use.names=TRUE)

 #to just keeping the interval ranges (IR) of standard chromosomes:
 stand.flank <-  flank.regions[seqnames(flank.regions) ==     c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22"),]

#change the levels to the only autosomal chr!
seqlevels(stand.flank) <- as.character(seq(1:22))

#split granges based on metadata elements (ensembl_gene_id)

gen.splitted <- split(stand.flank, stand.flank$ensembl_gene_id)

#reduce gen.splitted to get reduced intervals regions for each ensembl_gene_id in one list seperately
gen.reduced <- reduce(gen.splitted)

#to create bed.format file;
gen.bed <- unlist(gen.reduced)

#sort GRanges based on seqnames
gen.bed <- sortSeqlevels(gen.bed)
gen.bed <- sort(gen.bed)

#i want the "ensembl_gene_id" values to be stored in the "names" column of the BED file

 #To write data in a Granges object to a bed File.
 gen.bedfile <- data.frame(seqnames=seqnames(gen.bed),
 scores=c(rep(".", length(gen.bed))), 

 #To put a "." instead of score value in each score column!
 write.table(df, file="foo.bed", quote=F, sep="\t", row.names=F, col.names=F)

#split bed file based on seqnames(chromosome number)
gen.bed.splitted <- split(gen.bedfile, gen.bedfile$seqnames)

#Remove empty zero length rows in a list
gen.bedfile <- gen.bed.splitted[sapply(gen.bed.splitted, nrow)>0]

Now, I'd like to create a snakefile following snakemake mannual to create a small pipeline. Since, I am new in snakemake, I'd appreciate if someone could guide me how would be the input of the first rule where I think it needs to call the hsapiens gene ensembl database. And what about the shell section! because I heard it is not a good idea to transform R script into Snakemake directly, and the Python code would be a better choice. Thanks a lot.

snakefile snakemake python R • 578 views
ADD COMMENTlink modified 5 months ago by WouterDeCoster39k • written 5 months ago by r.tor40

your script didn't take any input paramater (correct ?). So I don't understand the point of doing a snakemake file of that ? Is this R script one step in your pipeline ? FYI a simple call Rscript your_script.R is ok to call the script

ADD REPLYlink written 5 months ago by Nicolas Rosewick7.7k

Thank you for the reply. I want to make a workflow as Snakefile looking at my R scripts and yes it would be a begining part of workflow. I have just extracted the target regions considering the applied attributes and filters by getBM function in Biomart package. The source of all, is "hsapiens gene ensembl database". I actually don't know what could be supposed as input in the Snakefile's first rule in this case!! Also, I have tried to put my R scripts directly into the "shell" section of the rule using:

rule tss_ranges:

    mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
    att <- listAttributes(mart)

    {output} = getBM(attributes=c("chromosome_name", "transcript_start",
    "transcript_end","strand", "ensembl_gene_id","gene_biotype",
    filters ="biotype",
    values  =c("protein_coding"),
    mart    =mart)


but it does not work. The corresponding error is about "Unexpected keyword Rscript in rule definition" and even "SyntaxError".

ADD REPLYlink modified 5 months ago • written 5 months ago by r.tor40

What about something like

rule tss_ranges:
        "Rscript your_script.R"
ADD REPLYlink written 5 months ago by Jean-Karim Heriche19k
gravatar for WouterDeCoster
5 months ago by
WouterDeCoster39k wrote:

The easiest would probably be to just save your code as an Rscript (mycode.R), and in the shell part run Rscript mycode.R, optionally with arguments.

ADD COMMENTlink written 5 months ago by WouterDeCoster39k

Hi, you mean is to put all R scripts in one rule? and what about the input section of the rule?

ADD REPLYlink modified 5 months ago • written 5 months ago by r.tor40

I think you serisouly need to read & understand snakemake tutorial

ADD REPLYlink modified 5 months ago • written 5 months ago by H.Hasani760


the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]].

Input would be the file you would supply to R for loading and output is expected output.

ADD REPLYlink modified 5 months ago • written 5 months ago by cpad011211k

Which input does that code take?

ADD REPLYlink written 5 months ago by WouterDeCoster39k

just downloading the ordered regions from database!

ADD REPLYlink written 5 months ago by r.tor40
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1077 users visited in the last hour