Question

Snakemake and (external) jupyter notebooks

2.1 years ago

iibrams07 ▴ 10

I want to integrate snakemake with jupyter notebooks. More precisely, I have a jupyter notebook in which I am doing a machine learning project in python. The output of this notebook is a list of genes, which is assigned to a named object. I then use this list in another jupyter notebook, which this time runs in R, to do some exploratory data analysis as well as an RNA-seq analysis in which I want to include the list of genes from the first notebook. I might have a third jupyter notebook that needs some objects from the first two notebooks.

I want to use snakemake to have a formal workflow. Going through the official snakemake tutorial, I cannot see how to perform these tasks. Only one short section of the tutorial states that one can generate a jupyter notebook inside the snakemake environment (internal jupyter notebook). It does not explore further how such a workflow would be organized in terms of rules. Here are some questions:

I first tried to use Gitpod. But then I realized that one cannot import a jupyter notebook unless it comes from a publicly available platform such as GitHub or similar. Should one run snakemake on a local machine?
Is the jupyter notebook that snakemake claims to generate inside its environment as powerful as the one provided by gitlab, anaconda or google colab? Is there a way to transport the content of my jupyter notebooks (external notebooks) into such an integrated jupyter notebook inside snakemake, or do I have to transfer the code cell by cell into the integrated notebook?
Is it possible to use R inside such an integrated jupyter notebook (internal notebook) in snakemake or rather only python?
In case this is not possible, is it possible to somehow extract the objects that I need from the external jupyter notebooks into the snakemake environment by means of snakemake rules, and thus connect snakemake to external notebooks in organizing the workflow pipeline?
What would be the typical rules in snakemake to use the external or internal notebooks in terms of input, output and shell commands?

Many thanks for any comment

Snakemake Workflow-management • 1.9k views

updated 16 months ago by Ram 43k • written 2.1 years ago by iibrams07 ▴ 10

I think you should look at the illustrated integration with Jupyter notebook as just one possible way to combine snakemake and Jupyter, much as the documentation offers one way to integrate an external Python script. If you read through that, you'll realize that by following it you'd create a script you can only use with snakemake. That probably isn't how you'd actually develop a script, or how you'd normally use it to process one example of your data. You'd probably develop a script to work on one example, maybe providing the data in a data file that you point the script at. Only later would you use snakemake to scale up its use. The Data Carpentries tutorial uses Python scripts that way. Note that those scripts are called as you'd call them on the command line, with the input passed in as arguments, not in the more complex way shown in that particular section of the snakemake documentation.
Unfortunately, that Data Carpentries example doesn't include executing a notebook at this time. However, you can run them from the command line using the approach @Jean-Karim Heriche suggests. Jupytext also works on the command line to execute notebooks using the kernel specified inside the notebook when you made it and saved it.
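To make the command-line route concrete, here is a minimal sketch of a helper that assembles a `jupyter nbconvert --execute` call. The notebook path `notebooks/select_genes.ipynb`, the `results` directory, and the function name are hypothetical; the flags (`--to`, `--execute`, `--output`, `--output-dir`) are standard nbconvert options.

```python
import shlex
from pathlib import Path


def nbconvert_command(notebook, out_dir="results"):
    """Build the argument list for executing a notebook with nbconvert.

    The executed copy is written to out_dir, so the original notebook
    is left untouched. Paths and names here are illustrative only.
    """
    notebook = Path(notebook)
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",         # keep the result as an .ipynb file
        "--execute",                # run all cells with the notebook's kernel
        "--output", notebook.stem,  # base name of the executed copy
        "--output-dir", str(out_dir),
        str(notebook),
    ]


# Hypothetical notebook path, used only to show the resulting command.
cmd = nbconvert_command("notebooks/select_genes.ipynb")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

In a snakemake rule, the same command line would typically sit in the `shell` directive, with the notebook as `input` and the executed copy (plus any files the notebook writes) as `output`.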
And if you were building a notebook interactively in the course of a snakemake pipeline, I'd suggest using nbformat to do it programmatically. (For more on nbformat, I'd suggest going to the Jupyter Discourse Forum, where I have several examples of using nbformat.) That will serve you in more places because it works with or without snakemake. Then you could have another rule that executes the generated notebook as discussed in the previous paragraph.

One thing from your outline that I'm not quite following is how the bridging works between your notebooks. Maybe because right now it doesn't, and you were hoping to use snakemake to fill that in? Let's just talk about the first one for now, since after that your pipeline gets vague. Does the list of genes your first notebook makes exist only in the notebook? I'd suggest that at the end of your first notebook you also save that list to a separate text file, one gene per line, under a name that your next rule would use as its input. However, if you didn't want to do that, to keep things cleaner, nbformat can also be used to collect the output from cells of a previously run notebook programmatically. You could use nbformat to parse out that list from that particular cell. Snakemake rules allow Python code right in them in a run block, so you can do that step inside a rule in your snakefile.
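Both options can be sketched in a few lines. The example below is only an illustration under stated assumptions: it reads the executed notebook's JSON with the standard library instead of nbformat (nbformat exposes the same `cells` / `outputs` structure and validates it as well), it assumes the gene list was printed one gene per line by the last code cell, and the file names and the three genes are made-up toy data.

```python
import json
import tempfile
from pathlib import Path


def stdout_of_cell(notebook_path, cell_index=-1):
    """Return the text one cell of an executed notebook printed to stdout."""
    nb = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    cell = nb["cells"][cell_index]
    chunks = []
    for output in cell.get("outputs", []):
        if output.get("output_type") == "stream" and output.get("name") == "stdout":
            text = output["text"]
            # "text" may be stored as one string or as a list of lines
            chunks.append(text if isinstance(text, str) else "".join(text))
    return "".join(chunks)


def genes_from_cell(notebook_path, cell_index=-1):
    """Parse one gene name per non-empty line of the cell's printed output."""
    lines = stdout_of_cell(notebook_path, cell_index).splitlines()
    return [line.strip() for line in lines if line.strip()]


# Toy stand-in for an already executed notebook (hypothetical content).
toy_notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": "for gene in selected_genes:\n    print(gene)",
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stdout",
                    "text": ["BRCA1\n", "TP53\n", "MYC\n"],
                }
            ],
        }
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    nb_path = Path(tmp) / "select_genes.ipynb"
    nb_path.write_text(json.dumps(toy_notebook), encoding="utf-8")

    genes = genes_from_cell(nb_path)

    # The simpler bridge: one gene per line in a plain text file.
    gene_file = Path(tmp) / "genes.txt"
    gene_file.write_text("\n".join(genes) + "\n", encoding="utf-8")
    written = gene_file.read_text(encoding="utf-8")

print(genes)
```

A plain text file like this is easy to pick up on the R side, for example with `readLines("genes.txt")`, and it gives snakemake a concrete file to track between the two notebook rules.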

2.1 years ago by Wayne ★ 2.0k

Can't you just run the notebooks as scripts (e.g. maybe with jupyter nbconvert --execute <notebook>) and use them as steps of your workflow?

2.1 years ago by Jean-Karim Heriche 27k