In short: I developed a GUI program to help set up and run RNA-Sequencing analysis via Kallisto and DESeq2, starting from fastq files that are either downloaded by following directions from the program or provided by the user. The program generates customized Bash and R scripts that automate the process (based on your own file structure), with output including transcript-level abundances, a normalized counts matrix, a list of up- and down-regulated genes, hierarchical clustering heatmaps, PCA plots, and bimodality analysis via the SIBER and DEXUS packages. Check out my GitHub repo and follow the instructions in the ReadMe. Please see an example of the pipeline output here.
And for a little more background...
Without knowledge of the Bash and R programming languages, newcomers can find it confusing to set up the proper file structure and to run the commands that bioinformatics packages such as Kallisto and DESeq2 require.
I set out to help those who are new to the RNA-Sequencing world by creating an application that automates the file structure creation and command execution needed for downloading fastq files via the SRA toolkit, pseudoalignment and quantification via Kallisto, and differential expression analysis via DESeq2. The user only needs a path to their working directory, the sample names (and NCBI accession numbers if downloading fastq files), and whether the reads are paired or single end. The user is provided with a set of detailed instructions and customized code snippets that allow for easy execution of the required commands in both Bash and R. In the end, a pipeline for RNA-Sequencing analysis is provided to produce the output described above.
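As a rough illustration of how those few inputs can be turned into commands, here is a minimal Python sketch. It is not the program's actual code: the function name `build_commands`, the `fastq/` and `kallisto_output/` folder names, the index file name, and the fragment length values for single-end reads are all assumptions made for this example. Only the `fastq-dump` and `kallisto quant` options shown are real command-line flags.

```python
from pathlib import PurePosixPath


def build_commands(work_dir, samples, paired, index_name="transcripts.idx"):
    """Return shell command strings to download and quantify each sample.

    `samples` maps a sample name to its SRA run accession.
    The folder layout and `index_name` are hypothetical.
    """
    work = PurePosixPath(work_dir)
    fastq_dir = work / "fastq"
    out_dir = work / "kallisto_output"
    index = work / index_name
    commands = []
    for name, accession in samples.items():
        # Download step: --split-files writes <accession>_1.fastq and _2.fastq.
        dump = ["fastq-dump", "--outdir", str(fastq_dir)]
        if paired:
            dump.append("--split-files")
        dump.append(accession)
        commands.append(" ".join(dump))

        # Quantification step: one output folder per sample.
        quant = ["kallisto", "quant", "-i", str(index), "-o", str(out_dir / name)]
        if paired:
            quant += [
                str(fastq_dir / f"{accession}_1.fastq"),
                str(fastq_dir / f"{accession}_2.fastq"),
            ]
        else:
            # Single-end mode requires a fragment length mean (-l) and
            # standard deviation (-s); 200 and 20 are placeholder values.
            quant += ["--single", "-l", "200", "-s", "20",
                      str(fastq_dir / f"{accession}.fastq")]
        commands.append(" ".join(quant))
    return commands


if __name__ == "__main__":
    # "SRR0000001" is a placeholder accession, not a real run.
    for line in build_commands("/data/rnaseq", {"ctrl_1": "SRR0000001"}, paired=True):
        print(line)
```

Printing the commands instead of running them keeps each step visible, so it can be inspected and edited before execution.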
To begin, visit https://github.com/anthony-knox/rna-sequencing-pipeline-generator and proceed with the directions in the ReadMe. If you have any questions, don't hesitate to ask in this thread or by contacting me via my Github profile.
UCSF, Department of Pathology
Thank you for putting this together, Anthony.
I agree it is useful to help provide autonomy for scientists with limited coding experience. I think it is also helpful to provide paths for people to gradually become more familiar with open-source software (which I think is part of what you are trying to achieve).
However, while I don’t want to be overly negative, I thought it might be necessary to leave a comment mentioning that I have kind of shifted away from the “pipeline” idea (even as someone who developed COHCAP, with "pipeline" in the name). While I am still trying to figure out exactly what to recommend that people do for their analysis support / training (and I still think there are functions in the COHCAP Bioconductor package that are useful for DNA methylation projects), I am concerned that people can underestimate the amount of time needed for analysis (and a substantial amount of time should be expected to critically assess datasets).
For example, if I use templates to help with analysis, I have to go into the code and modify something for pretty much every project (and, at least for me, this pretty much has to be code that you wrote yourself, in order to be able to do this for any given step of the code).
In your example documentation (which I think is good to provide), it seems like this may be more like a “template” than a “pipeline” (since there are chunks of R code). So, maybe this isn’t a huge deal. However, I apologize if I am overlooking something important (for example, I was able to successfully run the GUI in the .jar file, but I didn't run it with test samples).
However, I do think people need to test different methods for every project. So, while you provide useful documentation about this set of steps for analysis, I think they should do things like test edgeR/limma-voom/DESeq2 gene lists for every project (at least for a few comparisons). Also, I think having alignments to visualize is an important option to have for the troubleshooting process.
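One simple way to compare gene lists across methods is to measure how much they overlap. The following is only a sketch under stated assumptions: it expects each method's significant genes to have already been exported as a list of gene IDs, and the function name `compare_gene_lists` and the gene IDs in the usage example are made-up placeholders.

```python
from itertools import combinations


def compare_gene_lists(gene_lists):
    """Return pairwise overlap statistics for {method name: iterable of gene IDs}."""
    sets = {method: set(genes) for method, genes in gene_lists.items()}
    report = {}
    for a, b in combinations(sorted(sets), 2):
        shared = sets[a] & sets[b]
        union = sets[a] | sets[b]
        # Jaccard index: shared genes divided by all genes called by either method.
        jaccard = len(shared) / len(union) if union else 0.0
        report[(a, b)] = {
            "shared": len(shared),
            "only_first": len(sets[a] - sets[b]),
            "only_second": len(sets[b] - sets[a]),
            "jaccard": jaccard,
        }
    return report


if __name__ == "__main__":
    # Placeholder gene IDs for illustration only.
    example = {
        "DESeq2": ["G1", "G2", "G3"],
        "edgeR": ["G2", "G3", "G4"],
        "limma-voom": ["G3", "G4", "G5"],
    }
    for pair, stats in compare_gene_lists(example).items():
        print(pair, stats)
```

A low overlap does not say which method is right, but it flags comparisons that deserve a closer look.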
Nevertheless, thank you very much for your contribution. I apologize for the long comment, and I hope everything goes well for your research projects!
I completely agree with your stance on pipelines and the necessity for people to really delve into their own specific needs for their own specific project. While I understand that the downside of strictly following a given pipeline is that you are confined in terms of setting parameters and producing results, I believe that my program allows beginners a place to start with their analysis.
My intention was to provide an example of the necessary file structure as well as snippets of code in Bash and R that work with said file structure. When I was just beginning to delve into this area of research myself, it took a long time for me to familiarize myself with the coding requirements of these bioinformatics packages. I agree that it is necessary to learn these languages in more depth, but I think it is easier to learn in the context of bioinformatics analysis with a concrete example of code that works with your own local file structure. Additionally, I wanted to provide the resources and references needed to follow the Kallisto and DESeq2 vignettes, such as specifically where and how to download a reference transcriptome. I hope that some of the Bash scripts that I provide will also be usable as a template for other related tasks.
My ultimate goal was to help other beginners get their feet wet so that they could modify my pipeline in whatever way best fits their own specific needs. And I understand that Kallisto and DESeq2 may not be the best avenue for a given project, but after gaining experience with one method of analysis, I hope it would be easier to begin exploring other methods, like you mentioned.
Thank you very much for your comments - I think you bring up an important perspective about the pros and cons of using pipelines and templates. I wish you the best with your projects as well!
Great - I think we are pretty much on the same page.
Would it be possible to create a smaller RNA-Seq demo dataset? For example, perhaps the MiSeq samples from SRP012607 (maybe even smaller, if possible)?
In other words, I have started some testing with the following set of samples:
However, it looks like the process is semi-automated (so, some files/folders are generated, but you need the extra information in the example to run the appropriate commands, which you have to execute separately from the program).
So, I think this is an interesting way to help get things configured, but I do think users probably should expect some non-trivial amount of time to get everything set up (and understand how the steps are all connected).
Thank you again!
Not directly related to the present thread, but @Wouter created a similar framework called DEA.R, in case you are interested in such pipelines. Note: It is not currently under active maintenance.
Thank you for the reference!
I apologize for potentially being a bad example - I was adding a comment because I didn't want to over-emphasize my points/questions (since only the last part really related to using the code, and I didn't know I would provide that input in advance).
However, to make your question clearer, I think it really should be posted as an "answer" (even though it really is a question); that should help in getting responses / comments on your specific point.
Thank you very much for the suggestion - I was looking for a dataset that would be quicker to download for demonstration purposes. I am running through the program with this dataset right now and will most likely update the ReadMe to include it as a demo. But you are correct, the fastq-dump step is the most time-consuming part, so setup takes some time - I will give an estimate of how long this dataset takes to download so the user has an idea.
As for the example pipeline output that I provide, it is less useful as a standalone since throughout the progression of the program, the pipeline output file has instructions appended to it periodically, files are created and moved around, and the file structure changes based on the action of pressing buttons within the program as well as executing shell scripts. So the user will have to run through the entirety of the program on their own and follow instructions one by one. Hopefully the example pipeline would give a better idea of the overall purpose of the program though.
What is the reference genome that you use for this dataset? I tried Kallisto alignment with both Human GRCh37 and GRCh38 (from Ensembl cDNA) and I am getting read counts of 0 for all transcripts.