Forum: NGS RNA-Seq Analysis Pipeline Tophat vs STAR
mikefeixu 10 wrote:

It appears STAR is 2X faster than Tophat.

Code and examples are available on Github: https://github.com/mikefeixu/RNA-Seq

rna-seq tool forum
modified 7 days ago by Istvan Albert ♦♦ 86k • written 7 days ago by mikefeixu 10

That's not really new info. STAR is a modern and well-maintained aligner; TopHat is over a decade old and has not been maintained for years. I'm only surprised it's only 2x faster (sic); I have the impression it's even faster than that.

What is the point of that github repo?

modified 7 days ago • written 7 days ago by lieven.sterck 9.4k

The pipeline in the GitHub repository gives the code for an RNA-Seq analysis from scratch, from processing the raw data to visualizing the results as paper-ready heatmaps. It might be helpful for beginners who want to try TopHat and STAR in their RNA-Seq analysis.

The TopHat pipeline takes 8 to 12 hours per sample, while the STAR pipeline takes only 2 to 3 hours per sample.

written 7 days ago by mikefeixu 10

OK, and complementing the comments from _r_am: please don't point beginners to TopHat in any way! (It should not be used anymore; even the authors of the tool say so publicly, and it has been replaced by much more efficient and up-to-date tools.)

written 7 days ago by lieven.sterck 9.4k

Your code would actually confuse beginners, IMO. You have a lot going on in not-so-well documented code with unnecessary steps that is neither representative of a tool's simple usage, nor a well-documented step-by-step pipeline. The repo looks like a personal code stash or one created for an assignment, not something that you built to share with anyone.

written 7 days ago by _r_am 32k

Hi mikefeixu, sorry if this did not turn out the way you intended; please do not feel discouraged. In general, participation is welcome, but please check first whether a post is meaningful. In this case, unfortunately, it is not, because the result is expected and has already been published multiple times.

The GitHub repository seems to have been deleted. If that is the case, please consider deleting this thread as well by clicking moderate and then delete, since without the repo it has no meaning. Thank you!

modified 7 days ago • written 7 days ago by ATpoint 44k

Thank you very much for the feedback. I just updated the GitHub repository and cleaned it up a bit per your feedback. Would you mind letting me know if there are any other improvements I can make?

modified 7 days ago • written 7 days ago by mikefeixu 10

The link is still dead for me. Make sure it's not private.

modified 7 days ago • written 7 days ago by ATpoint 44k

Would you mind trying again?

written 7 days ago by mikefeixu 10

The plus side is that this code would be useful to someone looking to replicate or debug your analysis in your team, but not outside of it. I'd rather have the code you wrote for the analysis than not have any code at all, but that isn't saying much.

Here's a tip: write comments, lots of them. In fact, instead of an R script, use R notebooks. Write your thoughts and instructions in plain text or markdown, and write the code in code blocks.

As for the SGE part, it looks like your execution uses the TASK ID for purposes related to the data (sample=$(awk "NR==${SGE_TASK_ID}" $sourcedir/sample_list.txt)), which is dangerous territory. You're using a coincidence as a dependency, and all hell _will_ break loose when that coincidence ceases to exist. Instead, use either a loop or parallel to generate SLURM scripts that you can then run. Even better, use snakemake or a similar workflow management tool so everything is replicable.
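
A minimal sketch of the loop-based alternative, for illustration only. The temporary directory, the sample names, the jobs/ folder and the align_<sample>.sh naming are all assumptions made up for this example, and the echo line is a placeholder for the real alignment command. Submitting the generated scripts to the scheduler is left out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical layout: a throwaway directory stands in for $sourcedir.
sourcedir=$(mktemp -d)
printf 'sampleA\nsampleB\nsampleC\n' > "$sourcedir/sample_list.txt"

mkdir -p "$sourcedir/jobs"

# Write one small job script per sample with the sample name baked in,
# so no job depends on its position in the list or on a task ID.
while IFS= read -r sample; do
    [ -n "$sample" ] || continue          # skip blank lines
    cat > "$sourcedir/jobs/align_${sample}.sh" <<EOF
#!/usr/bin/env bash
# Generated job script for sample: ${sample}
sample="${sample}"
echo "aligning \${sample}"   # placeholder for the real alignment command
EOF
    chmod +x "$sourcedir/jobs/align_${sample}.sh"
done < "$sourcedir/sample_list.txt"

# List the generated scripts, one per sample.
ls "$sourcedir/jobs"
```

Because each generated script names its sample explicitly, rerunning or inspecting a single sample later does not require cross-referencing the list with a log.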

modified 7 days ago by ATpoint 44k • written 7 days ago by _r_am 32k

Are you planning on sharing the data files as well (the FASTQ files, the sample list file, etc.)? Your R code depends on a personal Dropbox folder, which makes it more or less unusable for anyone else.

modified 7 days ago • written 7 days ago by _r_am 32k

Thank you so much for your patience. The FASTQ files are too big to share, so I just added the counts file instead. My purpose is to make the code independent of any project: users need to provide their own FASTQ files and list their sample names in sample_list.txt. It wasn't using a coincidence as a dependency; it was using the task ID from the job array as the row number to extract the sample name from sample_list.txt. Users may define their task IDs with the HPC configuration option -t (e.g. #$ -t 1-12). HeatMap.R was added back to the repository, and the hard-coded directories were replaced with instructions for setting the directory.
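
The row-number lookup described here can be sketched roughly as follows. This is an illustration under assumptions, not the repository's actual script: the temporary directory and sample names are invented, the default task ID of 2 exists only so the sketch runs outside a scheduler, and the empty-result check and the final log line are additions that are not part of the original lookup.

```shell
#!/usr/bin/env bash
#$ -t 1-3
set -euo pipefail

# Stand-in for the real project directory (hypothetical demo data).
sourcedir=$(mktemp -d)
printf 'sampleA\nsampleB\nsampleC\n' > "$sourcedir/sample_list.txt"

# SGE sets SGE_TASK_ID inside an array job; default to 2 here so the
# sketch also runs outside the scheduler (an assumption for the demo).
task_id=${SGE_TASK_ID:-2}

# Row number == task ID: pick the matching line of the sample list.
sample=$(awk -v n="$task_id" 'NR == n' "$sourcedir/sample_list.txt")

# Fail loudly if the -t range and the list have drifted apart.
if [ -z "$sample" ]; then
    echo "no sample on line $task_id of sample_list.txt" >&2
    exit 1
fi

# Record the mapping so a log can be traced back to its sample later.
echo "task $task_id -> sample $sample"
```

The guard matters when the -t range is longer than the list: without it, the extra tasks would continue with an empty sample name.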

modified 7 days ago • written 7 days ago by mikefeixu 10

I understand that, but after the job is done, if I need to recreate the script, I'd need to look at the sample file, then the log file where you output the TASK ID, and then pick the nth line based on that. That's a lot of look-up and depends on the TASK ID being written to a log file (not an expected, normal or standard thing, just something you're doing out of foresight).

I'd much rather split the sample list into one-line files placed in appropriate directories and have the jobs be identical so they read the same name file in different dirs. This way, everything pertaining to one run is preserved in one directory.
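
A rough sketch of that layout, under stated assumptions: the temporary directory, the runs/ folder, the sample_name.txt file name and the run_job function are all hypothetical names chosen for this example, and the echo line stands in for the real per-sample work.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo data in a throwaway directory.
workdir=$(mktemp -d)
printf 'sampleA\nsampleB\nsampleC\n' > "$workdir/sample_list.txt"

# One directory per sample, each holding a one-line sample_name.txt.
while IFS= read -r sample; do
    [ -n "$sample" ] || continue          # skip blank lines
    mkdir -p "$workdir/runs/$sample"
    printf '%s\n' "$sample" > "$workdir/runs/$sample/sample_name.txt"
done < "$workdir/sample_list.txt"

# The job body is identical everywhere: it reads the same file name from
# whatever directory it is started in and writes its log next to it.
run_job() {
    local sample
    sample=$(cat sample_name.txt)
    echo "processing $sample" > "${sample}.log"   # placeholder for real work
}

# Run the same job once inside each sample directory.
for dir in "$workdir"/runs/*/; do
    ( cd "$dir" && run_job )
done

# Show the per-sample logs that were written.
cat "$workdir"/runs/*/*.log
```

Everything that belongs to one sample (its name file, its log and, in a real run, its outputs) ends up in the same directory, so nothing has to be matched up by line number afterwards.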

written 7 days ago by _r_am 32k

Thank you so much for the great idea! I actually had trouble matching the log files to their samples when I had a large number of samples. I will make the corresponding improvement by separating the files into per-sample directories, so that each log filename reflects its sample name.

written 7 days ago by mikefeixu 10

Indeed. Never let computational limitations get in the way of organizing your work more efficiently. The more we search, the better ways we find.

written 7 days ago by _r_am 32k

Hi, I've changed this post type to Blog, as you're not introducing a new tool but benchmarking existing ones.

written 7 days ago by _r_am 32k

Forum is the more appropriate type. Conceptually, blog posts are the content in the PLANET blog aggregator, and using them in both contexts may end up being confusing. The Blog type should not be selectable in the dropdown; that is just an oversight.

written 7 days ago by Istvan Albert ♦♦ 86k

I see. Can you also edit my how-to post and add this information there, Istvan? I believe the post is inaccurate in how it describes Blogs: How to Use Biostars, Part II: Post types, Deleting, (Un)Subscribing, Linking and Bookmarking

written 7 days ago by _r_am 32k

I think it is fine, I don't want to mess with the nice guide there. A minor oversight at most.

written 7 days ago by Istvan Albert ♦♦ 86k