Question

Forum: NGS RNA-Seq Analysis Pipeline TopHat vs STAR

0

3.3 years ago

mikefeixu ▴ 10

It appears STAR is 2x faster than TopHat.

Code and examples are available on Github: https://github.com/mikefeixu/RNA-Seq

STAR RNA-Seq TopHat • 2.1k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 3.3 years ago by mikefeixu ▴ 10

2

That's not really new information. STAR is a modern, well-maintained aligner, whereas TopHat is over a decade old and has not been maintained for years. I'm only surprised that it's just 2x faster; my impression was that the gap is even larger.

What is the point of that github repo?

ADD REPLY • link 3.3 years ago by lieven.sterck 15k

1

The pipeline in the GitHub repository provides code for RNA-Seq analysis from scratch, from processing raw data to visualizing the results as paper-ready heatmaps. It might be helpful for beginners who want to try TopHat and STAR in their RNA-Seq analysis.

The TopHat pipeline takes 8-12 hours per sample, while the STAR pipeline takes only 2-3 hours per sample.

ADD REPLY • link 3.3 years ago by mikefeixu ▴ 10

1

OK, and complementing the comments from Ram: please don't point beginners to TopHat in any way! It should not be used anymore; even the authors of the tool say so publicly, and it has been replaced by much more efficient and up-to-date tools.

ADD REPLY • link 3.3 years ago by lieven.sterck 15k

0

Your code would actually confuse beginners, IMO. You have a lot going on in not-so-well documented code with unnecessary steps that is neither representative of a tool's simple usage, nor a well-documented step-by-step pipeline. The repo looks like a personal code stash or one created for an assignment, not something that you built to share with anyone.

ADD REPLY • link 3.3 years ago by Ram 43k

0

Hello Lieven,

I hope all is well. I spent some time polishing the pipeline and wrote detailed user guides for the RNA-Seq pipeline. Would you mind letting me know your thoughts about it? The purpose is to provide a one-stop code repository for people who want to save time on RNA-Seq QC and analysis.

Best regards, Fei

ADD REPLY • link 3.0 years ago by mikefeixu ▴ 10

2

Hi mikefeixu, sorry if this did not turn out the way you intended, please do not feel discouraged. In general participation is welcome, but please make sure that you check first whether a post is meaningful, in this case unfortunately not because the result is expected and already published multiple times.

The GitHub repository seems to be deleted now. If that is the case, please consider deleting this thread by clicking moderate and then delete, since without the repo it has no meaning. Thank you!

ADD REPLY • link 3.3 years ago by ATpoint 82k

0

Thank you very much for the feedback. I just updated the GitHub repository and cleaned it up a bit per your feedback. Would you mind letting me know if there are any further improvements I can make?

ADD REPLY • link 3.3 years ago by mikefeixu ▴ 10

0

The link is still dead for me. Make sure it's not private.

ADD REPLY • link 3.3 years ago by ATpoint 82k

0

Would you mind trying again?

ADD REPLY • link 3.3 years ago by mikefeixu ▴ 10

0

The plus side is that this code would be useful to someone looking to replicate or debug your analysis in your team, but not outside of it. I'd rather have the code you wrote for the analysis than not have any code at all, but that isn't saying much.

Here's a tip: write comments, lots of them. In fact, instead of an R script, use R notebooks. Write your thoughts and instructions in plain text or markdown, and write the code in code blocks.

As for the SGE part, it looks like your execution uses the task ID for purposes related to the data (sample=$(awk "NR==${SGE_TASK_ID}" $sourcedir/sample_list.txt)), which is dangerous territory. You're using a coincidence as a dependency, and all hell _will_ break loose when that coincidence ceases to exist. Instead, use either a loop or parallel to generate SLURM scripts that you can subsequently run. Even better, use snakemake or a similar workflow management tool so everything is replicable.
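A minimal sketch of the loop-based alternative, assuming a sample_list.txt with one sample name per line. The temporary working directory, the jobs/ folder, the sample names, and the echo line standing in for the real alignment command are all hypothetical and not taken from the repository; scheduler directives are left out on purpose.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical working area; replace with your project directory.
workdir=$(mktemp -d)
printf 'sampleA\nsampleB\nsampleC\n' > "$workdir/sample_list.txt"
mkdir -p "$workdir/jobs"

# Write one self-contained job script per sample, with the sample name
# hard-coded inside it, so the script itself records what was run.
while IFS= read -r sample; do
    [ -n "$sample" ] || continue          # skip blank lines
    cat > "$workdir/jobs/${sample}.sh" <<EOF
#!/usr/bin/env bash
# Placeholder for the real alignment command (assumption).
echo "processing ${sample}"
EOF
done < "$workdir/sample_list.txt"

# List the generated scripts; each one can then be submitted on its own.
ls "$workdir/jobs"
```

Because each generated script names its sample explicitly, rerunning or auditing a single sample does not require looking up a line number in the sample list.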

ADD REPLY • link updated 3.3 years ago by ATpoint 82k • written 3.3 years ago by Ram 43k

0

Are you planning on sharing the data files as well (the FASTQ, sample list file, etc)? Your R code is dependent on a personal dropbox folder, which makes it sort of unusable for anyone else.

ADD REPLY • link 3.3 years ago by Ram 43k

0

Thank you so much for your patience. The FASTQ files are too big to share, so I just added the counts file instead. My purpose is to make the code independent of any project: users need to provide their own FASTQ files and sample names in sample_list.txt. It wasn't using a coincidence as a dependency. It was using the task ID from the job array as the row number to extract the sample name from sample_list.txt. Users may define their task IDs with the HPC configuration option -t (e.g. #$ -t 1-12). HeatMap.R was added back to the repository. The hard-coded directories were replaced with instructions for setting the directory.
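The row lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the sample names and temporary directory are made up, and SGE_TASK_ID is set by hand so the snippet runs outside a scheduler (inside an SGE array job submitted with #$ -t, the scheduler sets it).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample list; in a real run this is the user's sample_list.txt.
sourcedir=$(mktemp -d)
printf 'sampleA\nsampleB\nsampleC\n' > "$sourcedir/sample_list.txt"

# In an array job the scheduler provides SGE_TASK_ID; it is fixed to 2 here
# purely so the sketch can run on its own.
SGE_TASK_ID=2

# Pick the line whose number equals the task ID.
sample=$(awk -v n="$SGE_TASK_ID" 'NR == n' "$sourcedir/sample_list.txt")

# Fail loudly if the task ID points past the end of the list.
if [ -z "$sample" ]; then
    echo "no sample on line $SGE_TASK_ID" >&2
    exit 1
fi

# Record the task-to-sample mapping so it can be recovered later.
echo "task $SGE_TASK_ID -> $sample"
```

The range given to -t has to match the number of lines in sample_list.txt; the empty-result check only catches task IDs that run past the end of the list.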

ADD REPLY • link 3.3 years ago by mikefeixu ▴ 10

0

I understand that, but after the job is done, if I need to recreate the script, I'd need to look at the sample file, then the log file where you output the TASK ID, and then pick the nth line based on that. That's a lot of look-up and depends on the TASK ID being written to a log file (not an expected, normal or standard thing, just something you're doing out of foresight).

I'd much rather split the sample list into one-line files placed in appropriate directories and have the jobs be identical so they read the same name file in different dirs. This way, everything pertaining to one run is preserved in one directory.
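One way this layout could look, as a rough sketch: the runs/ directory, the sample_name.txt and run.log file names, and the run_job function are hypothetical names chosen for illustration, and the echo line stands in for the actual pipeline steps.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical project root and sample list (assumptions for illustration).
project=$(mktemp -d)
printf 'sampleA\nsampleB\nsampleC\n' > "$project/sample_list.txt"

# Give every sample its own run directory holding a one-line name file.
while IFS= read -r sample; do
    [ -n "$sample" ] || continue          # skip blank lines
    mkdir -p "$project/runs/$sample"
    printf '%s\n' "$sample" > "$project/runs/$sample/sample_name.txt"
done < "$project/sample_list.txt"

# The job is identical for every sample: it reads the name file from the
# directory it runs in and writes its log next to it.
run_job() {
    (
        cd "$1"
        sample=$(cat sample_name.txt)
        echo "processing $sample" > run.log   # placeholder for the real work
    )
}

for dir in "$project"/runs/*/; do
    run_job "$dir"
done

cat "$project/runs/sampleA/run.log"
```

With this arrangement the sample name, the log, and any outputs for one run sit together, so nothing has to be cross-referenced against a task ID afterwards.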

ADD REPLY • link 3.3 years ago by Ram 43k

0

Thank you so much for the great idea! I actually had trouble matching the log files to their samples when I had a large number of samples. I will make the corresponding improvements by separating the files into per-sample directories, so that each log filename reflects its sample name.

ADD REPLY • link 3.3 years ago by mikefeixu ▴ 10

1

Indeed. Never let computational limitations get in the way of organizing your work more efficiently. The more we search, the better ways we find.

ADD REPLY • link 3.3 years ago by Ram 43k

0

I saw something today that reminded me of your code: The NASA RNAseq pipeline scripts are entangled with their HPC the same way yours is. I guess you can say your code is NASA worthy :-D

https://github.com/nasa/GeneLab_Data_Processing/blob/master/RNAseq/GLDS_Processing_Scripts/GLDS-120/01-TG_Preproc/trim-galore.slurm

ADD REPLY • link 3.3 years ago by Ram 43k

0

Hi, I've changed this post type to Blog, as you're not introducing a new tool but benchmarking existing ones.

ADD REPLY • link 3.3 years ago by Ram 43k

0

Forum is the more appropriate type. Conceptually, blog posts are the content in the PLANET blog aggregator, and using the type in both contexts may end up being confusing. The Blog type should not be selectable in the dropdown; it is just an oversight.

ADD REPLY • link 3.3 years ago by Istvan Albert 100k

0

I see. Can you also edit my how-to post and add this information there Istvan? I believe the post is inaccurate in how it describes Blogs. How to Use Biostars, Part II: Post types, Deleting, (Un)Subscribing, Linking and Bookmarking

ADD REPLY • link 3.3 years ago by Ram 43k

0

I think it is fine, I don't want to mess with the nice guide there. A minor oversight at most.

ADD REPLY • link 3.3 years ago by Istvan Albert 100k