Hi folks,
I have been working on sequencing analysis for the past 2 years. We started with TopHat2 paired with Cufflinks, and it always worked fine, except that as we got more and more samples, TopHat2 became too slow at mapping. So we tried switching to STAR paired with Cufflinks. We like the FPKM normalization done by Cufflinks; we have tried Salmon, but its TPM values sometimes seem inaccurate (a gene with very few mapped reads can come out at around 200 TPM), so we decided to stick with Cufflinks. However, here is the problem:
Initially we ran Cufflinks on alignments from STAR with its usual settings, and Cufflinks took forever for a simple RNA-seq sample (20-30 million reads); in fact it still had not finished after a week. So I searched online, and people reported improvements in Cufflinks speed after adding the following options to the STAR command:
--outFilterIntronMotifs RemoveNoncanonical
--outSAMstrandField intronMotif
--outFilterScoreMinOverLread 0.3
--outFilterMatchNminOverLread 0.3
And this does solve part of the problem: I can now finish Cufflinks for a 30 million read sample in 7 hours (on an i7-6700 running 6 threads, or a 2.2 GHz Xeon running 14 threads; the two actually take about the same time).
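For readers who want to see where these four options sit in a full STAR call, here is a minimal sketch that assembles the command line in Python and prints it without running anything. The index directory, FASTQ names and output prefix are placeholders invented for this example, and the remaining flags (`--runThreadN`, `--genomeDir`, `--readFilesIn`, `--outFileNamePrefix`, `--outSAMtype`) are ordinary STAR parameters added as an assumption about a typical run, not something stated in the post. Note that 0.3 is looser than STAR's default of 0.66 for both `OverLread` filters, so shorter alignments are kept.

```python
import shlex

def star_command(genome_dir, fastqs, out_prefix, threads=6):
    """Build a STAR command that includes the four options listed above.

    genome_dir, fastqs and out_prefix are hypothetical placeholders.
    """
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastqs,                    # one file, or R1 and R2
        "--outFileNamePrefix", out_prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        # Options quoted in the post:
        "--outFilterIntronMotifs", "RemoveNoncanonical",
        "--outSAMstrandField", "intronMotif",        # adds the XS strand tag
        "--outFilterScoreMinOverLread", "0.3",
        "--outFilterMatchNminOverLread", "0.3",
    ]

star_cmd = star_command(
    "star_index/", ["sample_R1.fastq", "sample_R2.fastq"], "sample_"
)
print(shlex.join(star_cmd))  # dry run: shows the command only
```

To actually launch the alignment, the list could be passed to `subprocess.run(star_cmd, check=True)`.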
The problem now is that with deeper sequencing, paired-end sequencing, etc., Cufflinks again takes forever to finish (the human sample I am running now took about 3 days, i.e. 72 hours for 1 sample at 16 threads), and we can't afford that time any more (I have 50+ samples waiting...).
Are there any suggestions for speeding up the STAR + Cufflinks pipeline? Thanks!
Long story short, my suggestion for speeding up the STAR-Cufflinks pipeline is to get rid of Cufflinks and use StringTie :)
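As a rough illustration of that swap, the sketch below builds a StringTie call for one coordinate-sorted BAM from STAR and prints it as a dry run. The file names are placeholders made up for this example, and the choice of `-e` (quantify only the transcripts in the annotation given with `-G`, without assembling new ones) is an assumption about what is wanted; drop it if novel transcripts matter. StringTie reports FPKM as well as TPM, so the FPKM values preferred in the question remain available. For unstranded libraries StringTie relies on the XS strand tag on spliced reads, which is what STAR's `--outSAMstrandField intronMotif` provides.

```python
import shlex

def stringtie_command(bam, annotation_gtf, out_gtf, gene_table, threads=8):
    """Build a StringTie quantification command for one sorted BAM.

    All file names passed in are hypothetical placeholders.
    """
    return [
        "stringtie", bam,
        "-G", annotation_gtf,  # reference annotation (GTF/GFF)
        "-e",                  # only estimate transcripts present in -G
        "-p", str(threads),    # number of threads
        "-o", out_gtf,         # per-transcript output GTF
        "-A", gene_table,      # per-gene abundance table
    ]

stringtie_cmd = stringtie_command(
    "sample_Aligned.sortedByCoord.out.bam",
    "annotation.gtf",
    "sample_stringtie.gtf",
    "sample_gene_abund.tab",
)
print(shlex.join(stringtie_cmd))  # dry run: shows the command only
```

With 50+ samples, the same function could be called in a loop over the BAM files and each command handed to `subprocess.run(..., check=True)`.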