I wonder if anyone can give me some advice on a potential pipeline for differential transcript-level expression analysis and novel isoform discovery. The pilot project I am working on involves analysis of mRNA-seq data derived from a human cell line which has been transfected with various plasmids. Consequently, we are interested not only in how transfection affects transcription from the cell genome but also in transcripts derived from the transfected plasmids.
We are outsourcing the analysis to a company which is using the old Tuxedo protocol (or a modification of it), but I would like to develop an in-house pipeline. I have installed HISAT2-StringTie-Ballgown, as I understand this may be quicker than the old Tuxedo suite (my samples have been sequenced to a very deep read depth of 120M reads).
However, I would need to build a custom genome index incorporating the sequences of all the plasmids (supplied as .gtf files) and the hg19 human reference. From my reading, this may be too memory-intensive for the machine I am currently using (a single machine with an i7-4790 CPU and 32 GB RAM) if I use HISAT2.
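For reference, building a combined index might look roughly like the sketch below. It assumes the plasmid sequences are also available as FASTA files (`hisat2-build` takes FASTA input, whereas GTF files carry annotation rather than sequence). All file names are placeholders and the helper names `combine_fasta` and `hisat2_build_cmd` are hypothetical. The script only assembles the command line and does not run it.

```python
from pathlib import Path


def combine_fasta(genome_fa, plasmid_fas, out_fa):
    """Concatenate the genome FASTA and plasmid FASTA files into one reference.

    Lines are streamed one at a time, so the genome is never held in memory.
    """
    with open(out_fa, "w") as out:
        for fasta in [genome_fa, *plasmid_fas]:
            with open(fasta) as src:
                for line in src:
                    # Normalise line endings so a file lacking a trailing
                    # newline cannot merge into the next file's header.
                    out.write(line.rstrip("\r\n") + "\n")
    return Path(out_fa)


def hisat2_build_cmd(reference_fa, index_base, threads=4):
    """Return the hisat2-build command as an argument list (not executed)."""
    return ["hisat2-build", "-p", str(threads), str(reference_fa), str(index_base)]


if __name__ == "__main__":
    # Placeholder file names (assumptions); substitute the real paths.
    cmd = hisat2_build_cmd("hg19_plus_plasmids.fa", "hg19_plus_plasmids")
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the paths are real. Plasmid sequence names in the combined FASTA would need to match the names used in the plasmid annotation.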
I was thinking that I might try a strategy where I first build an index for just the plasmids (which should be much less memory-intensive) and map the reads against it (as this is the primary aim of the project), then perform a second mapping (using the unmapped reads from the first run) against the hg19 reference index (which I can download).
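A minimal sketch of this two-pass idea follows, assuming paired-end reads and prebuilt indexes. The function name `two_pass_commands`, the output naming scheme and all paths are hypothetical. The HISAT2 options used are `-x`, `-1`/`-2`, `-S`, `-p`, `--dta` (alignments tailored for transcript assemblers such as StringTie) and `--un-conc-gz` (writes pairs that fail to align concordantly, with `%` in the path replaced by 1 or 2). The commands are only constructed, not executed.

```python
def two_pass_commands(reads_1, reads_2, plasmid_index, genome_index,
                      out_prefix, threads=8):
    """Build HISAT2 commands for plasmid-first, genome-second mapping.

    Returns (pass1, pass2) as argument lists; nothing is run here.
    """
    # HISAT2 replaces '%' with 1 or 2 for the two mates.
    unmapped = f"{out_prefix}.plasmid_unmapped.%.fastq.gz"

    # Pass 1: map all reads against the small plasmid index and keep the
    # pairs that do not align concordantly to any plasmid.
    pass1 = [
        "hisat2", "-p", str(threads), "--dta",
        "-x", plasmid_index,
        "-1", reads_1, "-2", reads_2,
        "--un-conc-gz", unmapped,
        "-S", f"{out_prefix}.plasmid.sam",
    ]

    # Pass 2: map only the leftover pairs against the hg19 index.
    pass2 = [
        "hisat2", "-p", str(threads), "--dta",
        "-x", genome_index,
        "-1", unmapped.replace("%", "1"),
        "-2", unmapped.replace("%", "2"),
        "-S", f"{out_prefix}.hg19.sam",
    ]
    return pass1, pass2


if __name__ == "__main__":
    # Placeholder sample and index names (assumptions).
    first, second = two_pass_commands(
        "sample_R1.fastq.gz", "sample_R2.fastq.gz",
        "plasmids_index", "hg19_index", "sample",
    )
    print(" ".join(first))
    print(" ".join(second))
```

One caveat with this ordering: if a plasmid carries human-derived sequence (for example a cDNA insert), reads from the endogenous gene may align to the plasmid in the first pass and never reach the genome mapping, which could distort expression estimates for that gene.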
Is this a feasible strategy, and does anyone have any recommendations or suggestions? Or am I just going to have to bite the bullet and request a much more powerful machine for this sort of analysis? Also, how important is correct annotation for HISAT2 when it is run for isoform discovery?