I'm evaluating some options for novel isoform detection and I'm looking for some advice/experience on settings. I finally produced the "spiked" results I expected, but I'm concerned about an explosion of memory, etc. when I run it on a full-sized experiment.
TL;DR: what are typical settings for Trinity on RNA-seq of PE75 or PE100? I had to use
--min_contig_length=40 to get Trinity to find a contig that I know exists for simulated PE100 reads.
My setup was--
I randomly selected a handful of gene name/IDs. I then get the sequence of those genes and some intergenic regions to place between them, creating a mock genome. For each of those genes, I create a set of transcripts by splicing the various exons in the appropriate order (e.g. no circRNA). I also create TWO GTF files. One is the full GTF, including all transcripts. The other removes a transcript corresponding to a skipped exon (in my example the gene has 3 exons, and the "spiked" mature transcript would be exon 1 spliced to exon 3, skipping exon 2).
Using BioC's Polyester, I simulated some stranded, paired-end reads. To do the simulation, I use the full GTF and transcript fasta from step 1. This creates a pair of fasta files. I can see that there are ~300 reads that are split across the junction (based on CIGAR strings from a STAR align AND an IGV sashimi plot).
I first tried by TransAbyss since it has the companion PAVFinder tool which will align the assembled contigs to the genome, etc. and report potentially novel splicings. As part of TransAbyss/PAVFinder, you supply a GTF of known annotations along with the sequencing reads. Unfortunately, this produced no results. This was after merging assemblies produced for kmers of 22,24,26,28,30,32.
When I look at the assembled contigs produced by TransAbyss, it produces a perfect match for the full transcript (exon1:exon2:exon3), but does NOT generate any contigs for the skipped exon1:exon3 form. If I manually add a sequence spanning the exon1:exon3 junction to the FASTA, PAVFinder correctly identifies the "novel" isoform.
At this point I figured the problem lies in the assembly itself. I then decided to use Trinity since it has the
--min_contig_length parameter (which naively seemed to be the way to go). Ultimately, I had to reduce the
--min_contig_length to 40 (default is 200) to finally get it to produce a contig corresponding to my exon1:exon3 splice form. Is this typical? I don't have a lot of experience with Trinity, so I'm a bit concerned that setting it this low will explode my memory footprint on a real-sized data set. The simulated reads are quite clean (and real samples will be from cancer samples), so I imagine the real graph will be far more complex.
Thanks for any tips!