Question: De novo RNA-seq
gravatar for blawney
2.5 years ago by
blawney10 wrote:

I'm evaluating some options for novel isoform detection and I'm looking for some advice/experience on settings. I finally produced the "spiked" results I expected, but I'm concerned about an explosion of memory, etc. when I run it on a full-sized experiment.

TL;DR: what are typical settings for Trinity on RNA-seq of PE75 or PE100? I had to use --min_contig_length=40 to get Trinity to find a contig that I know exists for simulated PE100 reads.

My setup was--

  1. I randomly selected a handful of gene name/IDs. I then get the sequence of those genes and some intergenic regions to place between them, creating a mock genome. For each of those genes, I create a set of transcripts by splicing the various exons in the appropriate order (e.g. no circRNA). I also create TWO GTF files. One is the full GTF, including all transcripts. The other removes a transcript corresponding to a skipped exon (in my example the gene has 3 exons, and the "spiked" mature transcript would be exon 1 spliced to exon 3, skipping exon 2).

  2. Using BioC's Polyester, I simulated some stranded, paired-end reads. To do the simulation, I use the full GTF and transcript fasta from step 1. This creates a pair of fasta files. I can see that there are ~300 reads that are split across the junction (based on CIGAR strings from a STAR align AND an IGV sashimi plot).

  3. I first tried by TransAbyss since it has the companion PAVFinder tool which will align the assembled contigs to the genome, etc. and report potentially novel splicings. As part of TransAbyss/PAVFinder, you supply a GTF of known annotations along with the sequencing reads. Unfortunately, this produced no results. This was after merging assemblies produced for kmers of 22,24,26,28,30,32.

When I look at the assembled contigs produced by TransAbyss, it produces a perfect match for the full transcript (exon1:exon2:exon3), but does NOT generate any contigs for the skipped exon1:exon3 form. If I manually add a sequence spanning the exon1:exon3 junction to the FASTA, PAVFinder correctly identifies the "novel" isoform.

At this point I figured the problem lies in the assembly itself. I then decided to use Trinity since it has the --min_contig_length parameter (which naively seemed to be the way to go). Ultimately, I had to reduce the --min_contig_length to 40 (default is 200) to finally get it to produce a contig corresponding to my exon1:exon3 splice form. Is this typical? I don't have a lot of experience with Trinity, so I'm a bit concerned that setting it this low will explode my memory footprint on a real-sized data set. The simulated reads are quite clean (and real samples will be from cancer samples), so I imagine the real graph will be far more complex.

Thanks for any tips!

rna-seq alignment • 919 views
ADD COMMENTlink modified 2.5 years ago by dario.garvan470 • written 2.5 years ago by blawney10
gravatar for dario.garvan
2.5 years ago by
dario.garvan470 wrote:

The minimum contig length of 200 is used by most published research using Trinity. It's a simple filtering to avoid reporting numerous partial transcripts which are not likely to be long enough to code for a protein. I guess that the two exons which you designed for testing are both quite short.

ADD COMMENTlink written 2.5 years ago by dario.garvan470

The exons in question were 342 bp and 534bp (taken from WBP1 gene). Perhaps I need to dig into the details in the graph algorithms a bit more. Perhaps start default and then do a bit of exploration in parameter space, depending on how the run times are.

I intentionally used rather high coverage to be sure I was not missing anything due to low coverage depth. With several hundred split reads spanning the junction, I will say it surprised me a bit that it couldn't find it under default settings (especially with rather clean simulated reads). If I reduce the --length parameter for TransAbyss, it eventually sldo finds a junction-spanning contig; unfortunately, that comes with a bit of an explosion of very similar, shorter contigs which feature the simulated basecall errors. But I suppose that's just the reality of cranking up the sensitivity...

ADD REPLYlink written 2.5 years ago by blawney10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 711 users visited in the last hour