Question

De novo RNA-seq

0

Entering edit mode

6.3 years ago

blawney ▴ 10

I'm evaluating some options for novel isoform detection and I'm looking for some advice/experience on settings. I finally produced the "spiked" results I expected, but I'm concerned about an explosion of memory, etc. when I run it on a full-sized experiment.

TL;DR: what are typical settings for Trinity on RNA-seq of PE75 or PE100? I had to use --min_contig_length=40 to get Trinity to find a contig that I know exists for simulated PE100 reads.

My setup was--

I randomly selected a handful of gene name/IDs. I then get the sequence of those genes and some intergenic regions to place between them, creating a mock genome. For each of those genes, I create a set of transcripts by splicing the various exons in the appropriate order (e.g. no circRNA). I also create TWO GTF files. One is the full GTF, including all transcripts. The other removes a transcript corresponding to a skipped exon (in my example the gene has 3 exons, and the "spiked" mature transcript would be exon 1 spliced to exon 3, skipping exon 2).
Using BioC's Polyester, I simulated some stranded, paired-end reads. To do the simulation, I use the full GTF and transcript fasta from step 1. This creates a pair of fasta files. I can see that there are ~300 reads that are split across the junction (based on CIGAR strings from a STAR align AND an IGV sashimi plot).
I first tried by TransAbyss since it has the companion PAVFinder tool which will align the assembled contigs to the genome, etc. and report potentially novel splicings. As part of TransAbyss/PAVFinder, you supply a GTF of known annotations along with the sequencing reads. Unfortunately, this produced no results. This was after merging assemblies produced for kmers of 22,24,26,28,30,32.

When I look at the assembled contigs produced by TransAbyss, it produces a perfect match for the full transcript (exon1:exon2:exon3), but does NOT generate any contigs for the skipped exon1:exon3 form. If I manually add a sequence spanning the exon1:exon3 junction to the FASTA, PAVFinder correctly identifies the "novel" isoform.

At this point I figured the problem lies in the assembly itself. I then decided to use Trinity since it has the --min_contig_length parameter (which naively seemed to be the way to go). Ultimately, I had to reduce the --min_contig_length to 40 (default is 200) to finally get it to produce a contig corresponding to my exon1:exon3 splice form. Is this typical? I don't have a lot of experience with Trinity, so I'm a bit concerned that setting it this low will explode my memory footprint on a real-sized data set. The simulated reads are quite clean (and real samples will be from cancer samples), so I imagine the real graph will be far more complex.

Thanks for any tips!

RNA-Seq alignment • 1.8k views

ADD COMMENT • link updated 6.3 years ago by dario.garvan ▴ 520 • written 6.3 years ago by blawney ▴ 10

score 0 · Answer 1 · 2018-02-11

0

Entering edit mode

6.3 years ago

dario.garvan ▴ 520

The minimum contig length of 200 is used by most published research using Trinity. It's a simple filtering to avoid reporting numerous partial transcripts which are not likely to be long enough to code for a protein. I guess that the two exons which you designed for testing are both quite short.

ADD COMMENT • link 6.3 years ago by dario.garvan ▴ 520

0

Entering edit mode

The exons in question were 342 bp and 534bp (taken from WBP1 gene). Perhaps I need to dig into the details in the graph algorithms a bit more. Perhaps start default and then do a bit of exploration in parameter space, depending on how the run times are.

I intentionally used rather high coverage to be sure I was not missing anything due to low coverage depth. With several hundred split reads spanning the junction, I will say it surprised me a bit that it couldn't find it under default settings (especially with rather clean simulated reads). If I reduce the --length parameter for TransAbyss, it eventually sldo finds a junction-spanning contig; unfortunately, that comes with a bit of an explosion of very similar, shorter contigs which feature the simulated basecall errors. But I suppose that's just the reality of cranking up the sensitivity...

ADD REPLY • link 6.3 years ago by blawney ▴ 10