Detecting novel spliceforms from RNA-seq data: will STAR > RSEM work, or do I need to use something else?
3.5 years ago
Jen ▴ 30

I am trying to determine the best RNA-seq analysis pipeline for identifying novel spliceforms (I also care about non-coding RNAs). I have an enormous RNA-seq dataset which I have already analyzed using STAR to map and RSEM to quantify. My data are mouse .fastq files from a stranded, ribodepleted library: 60M paired-end reads per sample, 100 bp reads. I thought I read that RSEM is not able to detect novel spliceforms (am I wrong?). The pipeline I think would work for this is Hisat2 > Stringtie > Ballgown. My questions are: (1) Can I use my current pipeline (STAR > RSEM) to identify novel spliceforms with special run parameters, or do I need to redo the analysis with a different pipeline? (2) If a different pipeline would be better for this, which one would people recommend, and what options would you use for mapping and quantification?

My boss and I never discussed wanting to identify novel spliceforms, and now he has a grant due and wants this data ASAP, so I'm in a time crunch! Also, I've only been doing bioinformatics for two years and have taught myself, so I apologize if anything doesn't make sense. Please ask for clarification if needed. Your advice is greatly appreciated.

RNA-Seq • 1.2k views

If you already have your Illumina reads and want to map them to the reference genome, you need a splice-aware mapper such as HISAT2 or STAR to resolve the junctions. You are comparing reads that come from mature RNA, which has no introns, against a reference genome that does have them.


So right now, I have STAR output which was generated by mapping my reads to GRCm38.dna.primary_assembly.fa with this command:

STAR --genomeDir star --readFilesIn Sample1_Forward.fq Sample1_Reverse.fq --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM --runThreadN 16 --outFileNamePrefix "Sample1_star/"

The output files are: Aligned.sortedByCoord.out.bam, Aligned.toTranscriptome.out.bam, Log.progress.out, _STARgenome, _STARpass1

Then normally I would run RSEM on the STAR output file Aligned.toTranscriptome.out.bam:

RSEM-1.3.1/rsem-calculate-expression --bam -p 16 \
    --paired-end --forward-prob .5 \
    Sample1_star/Aligned.toTranscriptome.out.bam \
    rsem/GRCm38 Sample1_rsem/rsem >& \
    Sample1_rsem/rsem.log

The output files for RSEM are: rsem.genes.results, rsem.isoforms.results, rsem.transcript.bam, rsem.stat

I've been using the rsem.genes.results and rsem.isoforms.results files so far for analysis. I was assuming that these files only contain known isoforms. Do my RSEM results already contain information on novel spliceforms that I just don't know how to access? Sorry if any of this is obvious. Also sorry for the formatting of this reply. I'm still getting used to it.
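RSEM quantifies only the transcripts present in the reference it was built from, so novel isoforms have to come from an assembly step. A minimal sketch of how the coordinate-sorted STAR BAM above could be passed to StringTie without remapping is given here. It is untested on this dataset and rests on several assumptions: `annotation.gtf` is a hypothetical path standing in for the GTF that matches the GRCm38 indices, the `Sample1_stringtie` directory name is made up for illustration, and `--rf` assumes a dUTP-style (reverse) stranded kit, so use `--fr` for a forward-stranded one. If StringTie reports missing strand information, remapping with STAR's `--outSAMstrandField intronMotif` may be needed. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch only: reference-guided assembly on the existing STAR BAM.
set -euo pipefail

sample="Sample1"
gtf="annotation.gtf"   # hypothetical path: GTF matching the GRCm38 indices

# run: print a command, and execute it only when DRY_RUN=0
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

run mkdir -p "${sample}_stringtie"

# 1. Assemble transcripts, guided by (but not limited to) the known annotation.
#    --rf is an assumption about the library kit; see the note above.
run stringtie "${sample}_star/Aligned.sortedByCoord.out.bam" \
    --rf -p 16 -G "$gtf" -o "${sample}_stringtie/assembled.gtf"

# 2. Compare the assembly with the reference annotation; gffcompare assigns
#    class codes (for example "j" for a novel isoform of a known gene).
run gffcompare -r "$gtf" -o "${sample}_stringtie/cmp" \
    "${sample}_stringtie/assembled.gtf"
```

For many samples, the per-sample GTFs would typically be combined with `stringtie --merge` before comparing against the reference, so that every sample is quantified against the same transcript set.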

3.5 years ago

I would not have said this a year ago...

I would run Iso-Seq sequencing with PacBio HiFi to answer this question:


  1. Prices for PacBio sequencing have dropped dramatically. They can be pretty similar to Illumina prices nowadays.

  2. Iso-Seq involves truly sequencing your RNA population, with no need for statistical inference, which in many cases leads to false results. I mean that you end up sequencing the whole mRNA, from beginning to end. You obtain the actual sequence of your RNA, and with HiFi reads the quality values exceed those of Illumina.

That way you no longer have to rely on mapping your short reads to infer transcripts.

