Too many novel isoforms (57% of total) identified using StringTie
10 months ago
Jen ▴ 30

I'm using StringTie to identify novel isoforms from mouse RNA-seq data (60M PE reads/sample, stranded, ribodepletion method). The data have been mapped to the mouse genome using STAR. I'm finding that 57% of isoforms are novel. Someone mentioned to me that StringTie calls so many isoforms novel because single-base differences among transcripts are sufficient to call them “different”, but that there are ways to relax this so that only truly novel exons are identified. I've looked in the manual for a way to relax this but couldn't find anything. Here is the code I used for STAR as well as StringTie:

# Make Genome Index

STAR --runThreadN 40 --runMode genomeGenerate --sjdbOverhang 199 \
  --genomeDir GENOME_data/star \
  --genomeFastaFiles GENOME_data/GRCm38.primary_assembly.genome.fa \
  --sjdbGTFfile GENOME_data/Mus_musculus.GRCm38.100.gtf

# Map Reads

STAR --genomeDir GENOME_data/star --readFilesIn 2_Forward.fq 2_Reverse.fq \
  --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 \
  --outSAMunmapped Within --twopassMode Basic --outFilterMultimapNmax 1 \
  --quantMode TranscriptomeSAM --outSAMstrandField intronMotif \
  --runThreadN 16 --outFileNamePrefix "2_star/"


# Use StringTie to Quantify Transcripts

stringtie-2.1.4/stringtie 2_star/Aligned.sortedByCoord.out.bam -G GENOME_data/Mus_musculus.GRCm38.100.gtf -A 2_gene_abund.tab -B -o 2.gtf

StringTie Novel Isoforms • 392 views

Well, you could always cluster the sequences after the fact. Use mmseqs2 easy-cluster at a high threshold (e.g., 97% identity and 97% bidirectional coverage); that would eliminate "redundant" isoforms.
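A minimal sketch of that clustering step, assuming gffread and mmseqs2 are installed and on the PATH. The function name `cluster_isoforms`, the output prefix, and the temporary directory name are hypothetical; the input file names in the example call are the ones from the question. gffread writes the spliced transcript sequences for the StringTie GTF, and `--cov-mode 0` asks mmseqs2 to apply the coverage threshold to both sequences in a pair.

```shell
# Hypothetical helper: extract transcript sequences, then cluster near-identical ones.
cluster_isoforms() {
    gtf="$1"      # StringTie output, e.g. 2.gtf
    genome="$2"   # genome FASTA the reads were aligned to
    prefix="$3"   # output prefix (illustrative name)

    # Write the spliced sequence of every transcript in the GTF.
    gffread -w "${prefix}_transcripts.fa" -g "$genome" "$gtf" || return 1

    # Cluster at 97% identity with 97% coverage of both query and target.
    mmseqs easy-cluster "${prefix}_transcripts.fa" "${prefix}_clu" "${prefix}_tmp" \
        --min-seq-id 0.97 -c 0.97 --cov-mode 0
}

# Example call (commented out so the block has no side effects when sourced):
# cluster_isoforms 2.gtf GENOME_data/GRCm38.primary_assembly.genome.fa 2
```

Note that clustering on sequence identity would also merge isoforms that differ only by a short exon or a shifted splice site, so whether 97% is an appropriate cutoff depends on what you want to count as a distinct isoform.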

10 months ago
Jen ▴ 30

Is it uncommon to see so many novel isoforms? Pertea et al. 2016 (the tutorial I'm following for HISAT, StringTie, and Ballgown) says the following about the results for the sample data: "...there are nearly as many novel transcripts (isoforms) as known transcripts in each sample. Most of the transcriptome diversity is due to alternative splicing, and it is not unusual to observe that a large fraction of isoforms in an RNA-seq experiment are novel." So maybe 57% of transcripts being novel is not unheard of.
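For reference, one way such a percentage could be tallied directly from a StringTie GTF is sketched below. It assumes that "known" means a transcript line carrying the `reference_id` attribute, which StringTie writes when an assembled transcript corresponds to an entry in the `-G` annotation; the helper name `count_novel` is hypothetical, and this may not be how the 57% above was computed.

```shell
# Hypothetical helper: tally reference-matching vs. novel transcripts in a StringTie GTF.
count_novel() {
    awk -F'\t' '
        $3 == "transcript" {
            total++
            # Transcripts tied to a reference annotation entry carry reference_id.
            if ($9 ~ /reference_id/) known++
        }
        END {
            novel = total - known
            pct = (total > 0) ? 100 * novel / total : 0
            printf "total=%d known=%d novel=%d novel_pct=%.1f\n", total, known, novel, pct
        }
    ' "$1"
}

# Example call (commented out so the block has no side effects when sourced):
# count_novel 2.gtf
```

This only separates transcripts with and without a reference match. A tool such as gffcompare assigns class codes that further distinguish, for example, a new combination of known splice junctions from a transcript in an unannotated region.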


I am not an assembly guy at all, but there was a paper from the Salzberg group (who developed StringTie) in Genome Biology in which they assembled deeply sequenced human RNA-seq data and claimed to have found many novel transcripts, including protein-coding ones. Shortly after the initial preprint appeared on bioRxiv, the GENCODE consortium replied with this preprint, claiming that most of the new protein-coding predictions could not be confirmed by independent experiments. It is therefore at least questionable how feasible and reliable the StringTie results are, especially in well-annotated organisms such as human and mouse, and whether it creates far more artifacts than genuinely new, interesting (and therefore meaningful) transcripts/isoforms. As I said, I have no assembly experience and cannot really judge either of the linked papers, so I leave it to you to decide whether to trust or reconsider your results.


Dang, thanks for that.