Hi all, I am running a workflow to identify single copy orthogroups from RNAseq data including 9 species in a family of non-model organisms. All 9 species are closely related enough that they can be aligned to the same reference genome (one of the 9 species) with high alignment rates. I am concerned that my post-alignment transcript assembly step may be causing some downstream problems.
Background:
I am assembling transcripts from my species using StringTie following the workflow:
Reference genome-guided assembly of .bam to gtf
stringtie $i.bam -o ${species}_${sample}_2ref.gtf -a 8 -A -G reference.gtf -l ${species}_transcript
Merge all
stringtie ./*.gtf --merge -o all_merged.gtf -G reference.gtf -l all_transcripts
Merged-guided assembly of .bam to gtf
stringtie $i.bam -o ${species}_${species}_2merged.gtf -a 8 -A -G all_merged.gtf
I have tried transcript assembly in two different ways. 1. Merging gtfs in step two in a species-specific manner and 2. Merging all gtfs during step two. The former results in extremely few orthogroups discovered downstream in my workflow, and the latter results in many orthogroups but incorrect species trees being built from multiple species alignments. In all cases, I recover 0 gene names from my reference after stringtie. Every single assembled transcript has an "MSTRG" identity. I am confused as to why every single transcript cannot be contextualized back to the reference gtf.
Questions:
- Is it appropriate to merge all transcripts from 9 different species (aligned to the same reference) in the "Stringtie --merge" step? I have only seen single-species use up until now.
- Could the fact that 0 gene names are preserved and every transcript is unique enough to get an MSTRG identifier be contributing to the difficulty in comparing transcripts to build orthogroups downstream. (I am using TransDecoder, CD-HIT, and OrthoFinder after the StringTie step to find open-reading frames and predicted amino acid sequences)
- Has anybody had experience using these tools for orthology comparisons from short-read RNA data?
Thanks for the help! If there is crucial information that I left out, please let me know, this is my first post here.