Transcript Assembly for Multiple Species Using StringTie and Orthogroup Discovery using OrthoFinder
0
0
Entering edit mode
10 months ago
dtnondorf • 0

Hi all, I am running a workflow to identify single copy orthogroups from RNAseq data including 9 species in a family of non-model organisms. All 9 species are closely related enough that they can be aligned to the same reference genome (one of the 9 species) with high alignment rates. I am concerned that my post-alignment transcript assembly step may be causing some downstream problems.

Background:

I am assembling transcripts from my species using StringTie following the workflow:

  1. Reference genome-guided assembly of .bam to gtf

    stringtie $i.bam -o ${species}_${sample}_2ref.gtf -a 8 -A -G reference.gtf -l ${species}_transcript

  2. Merge all

    stringtie ./*.gtf --merge -o all_merged.gtf -G reference.gtf -l all_transcripts

  3. Merged-guided assembly of .bam to gtf

    stringtie $i.bam -o ${species}_${species}_2merged.gtf -a 8 -A -G all_merged.gtf

I have tried transcript assembly in two different ways. 1. Merging gtfs in step two in a species-specific manner and 2. Merging all gtfs during step two. The former results in extremely few orthogroups discovered downstream in my workflow, and the latter results in many orthogroups but incorrect species trees being built from multiple species alignments. In all cases, I recover 0 gene names from my reference after stringtie. Every single assembled transcript has an "MSTRG" identity. I am confused as to why every single transcript cannot be contextualized back to the reference gtf.

Questions:

  1. Is it appropriate to merge all transcripts from 9 different species (aligned to the same reference) in the "Stringtie --merge" step? I have only seen single-species use up until now.
  2. Could the fact that 0 gene names are preserved and every transcript is unique enough to get an MSTRG identifier be contributing to the difficulty in comparing transcripts to build orthogroups downstream. (I am using TransDecoder, CD-HIT, and OrthoFinder after the StringTie step to find open-reading frames and predicted amino acid sequences)
  3. Has anybody had experience using these tools for orthology comparisons from short-read RNA data?

Thanks for the help! If there is crucial information that I left out, please let me know, this is my first post here.

Species StringTie Orthogroups Transcript Assembly • 729 views
ADD COMMENT

Login before adding your answer.

Traffic: 896 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6