Annotation for gene-level analysis after stringtie
0
0
Entering edit mode
14 months ago
atorson • 0

Hi all,

Forgive me if this question has been posted before (I'm new to this forum). In the past, when doing transcript-level RNA-seq analyses, I have used the gffread utility to pull transcript sequences for annotation and downstream functional analyses.

i.e.

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf


However, I am currently working on a gene-level analysis and am having issues doing something similar - especially in regard to evaluating novel loci in my Stringtie assemblies. The way I see it, I have one of two options:

1) Use my RefSeq IDs for non-novel loci and annotate novel loci separately and then merge these annotation lists prior to functional analysis. (If I go this route, I still need to generate a FASTA with the novel loci anyway)

or

2) Pull all gene/loci sequences and re-annotate everything

Is there a (relatively) simple work-around for this?

RNA-Seq stringtie tuxedo annotation • 852 views
0
Entering edit mode

to which do you want a work-around?

"Pull all gene/loci sequences" or "re-annotate everything" or both?

0
Entering edit mode

Possibly both, since I'm not sure which one will work better for me when I'm annotating downstream. Since my refseq GTF had 14,6XX genes and my new merged assembly (using Taco) has about 24,000, so I've got a lot of novel loci to deal with. My plan now is to carry the orginal refseq IDs through (primarily because I can more easily do functional annotation with those IDs) for the reference loci and then do protein coding potential prediction and annotation of the novel loci. It seems to me that a work-around to "pull all gene/loci sequences" and then just filter out novel loci and annotating those could be an effective way to get about this. Thoughts?

0
Entering edit mode

that plan does sound fair indeed.

to 'annotate' the new transcripts you could submit them to some online resources that look for protein and function in a (semi) automated way. One that comes to mind is Trapid, but there are others around for sure.

0
Entering edit mode

Yeah, that makes sense (and I'll have a look at Trapid!). However, my primary issue here is that I can't figure out how to construct a FASTA file for the gene/loci sequences. Pulling the transcript sequences with gffread is straightforward, but gene/loci on the other hand...

0
Entering edit mode

normally extracting the transcripts should be enough (if it is for annotation that is), no need to get the complete loci

if you have a gff/gtf file from the novel transcripts you can extract them with for instance gffread from bedtools or using the AGAT package

0
Entering edit mode

If you provide stringtie with the reference annotation when you assemble, then it will do 1 for you - that is known transcripts will get their known transcript ids and novel transcripts of known genes will get their known gene id. Only novel transcripts of novel genes will get a novel gene id.

0
Entering edit mode

That's correct, I can easily ID which genes are novel when I include a reference annotation in my pipeline, but my issue is actually getting the gene sequences (specifically the novel loci) as a FASTA, so I can do the annotation downstream.

0
Entering edit mode

What do you mean by "annotate"? Annotate in what way? For what purpose?

If you want just sequences of the novel I'd do something like:

grep 'gene_id "MSTRG[0-9]+"' transcripts.gtf > novel_genes.gtf
gffread -w transcripts.fa -g /path/to/genome.fa novel_genes.gtf


Is the idea that you want try to attach something like domain or GO annotations to the novel genes?