Hi all,
Forgive me if this question has been posted before (I'm new to this forum). In the past, when doing transcript-level RNA-seq analyses, I have used the gffread utility to pull transcript sequences for annotation and downstream functional analyses.
i.e.
gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf
However, I am currently working on a gene-level analysis and am having issues doing something similar - especially in regard to evaluating novel loci in my Stringtie assemblies. The way I see it, I have one of two options:
1) Use my RefSeq IDs for non-novel loci and annotate novel loci separately and then merge these annotation lists prior to functional analysis. (If I go this route, I still need to generate a FASTA with the novel loci anyway)
or
2) Pull all gene/loci sequences and re-annotate everything
Is there a (relatively) simple work-around for this?
to which do you want a work-around?
"Pull all gene/loci sequences" or "re-annotate everything" or both?
Possibly both, since I'm not sure which one will work better for me when I'm annotating downstream. Since my refseq GTF had 14,6XX genes and my new merged assembly (using Taco) has about 24,000, so I've got a lot of novel loci to deal with. My plan now is to carry the orginal refseq IDs through (primarily because I can more easily do functional annotation with those IDs) for the reference loci and then do protein coding potential prediction and annotation of the novel loci. It seems to me that a work-around to "pull all gene/loci sequences" and then just filter out novel loci and annotating those could be an effective way to get about this. Thoughts?
that plan does sound fair indeed.
to 'annotate' the new transcripts you could submit them to some online resources that look for protein and function in a (semi) automated way. One that comes to mind is Trapid, but there are others around for sure.
Yeah, that makes sense (and I'll have a look at Trapid!). However, my primary issue here is that I can't figure out how to construct a FASTA file for the gene/loci sequences. Pulling the transcript sequences with gffread is straightforward, but gene/loci on the other hand...
normally extracting the transcripts should be enough (if it is for annotation that is), no need to get the complete loci
if you have a gff/gtf file from the novel transcripts you can extract them with for instance
gffread
from bedtools or using theAGAT
packageIf you provide stringtie with the reference annotation when you assemble, then it will do 1 for you - that is known transcripts will get their known transcript ids and novel transcripts of known genes will get their known gene id. Only novel transcripts of novel genes will get a novel gene id.
That's correct, I can easily ID which genes are novel when I include a reference annotation in my pipeline, but my issue is actually getting the gene sequences (specifically the novel loci) as a FASTA, so I can do the annotation downstream.
What do you mean by "annotate"? Annotate in what way? For what purpose?
If you want just sequences of the novel I'd do something like:
Is the idea that you want try to attach something like domain or GO annotations to the novel genes?