Hello! I've de-novo assembled a transcriptome with Trinity, resulting in Trinity.fasta, whose headers look like this:
>TRINITY_DN29256_c0_g1_i1 len=323 path=[0:0-322]
Followed, in the next line, by the sequence.
To run an external downstream analysis with an R script, I need a .gff3 reference file (for the featureCounts function from Rsubread). Of course, for now, annotation isn't needed, just names and coordinates.
I've already performed a classic edgeR analysis with Trinity; I'm just trying something different and need this very specific input file.
Can anyone help me here? Thanks in advance!
I do not have experience with Trinity, but I have seen similar cases where a GFF3 was obtained by mapping the Trinity fasta to the reference with GMAP. Maybe it can help in your case.
I've tried to use GMAP, with the following code, but the script seems to freeze for no reason and I get an empty output file.
What do you mean by reference? It's a de-novo assembly, because my organism is not a model one, so I don't really have one.
gmap is used to map transcripts against a reference genome. The gff you get describes the location and structure of the transcripts within the reference genome. As you don't have a reference genome, it is useless here.
What you can do is use TransDecoder to predict the coding regions within a transcript fasta file. The gff you will get describes the feature type of the different regions in each sequence, i.e. the exon, what is coding (CDS) and what is non-coding (UTR).
Maybe map with minimap2 instead, then bamtobed, then to gff (or maybe there's a direct bam->gff converter...)
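Since only names and coordinates are needed for now, another option is to write the GFF3 straight from Trinity.fasta, treating every contig as its own reference sequence with one feature spanning its full length. Below is a minimal Python sketch of that idea, not a tested pipeline. The function names, the "exon" feature type, the "+" strand and the gene_id attribute (the transcript name with the trailing _i<N> isoform suffix removed) are all my assumptions, so adjust them to whatever your script expects.

```python
import re


def fasta_lengths(lines):
    """Yield (sequence_id, length) for each record in FASTA-formatted lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Keep only the first word of the header, e.g. TRINITY_DN29256_c0_g1_i1
            name, length = line[1:].split()[0], 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def gene_id(transcript_id):
    """Assumption: the Trinity gene name is the transcript name minus '_i<N>'."""
    match = re.match(r"(.+)_i\d+$", transcript_id)
    return match.group(1) if match else transcript_id


def to_gff3(lines, source="Trinity", feature="exon"):
    """Return GFF3 lines with one full-length feature per contig.

    Assumptions: feature type 'exon' and strand '+' are placeholders,
    since a de novo contig has no real genomic coordinates.
    """
    rows = ["##gff-version 3"]
    for name, length in fasta_lengths(lines):
        if length == 0:
            continue  # skip empty records
        attributes = f"ID={name};gene_id={gene_id(name)}"
        rows.append("\t".join(
            [name, source, feature, "1", str(length), ".", "+", ".", attributes]
        ))
    return rows


# Small demo using the header from the question and a dummy 323 bp sequence.
demo = [">TRINITY_DN29256_c0_g1_i1 len=323 path=[0:0-322]", "ACGT" * 80, "ACG"]
print("\n".join(to_gff3(demo)))
```

To use it on the real assembly, open Trinity.fasta and pass the file handle to to_gff3, then write the returned lines to a .gff3 file. The reads would have to be aligned to Trinity.fasta itself for these coordinates to make sense. On the Rsubread side you would probably need isGTFAnnotationFile=TRUE and to point GTF.featureType and GTF.attrType at the feature type and attribute used here, but I have not run this end to end.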