Converting a de novo transcriptome assembled with Trinity to a .gff3 file
2.9 years ago
Raito92 ▴ 60

Hello! I've assembled a de novo transcriptome with Trinity, resulting in Trinity.fasta, whose headers look like this:

>TRINITY_DN29256_c0_g1_i1 len=323 path=[0:0-322]


Each header is followed, on the next line, by the sequence.

To run an external downstream analysis with an R script, I need a .gff3 reference file (for the featureCounts function from Rsubread). For now, annotation isn't needed, just sequence names and coordinates.

I've already performed a classic edgeR analysis with Trinity, I'm just trying something different and need this very specific input file.

Can anyone help me here? Thanks in advance!


I do not have experience with Trinity, but I have seen similar cases where a GFF3 was obtained by mapping the Trinity fasta to the reference with GMAP. Maybe it can help in your case.


I've tried GMAP with the following command, but it seems to freeze for no apparent reason and I get an empty output file.

gmap -d Trinity.fasta -f 3 > meh.gff3


What do you mean by reference? My organism is not a model organism, so the assembly is de novo and I don't really have a reference genome.


GMAP maps transcripts against a reference genome. The GFF you get describes the location and structure of the transcripts within that genome. Since you don't have a reference genome, it is of no use here.

What you can do instead is use TransDecoder to predict the coding regions within a transcript FASTA file. The GFF you get describes the feature type of the different regions in each sequence, i.e. the exon, what is coding (CDS) and what is non-coding (UTR).


Maybe map with minimap2 instead, then bamtobed, then to gff (or maybe there's a direct bam->gff converter...)

2.7 years ago
h.mon 34k

featureCounts assigns zero counts to multi-mapped reads. Trinity assemblies have a lot of "redundancy", as the assembler tries to recover all possible isoforms of a gene. This would mean a lot of the mapped reads would map to multiple locations (to several isoforms), and featureCounts would assign zero counts to all those reads. Better approaches to deal with this would be quantification with RSEM, Salmon or kallisto.
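To gauge how much isoform redundancy an assembly actually contains, the headers can be tallied per gene. This is only a rough sketch: the function name isoforms_per_gene is made up for illustration, and it assumes the Trinity gene ID is the transcript ID with its trailing _i<number> removed (so TRINITY_DN29256_c0_g1_i1 belongs to gene TRINITY_DN29256_c0_g1).

```shell
# Sketch: count Trinity isoforms per gene from the FASTA headers.
# Assumption: isoforms_per_gene is an illustrative name, not part of Trinity.
# Assumption: gene ID = transcript ID minus the trailing _i<number>.
isoforms_per_gene() {
    grep '^>' "$1" \
        | awk '{
                   id = substr($1, 2)          # first word of the header, without ">"
                   sub(/_i[0-9]+$/, "", id)    # strip the isoform suffix
                   n[id]++
               }
               END { for (g in n) printf "%s\t%d\n", g, n[g] }' \
        | LC_ALL=C sort
}
```

Calling isoforms_per_gene Trinity.fasta prints one tab-separated line per gene with its isoform count; genes with a count above 1 are the ones where reads are likely to map to several transcripts.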

5 months ago
danvoronov ▴ 30

Trinity includes a script, cdna_fasta_file_to_transcript_gtf.pl, in the util/misc folder of the Trinity installation, that makes a GTF file out of a Trinity FASTA.

perl /<trinity_folder>/util/misc/cdna_fasta_file_to_transcript_gtf.pl Trinity.fasta | grep -w "exon" - > Trinity.gtf


You can also drop the pipe and everything after it; I include it because some software requires the GTF to contain only "exon" lines:

perl /<trinity_folder>/util/misc/cdna_fasta_file_to_transcript_gtf.pl Trinity.fasta > Trinity.gtf


Then convert the GTF to GFF3 with gffread:

gffread Trinity.gtf -o Trinity.gff3


This essentially gives a GTF/GFF3 file with the start and end positions of the FASTA sequences. In software that requires GTF/GFF3, Trinity.fasta can then be used in place of the "genome" file if no reference genome is available to map the transcriptome to.
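If the Trinity helper script or gffread is not at hand, roughly the same kind of file can be sketched with awk alone. This is a minimal illustration, not what the Trinity script does internally: the function name fasta_to_gff3, the "Trinity" source label, the "transcript"/"exon" feature types and the "+" strand are all assumptions made for the example. Each sequence becomes its own "chromosome", with one feature spanning position 1 to its length.

```shell
# Sketch: write a GFF3 in which every transcript is covered end to end by
# one "transcript" feature and one child "exon" feature.
# Assumption: fasta_to_gff3, the "Trinity" source column and the "+" strand
# are illustrative choices, not taken from any Trinity tool.
fasta_to_gff3() {
    awk '
        function emit() {
            # Print the record collected so far (1-based, inclusive coordinates).
            if (id != "") {
                printf "%s\tTrinity\ttranscript\t1\t%d\t.\t+\t.\tID=%s\n", id, len, id
                printf "%s\tTrinity\texon\t1\t%d\t.\t+\t.\tParent=%s\n", id, len, id
            }
        }
        BEGIN { print "##gff-version 3" }
        /^>/ {
            emit()
            id = substr($1, 2)   # first word of the header, without ">"
            len = 0
            next
        }
        {
            sub(/\r$/, "")       # tolerate Windows line endings
            len += length($0)    # sequence may be wrapped over several lines
        }
        END { emit() }
    ' "$1"
}
```

It would be called as fasta_to_gff3 Trinity.fasta > Trinity.gff3. The length is computed from the sequence itself rather than from the len= field in the header, so it also works on FASTA files whose headers have been trimmed.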