How to get the genes that contain the highest number of isoforms from a fasta file generated by trinity de novo?
2
0
Entering edit mode
15 months ago

I have a trinity file in a fasta format and I want to get the gene(s) with the highest no. of isoforms.

assembly fasta trinity • 2.4k views
1
Entering edit mode

I'm assuming you're referring to the contigs FASTA file generated by trinity de novo. If that's a correct assumption, please state this overtly in your question.

A FASTA file contains sequences. Genes and isoforms are functional annotations that go on top. Without the latter annotation process, what you're asking for is not possible.

EDIT: I was mistaken - Trinity seems to add gene and isoform annotations to its output (though I still think these annotations are unreliable without a GTF especially when they result from pure computational analyses)

0
Entering edit mode

ok, I have a FASTA file generated by trinity de novo, how then I can get the gene(s) with the highest no. of isoforms?

0
Entering edit mode

You do not have a "trinity file" - you ave the assembled output FASTA format from trinityrnaseq. What organism is this for? Does it have reference gene/transcript annotations?

0
Entering edit mode

does the longest sequence is the sequence that contains the highest number of isoforms?

1
Entering edit mode

The longest sequence is the sequence with the most bases. Please read up on de novo RNAseq processing methods and try something on your own before asking here.

1
Entering edit mode
15 months ago

Hey,

this could work.

You should have a file called: Trinity.fasta.gene_trans_map link

library(readr)
"\t", escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE)

# A tibble: 6 x 2
#X1                 X2
#<chr>              <chr>
#1 TRINITY_DN1_c0_g1  TRINITY_DN1_c0_g1_i1
#2 TRINITY_DN1_c0_g2  TRINITY_DN1_c0_g2_i1
#3 TRINITY_DN13_c0_g1 TRINITY_DN13_c0_g1_i1
#4 TRINITY_DN13_c0_g1 TRINITY_DN13_c0_g1_i2
#5 TRINITY_DN14_c0_g1 TRINITY_DN14_c0_g1_i1
#6 TRINITY_DN15_c0_g1 TRINITY_DN15_c0_g1_i1

out<-as.matrix(table(Trinity_fasta_gene_trans_map$X1)) #get the number of isoform per gene head(out) [,1] TRINITY_DN1_c0_g1 1 TRINITY_DN1_c0_g2 1 TRINITY_DN101_c0_g1 1 TRINITY_DN101_c0_g2 1 TRINITY_DN102_c0_g1 1 TRINITY_DN102_c0_g2 1  ADD COMMENT 0 Entering edit mode I am using bash not R ADD REPLY 1 Entering edit mode Bash solution: awk '{h[$1]++}; END { for(k in h) print k, h[k] }' Trinity.fasta.gene_trans_map.txt > out_trinity


source

ps: since several trinity wrappers run on R, I would strongly suggest you to learn some

0
Entering edit mode

I do not have Trinity.fasta.gene_trans_map.txt I have Trinity.fasta !

1
Entering edit mode

I used this file as input

$awk 'sub(/^>/, "")' Trinity.fasta > header.trinity # get the fasta headers$ head header.trinity

TRINITY_DN10_c0_g1_i1 len=334 path=[312:0-333] [-1, 312, -2]
TRINITY_DN11_c0_g1_i1 len=319 path=[297:0-318] [-1, 297, -2]
TRINITY_DN12_c0_g1_i1 len=244 path=[222:0-243] [-1, 222, -2]
TRINITY_DN17_c0_g1_i1 len=229 path=[207:0-228] [-1, 207, -2]
TRINITY_DN18_c0_g1_i1 len=633 path=[611:0-632] [-1, 611, -2]
TRINITY_DN18_c1_g1_i1 len=289 path=[267:0-288] [-1, 267, -2]
TRINITY_DN19_c0_g1_i1 len=283 path=[261:0-282] [-1, 261, -2]
TRINITY_DN21_c0_g1_i1 len=242 path=[220:0-241] [-1, 220, -2]

$cat header.trinity | sed 's/_i.*//' > gene.header.trinity # keep only the portion of the header representing the gene_id (example TRINITY_DN21_c0_g1).$ head gene.header.trinity # check the new headers
TRINITY_DN10_c0_g1
TRINITY_DN11_c0_g1
TRINITY_DN12_c0_g1
TRINITY_DN17_c0_g1
TRINITY_DN18_c0_g1
TRINITY_DN18_c1_g1
TRINITY_DN19_c0_g1
TRINITY_DN21_c0_g1
TRINITY_DN22_c0_g1
TRINITY_DN43_c0_g1

$awk '{h[$1]++}; END { for(k in h) print k, h[k] }' gene.header.trinity > final_output # count how many time a string appears

TRINITY_DN554_c0_g1 1
TRINITY_DN502_c0_g1 1
TRINITY_DN454_c0_g1 1
TRINITY_DN402_c0_g1 1
TRINITY_DN402_c4_g1 1
TRINITY_DN402_c0_g2 1
TRINITY_DN302_c0_g1 1
TRINITY_DN354_c0_g1 1
TRINITY_DN354_c4_g1 1
TRINITY_DN110_c0_g1 1

0
Entering edit mode

I think you'll need to run this script to get that file: https://github.com/trinityrnaseq/trinityrnaseq/blob/18bb3182005a61d236525185d8dda65867ca4cc0/util/align_and_estimate_abundance.pl

Try:

perl align_and_estimate_abundance.pl --transcripts Trinity.fasta --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference

0
Entering edit mode

it says 'Can't open perl script "align_and_estimate_abundance.pl": No such file or directory'

2
Entering edit mode

My bad - don't type the command exactly as I gave it to you. Understand it and make necessary changes so it makes sense in your context.

0
Entering edit mode

Are you opposed to using R?

0
Entering edit mode

no, but I need to solve it by bash now if it possible

0
Entering edit mode
15 months ago
h.mon 34k

The output of a Trinity fasta assembly has headers following a convention, see the documentation for details:

>TRINITY_DN17755_c0_g1_i1 len=221 path=[0:0-220]
>TRINITY_DN17876_c0_g1_i1 len=254 path=[0:0-253]
>TRINITY_DN17736_c0_g1_i1 len=413 path=[0:0-412]


The c, g and i stand for clusters, genes and isoforms - note those are somewhat loosely determined, as short reads transcriptome assemblies are very noisy in general.

To get the single "gene" with the most "isoforms", you can use standard linux command line utilities:

grep ">" Trinity.fasta \
| cut -d " " -f1 \
| sort -t"_" -n -r -k5.2 \


To get a list of the "N" genes with the most isoforms would be a bit more complicated, and probably involve some scripting, parsing, sorting and filtering the fields from the fasta headers.

0
Entering edit mode

ok, when I ran this command on the Trinity fasta that you sent I got this 'TRINITY_DN17876_c0_g1_i1' could you explain to me why I get especially this gene? also what is the meaning of -k5.2 in this command

0
Entering edit mode

what is the meaning of -k5.2 in this command

EDIT: Looks like 5.2 is not a difference-in-decimal-separator version of 5,2 but a whole different thing. Please see h.mon's comment below

It's probably because some countries use , to represent decimals and . to represent the comma separator. The manual will help you understand -k5,2, and that should translate to the -k5.2 used here.

0
Entering edit mode

ok is there is a way to get all the genes not just the one with the highest isoforms?

0
Entering edit mode

Read the manual pages of each shell command used, and you'll understand what to do so you see all entries, not just the first.

0
Entering edit mode

I have no idea why you get a gene with one isoform as a result. I get >TRINITY_DN1104_c0_g1_i372. Unless all the transcripts from your assembly have only one isoform, you shoud get something with a high number of isoforms.

Regarding -k, reading man sort shows:

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.

So -k5.2 means sort by the 5th field, beginning from the second character from that field. As I stated "_" as separator, the 5th field is the isoform number, and the sorting ignores the i prefix.

1
Entering edit mode

Side note: -k5Vr would work as well - version sort is great for alphanumeric content where we want the alphabet portion sorted alphabetically and the numeric portion sorted numerically.

0
Entering edit mode

in this example what should I do to know how many assembly clusters do we have in the assembled transcriptome?