Question

running BUSCO on all isoforms or longest isoform per gene?

6.0 years ago

Farbod ★ 3.4k

Hi Biostars,

To assess the completeness of my de novo transcriptome assembly, I used BUSCO v3.

Since the assembly is from a fish, I used actinopterygii_odb9 as the lineage dataset.

I ran BUSCO twice: once on the whole assembly (Trinity usually reports several isoforms for most genes, and I kept them all in this run), and once on only the longest isoform per gene.

As expected, the duplication rate decreased in the second run.

Q: Which approach is correct, i.e. which gives the more biologically meaningful answer?

Thanks

NOTE: most genes are already duplicated in fishes. ;-)

My BUSCO command:

python scripts/run_BUSCO.py -i Trinity.fasta  -o OUTPUT_All_isoforms -l actinopterygii_odb9 -m tran --cpu 8

assembly RNA-Seq • 3.0k views

updated 6.0 years ago by gilbert.bionet ▴ 160 • written 6.0 years ago by Farbod ★ 3.4k

score 2 · Answer 1 · 2018-05-01

You ask: "Which [way gives a] more biologically meaningful answer?" Don't leave out the other aspect: computationally meaningful.

Since you measured both (all isoforms vs longest), did you find a score difference? If the BUSCO score for Missing+Fragmented conserved genes differs between the two runs, with "all" having fewer misses, then the computational details matter.

Here are the computational points:

Point 1. Use BUSCO in protein mode (-m prot), as the transcript translation in run_BUSCO.py's transcript mode is flaky and not to be trusted. See below.

Point 2. Measure all isoforms, because the closest homology is quite often found in a shorter isoform. You can then rescore the BUSCO summary using only the best-homology isoform per locus, if you want to remove those isoform-Duplicate counts.
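The rescoring step in point 2 can be sketched as follows. This is only a minimal illustration, not part of BUSCO: it assumes you have already parsed the full results table into (busco_id, status, sequence_id, score) tuples, and that sequence IDs follow the Trinity convention where the locus is everything before `_i<N>`. The names `locus_of` and `rescore` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: Trinity-style IDs such as TRINITY_DN1_c0_g1_i2, optionally with a
# suffix like ".p1" added by an ORF caller. The locus is the part before "_i<N>".
ISOFORM_RE = re.compile(r"^(.*)_i\d+")

def locus_of(seq_id):
    """Return the gene/locus part of a sequence ID (the ID itself if no match)."""
    m = ISOFORM_RE.match(seq_id)
    return m.group(1) if m else seq_id

def rescore(rows):
    """Collapse isoform-level duplicates to one best-scoring hit per locus.

    rows: iterable of (busco_id, status, seq_id, score) tuples, where status is
    one of "Complete", "Duplicated", "Fragmented", "Missing".
    """
    best = defaultdict(dict)  # busco_id -> {locus: best score seen}
    other = {}                # busco_id -> "Fragmented" or "Missing"
    for busco_id, status, seq_id, score in rows:
        if status in ("Complete", "Duplicated"):
            locus = locus_of(seq_id)
            prev = best[busco_id].get(locus)
            if prev is None or score > prev:
                best[busco_id][locus] = score
        else:
            other[busco_id] = status

    counts = {"Single": 0, "Duplicated": 0, "Fragmented": 0, "Missing": 0}
    for loci in best.values():
        # Still duplicated only if complete hits come from more than one locus.
        counts["Single" if len(loci) == 1 else "Duplicated"] += 1
    for busco_id, status in other.items():
        if busco_id not in best:
            counts[status] += 1
    return counts

# Made-up rows for illustration only.
rows = [
    ("B1", "Duplicated", "TRINITY_DN1_c0_g1_i1", 500.0),
    ("B1", "Duplicated", "TRINITY_DN1_c0_g1_i2", 620.0),
    ("B2", "Duplicated", "TRINITY_DN2_c0_g1_i1", 300.0),
    ("B2", "Duplicated", "TRINITY_DN9_c0_g2_i1", 310.0),
    ("B3", "Complete", "TRINITY_DN3_c0_g1_i1", 410.0),
    ("B4", "Fragmented", "TRINITY_DN4_c0_g1_i1", 90.0),
    ("B5", "Missing", None, None),
]
print(rescore(rows))
```

Here B1 is counted as single-copy because both hits are isoforms of one locus, while B2 stays duplicated because its hits come from two different loci.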

Corollary of 1. Don't measure the longest transcript, but the longest protein, if you must use only one per locus. The longest transcripts are often those with artifacts: joins/chimeras and insertions in the coding sequence that break their protein homology while making them longer.
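A minimal sketch of picking the longest protein (rather than the longest transcript) per locus might look like this. It assumes the proteins have already been predicted by a separate ORF caller and loaded into a dict, and that IDs follow the Trinity `_i<N>` isoform convention; the function name `longest_protein_per_locus` is hypothetical.

```python
import re

# Assumption: Trinity-style isoform IDs, optionally with an ORF suffix such as ".p1".
ISOFORM_RE = re.compile(r"^(.*)_i\d+")

def longest_protein_per_locus(proteins):
    """proteins: dict of sequence ID -> amino-acid string.

    Returns a dict of locus -> (sequence ID, amino-acid string) holding the
    longest protein for each locus. A trailing "*" stop symbol is not counted.
    """
    best = {}
    for seq_id, aa in proteins.items():
        m = ISOFORM_RE.match(seq_id)
        locus = m.group(1) if m else seq_id
        length = len(aa.rstrip("*"))
        if locus not in best or length > len(best[locus][1].rstrip("*")):
            best[locus] = (seq_id, aa)
    return best

# Made-up sequences for illustration only.
proteins = {
    "TRINITY_DN1_c0_g1_i1.p1": "MKT*",
    "TRINITY_DN1_c0_g1_i2.p1": "MKTAYIAK*",
    "TRINITY_DN2_c0_g1_i1.p1": "MSS",
}
picked = longest_protein_per_locus(proteins)
for locus, (seq_id, aa) in picked.items():
    print(locus, seq_id, len(aa.rstrip("*")))
```

Length is only a proxy here; as noted above, the best-homology isoform is not always the longest one.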

Point 3. BUSCO's Single/Duplicate measure is not very useful, as most gene sets under-report paralogs. Paralogs are harder to reconstruct, and are often left out of gene sets, making the 'single-copy' estimate of OrthoDB a computational rather than biological criterion. Also, distinguishing locus alternates from paralogs is tricky even with a good chromosome assembly to map loci; alternate isoforms can look like paralogs, and vice versa. My recommendation is to just ignore the BUSCO single/duplicate distinction. Missing and fragmented conserved genes are the ones to be concerned with.

Point 1 details: BUSCO's -m tran (transcript mode) uses a very crude (quick and dirty) method of translating transcripts in all frames, in pieces, into proteins. You should instead use an accurate transcript-to-protein translator and run BUSCO in protein mode to get accurate answers, ones that match what other homology assessments, or public uses, of your transcripts will show.

Do the test yourself: you will get different BUSCO results from -m tran versus -m prot. The reason is that transcripts can have many kinds of artifacts that scramble their coding sequences, and can have parts of coding sequences mashed together in different ways.
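To compare two runs on the numbers that matter here (Missing plus Fragmented), something like the following sketch could be used. It assumes the one-line summary notation that BUSCO prints, of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the helper names and the example values are made up for illustration.

```python
import re

# Assumption: BUSCO's one-line summary notation, e.g.
# "C:80.0%[S:50.0%,D:30.0%],F:8.0%,M:12.0%,n:100"
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

def parse_summary(line):
    """Parse a one-line BUSCO summary into a dict of percentages plus n."""
    m = SUMMARY_RE.search(line)
    if m is None:
        raise ValueError("not a BUSCO one-line summary: %r" % line)
    out = {k: float(v) for k, v in m.groupdict().items() if k != "n"}
    out["n"] = int(m.group("n"))
    return out

def missing_plus_fragmented(line):
    """Percentage of conserved genes that are Missing or Fragmented."""
    s = parse_summary(line)
    return round(s["M"] + s["F"], 1)

# Made-up summaries standing in for a transcript-mode and a protein-mode run.
tran_run = "C:80.0%[S:50.0%,D:30.0%],F:8.0%,M:12.0%,n:100"
prot_run = "C:85.0%[S:55.0%,D:30.0%],F:6.0%,M:9.0%,n:100"
print(missing_plus_fragmented(tran_run), missing_plus_fragmented(prot_run))
```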

-- Don Gilbert

Disclosure: I develop/provide accurate gene reconstruction software called EvidentialGene, and I pay attention to such details of gene data informatics.

score 0 · Answer 2 · 2018-04-26

6.0 years ago

lieven.sterck 15k

Running it once using the longest isoform is the most appropriate way to go. This will give you the result you're looking for == "to assess the completeness of your assembly result".

One thing you might consider is to first do ORF prediction on the transcripts and run the resulting proteins through BUSCO (as the built-in tools to predict genes in BUSCO are less sensitive on transcripts).

6.0 years ago by lieven.sterck 15k

Hi and thanks,

Since genome duplication can sometimes produce similar genes with different functions, do you think it is harmless here to remove all the other isoforms (duplicated genes, alternative splicing?)?

I mean, the species does in fact have a high percentage of duplication. Is it OK to decrease it intentionally?

6.0 years ago by Farbod ★ 3.4k

No, then you're taking it a step too far!

Duplicated genes should be left as they are. Moreover, I would certainly not catalog duplicate genes as isoforms!! They are distinct gene loci; isoforms are distinct transcripts from the same gene locus (== alternative splicing), which is a totally different thing.

6.0 years ago by lieven.sterck 15k