High duplicate BUSCOs after annotation but not assembly
2.2 years ago
jaredbernard ▴ 20

Hi, all. Can anyone tell me why I might have very low duplicate BUSCOs after assembly but quite high after annotation?

After assembly: C:98.1%[S:97.0%,D:1.1%],F:0.6%,M:1.3%,n:2124

After annotation: C:97.0%[S:59.7%,D:37.3%],F:1.0%,M:2.0%,n:2124

By the way, this is an insect sequenced via 10x Genomics and assembled "long reads" with Supernova 2.0.1. I annotated with a close relative as a reference, with Maker set for Eukaryota, and I used BUSCO 5.2.2 with Endopterygota odb10.

According to the recent BUSCO paper, duplicated BUSCOs could be formed from poor assembly of haplotypes, but the level of phasing I got with Supernova should be decent.

Thanks for any advice!
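To make the shift between the two runs easier to see, here is a minimal Python sketch that parses the two one-line summaries quoted above and prints the change per category. The function names (`parse_busco_summary`, `approx_count`) are made up for illustration and are not part of BUSCO; the approximate counts are simply the percentage times n, rounded.

```python
import re

# Pattern for the one-line BUSCO summary, e.g.
# C:98.1%[S:97.0%,D:1.1%],F:0.6%,M:1.3%,n:2124
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the percentages (floats) and n (int) from a BUSCO summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    scores = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    scores["n"] = int(match.group("n"))
    return scores


def approx_count(pct, n):
    """Approximate number of BUSCOs behind a percentage (hypothetical helper)."""
    return round(pct / 100 * n)


assembly = parse_busco_summary("C:98.1%[S:97.0%,D:1.1%],F:0.6%,M:1.3%,n:2124")
annotation = parse_busco_summary("C:97.0%[S:59.7%,D:37.3%],F:1.0%,M:2.0%,n:2124")

for key in ("C", "S", "D", "F", "M"):
    delta = annotation[key] - assembly[key]
    print(f"{key}: {assembly[key]:5.1f}% -> {annotation[key]:5.1f}% ({delta:+.1f} points)")

print("duplicated BUSCOs, approx.:",
      approx_count(assembly["D"], assembly["n"]), "->",
      approx_count(annotation["D"], annotation["n"]))
```

Single-copy drops by 37.3 points while duplicated rises by 36.2 points, and complete only changes by 1.1, so nearly all of the difference is BUSCOs moving from single-copy to duplicated rather than genes being lost.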

BUSCO orthologs assembly genomics annotation • 3.2k views

<strike>Sounds like a fragmented annotation.</strike>
You might check what happens to the BUSCO genes between your assembly and the annotation using agat_sp_compare_two_BUSCOs.pl from AGAT.


Thanks for the quick response, Juke. I'll look into AGAT, although I think I've had problems with both Singularity and Docker.

What exactly do you mean by fragmented annotation? I masked my custom repeat library and did 3 rounds of Maker with a close relative as protein evidence, as well as the Swiss-Prot omnibus. Each round used Augustus and SNAP trained on the prior round. So I'm not sure where the problem would occur, or what would be done differently. (I've also done downstream GO work, but may need to redo this depending on what the issue is.)


My first thought was a fragmented annotation (i.e. a gene that is complete in the assembly but annotated in several pieces). But I was wrong, because that should show up in the fragmented part of the BUSCO score, which is not your case. Using agat_sp_compare_two_BUSCOs.pl you will probably be able to work out what is going on. You need to load the tracks in a genome browser and look at which duplicated BUSCOs are found in the annotation. Were they already found by BUSCO in the assembly, or are they new genes?

Your annotation BUSCO score is really good. I think your annotation went well and you annotated "new" genes. Whether they are real duplicates or artifacts due to assembly/phasing issues is something you should investigate.


Thanks, Juke. I think you're right. I've been using JBrowse to visualize, so I'll take another look. Yes, I'm concerned that the duplicates are legitimate, so I want to be certain before filtering. I will look into AGAT again to see whether it highlights any issues.

2.2 years ago
Michael 54k

Does "after annotation" mean you ran BUSCO on the predicted genes or proteins in transcriptome or protein mode? Otherwise, it doesn't make much sense, because the annotation does not alter the assembly. The duplication could be caused by predicted isoforms of the same gene. You need to reduce all genes to their longest isoform to get realistic single-copy numbers for the annotation. Because the scores for the assembly are excellent, I think your assembly is not the problem. Your annotation is likely fine as well; it is just this technical detail.

Edit: As pointed out by Juke34, the single isoform annotation should be used only for obtaining BUSCO scores, and possibly single-copy orthologue finding.
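One way to check whether isoforms explain the duplicates is to look at which sequences BUSCO reports for each duplicated BUSCO. The Python sketch below is only a rough illustration under two assumptions: that the first three tab-separated columns of BUSCO's `full_table.tsv` are BUSCO id, status and sequence id, and that transcript ids follow a Maker-style `<gene>-mRNA-<n>` pattern. `gene_of` and `classify_duplicates` are hypothetical helper names; adjust `gene_of` to your own naming scheme.

```python
import csv
import re
from collections import defaultdict


def gene_of(seq_id):
    """Map a transcript id to its gene id.

    Assumption: ids end in '-mRNA-<n>' (Maker-style); change this for other schemes.
    """
    return re.sub(r"-mRNA-\d+$", "", seq_id)


def classify_duplicates(lines):
    """Split duplicated BUSCOs into isoform-only and multi-gene cases.

    `lines` is any iterable of tab-separated rows (for example an open file).
    Assumed columns: BUSCO id, status, sequence id, then anything else.
    """
    genes_per_busco = defaultdict(set)
    for row in csv.reader(lines, delimiter="\t"):
        # Skip comments, blank lines and rows without a sequence (e.g. Missing).
        if not row or row[0].startswith("#") or len(row) < 3:
            continue
        busco_id, status, seq_id = row[0], row[1], row[2]
        if status == "Duplicated":
            genes_per_busco[busco_id].add(gene_of(seq_id))
    isoform_only = sorted(b for b, g in genes_per_busco.items() if len(g) == 1)
    multi_gene = sorted(b for b, g in genes_per_busco.items() if len(g) > 1)
    return isoform_only, multi_gene


# Toy rows with invented ids, only to show the behaviour.
demo = [
    "# Busco id\tStatus\tSequence\tScore\tLength",
    "1at1\tDuplicated\tgeneA-mRNA-1\t500.0\t300",
    "1at1\tDuplicated\tgeneA-mRNA-2\t480.0\t290",
    "2at1\tDuplicated\tgeneB-mRNA-1\t450.0\t310",
    "2at1\tDuplicated\tgeneC-mRNA-1\t440.0\t305",
    "3at1\tComplete\tgeneD-mRNA-1\t600.0\t400",
    "4at1\tMissing",
]
isoform_only, multi_gene = classify_duplicates(demo)
print("isoform-only duplicates:", isoform_only)
print("multi-gene duplicates:", multi_gene)
```

Duplicates whose hits all come from one gene are the isoform artifact described above; the multi-gene ones are the candidates worth inspecting in a genome browser as possible paralogs or haplotype copies.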


+1 Indeed you have to remove the isoforms.


Thanks for the feedback, Michael. I used BUSCO on the exons from Maker's final annotation, set on transcriptome mode and selecting the lineage Endopterygota. I think you're right about the isoforms -- I had been wary of filtering out isoforms because I am mostly interested in detecting paralogs, but I'll see what the best method is to do this.

Any suggestions? I'm hoping this can be done post-annotation, but whatever works.


AGAT ^^ there is a ‘keep longest isoforms’ script.
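For anyone curious what a longest-isoform filter does in principle, here is a small Python sketch on GFF3 input. It is not AGAT's implementation (the AGAT script, agat_sp_keep_longest_isoform.pl as far as I know, handles many more cases); it assumes a simple gene > mRNA > CDS hierarchy linked by `ID`/`Parent` attributes, and it picks the transcript with the largest summed CDS length per gene. `parse_attributes` and `longest_isoforms` are names invented for this example.

```python
from collections import defaultdict


def parse_attributes(field):
    """Parse a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def longest_isoforms(gff_lines):
    """Return {gene_id: mRNA_id} keeping the mRNA with the longest total CDS.

    Assumes mRNA features carry ID and Parent (the gene), and CDS features
    carry Parent (the mRNA). Ties keep the first transcript encountered.
    """
    gene_of = {}                 # mRNA id -> gene id
    cds_len = defaultdict(int)   # mRNA id -> summed CDS length
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if ftype == "mRNA" and "ID" in attrs and "Parent" in attrs:
            gene_of[attrs["ID"]] = attrs["Parent"]
        elif ftype == "CDS":
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    cds_len[parent] += end - start + 1  # GFF3 is 1-based, inclusive
    best = {}
    for mrna, gene in gene_of.items():
        if gene not in best or cds_len[mrna] > cds_len[best[gene]]:
            best[gene] = mrna
    return best


# Toy annotation with invented ids: g1 has two isoforms, g2 has one.
demo_gff = [
    "##gff-version 3",
    "chr1\tmaker\tgene\t1\t1000\t.\t+\t.\tID=g1",
    "chr1\tmaker\tmRNA\t1\t1000\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tmaker\tCDS\t1\t300\t.\t+\t0\tID=c1;Parent=t1",
    "chr1\tmaker\tmRNA\t1\t1000\t.\t+\t.\tID=t2;Parent=g1",
    "chr1\tmaker\tCDS\t1\t300\t.\t+\t0\tID=c2;Parent=t2",
    "chr1\tmaker\tCDS\t401\t700\t.\t+\t0\tID=c2;Parent=t2",
    "chr1\tmaker\tmRNA\t2000\t2500\t.\t+\t.\tID=t3;Parent=g2",
    "chr1\tmaker\tCDS\t2000\t2299\t.\t+\t0\tID=c3;Parent=t3",
]
kept = longest_isoforms(demo_gff)
print(kept)
```

The transcript ids returned here could then be used to subset the protein or transcript FASTA before running BUSCO, leaving the full annotation untouched.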


Thanks, Juke34. I used AGAT to reduce my duplicated BUSCOs to 7.7%. However, I wonder if filtering the isoforms on the transcriptome makes more sense than post-annotation. Any thoughts?


If the annotation was made on a genome assembly, filtering the isoforms post-annotation for BUSCO is the way to go. If the annotation was made on a transcriptome assembly, then you may also filter the transcriptome as explained here. Basically, the problem can arise in transcriptome assemblies, where several isoforms from closely related genes might all be grouped into a single gene, or a set of isoforms from a single gene might be treated as coming from different genes. When transcripts are mapped to a genome for genome annotation, most of these problems should vanish.


Thanks so much for the advice, Juke. I assembled a transcriptome using Trinity, but I didn't annotate it. Instead I used it as transcript evidence in Maker when annotating a genome. So I suppose filtering the isoforms after genome annotation is probably the way to go.

Just curious: when you say "for BUSCO" do you mean that I would filter isoforms only to correct the BUSCO score, but publish a genome that contains all isoforms? I am concerned about discarding isoforms since they depict alternative splicing.

By the way, I started a new post about this issue to see what others recommend, since it is sort of a tangential topic to this one. :^)


Yes, filtering isoforms here is useful only to get a proper BUSCO score. For the annotation, keep everything. Use agat_sp_statistics.pl to get statistics with and without isoforms.


Optimally, there would be a feature in BUSCO to treat isoforms differently from gene copies. In some borderline cases taking the longest isoform might not even be the best choice.

That could, for example, work using a naming convention in the FASTA header, similar to the ENSEMBL FASTA headers containing gene and transcript ids.

I have simple Perl scripts for doing the single-isoform reduction for both Ensembl- and GenBank-style FASTA files, should there be any need.
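As a rough idea of how a header-based reduction can work (this is a Python sketch, not the Perl scripts mentioned above), the example below relies on the `gene:` field found in Ensembl-style FASTA headers and keeps the longest sequence per gene. `read_fasta` and `longest_per_gene` are hypothetical names, the demo records are invented, and "longest" is just one possible selection rule.

```python
import re

# Ensembl-style headers carry fields such as 'gene:<id>' and 'transcript:<id>'.
GENE_RE = re.compile(r"\bgene:(\S+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(lines):
    """Return {gene_id: (header, sequence)} keeping the longest entry per gene.

    Records without a 'gene:' field are treated as their own gene.
    """
    best = {}
    for header, seq in read_fasta(lines):
        match = GENE_RE.search(header)
        gene = match.group(1) if match else header
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best


# Invented records: two isoforms of G1.1 and one of G2.1.
demo = """>P1 pep gene:G1.1 transcript:T1.1
MKT
>P2 pep gene:G1.1 transcript:T2.1
MKTAYIAK
>P3 pep gene:G2.1 transcript:T3.1
MSS
""".splitlines()

best = longest_per_gene(demo)
for gene, (header, seq) in best.items():
    print(f">{header}\n{seq}")
```

Swapping the `len(seq)` comparison for another criterion (for example a score from a previous BUSCO run) would be the place to handle the borderline cases where the longest isoform is not the best choice.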

2.2 years ago

This is not a direct answer, but maybe you can verify your BUSCO results with MOSGA by uploading both sequences. Generally, the annotation should not change your sequence and therefore should not affect your BUSCO results. But maybe you will find some more differences.

Just disable the phylogenetic analysis and enable BUSCO and EukCC (as a second genome completeness tool). EukCC only requires a free GeneMark-ES/ET/EP license.


Thanks! I'm trying it now.
