Falsely high busco score after genome annotation with braker
5 hours ago
Wilber0x ▴ 60

I have a genome that I am annotating, and there are no reference assemblies or closely related species with genomes sequenced.

I am very pleased with my genome assembly. I assembled it with hifiasm and juicer and used quast to assess its quality. The BUSCO scores for the assembly are as follows:

C:98.7%[S:94.7%,D:4.0%],F:0.5%,M:0.8%,n:1614

I softmasked my genome with EDTA and now I am using braker3 to annotate it. I generated a lot of RNA-seq data, which I aligned using STAR with these parameters:

       STAR --runThreadN 32 \
         --genomeDir "$GENOME_DIR" \
         --readFilesIn "$READ1" "$READ2" \
         --readFilesCommand gunzip -c \
         --twopassMode Basic \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMstrandField intronMotif \
         --outFileNamePrefix "${OUTDIR}/${sample}_"

I also included protein evidence from the BUSCO protein dataset and a FASTA file of monocot proteins that I downloaded from Phytozome. Here is my braker command:

    apptainer exec -B ${PWD}:${PWD} ${BRAKER_SIF} /opt/BRAKER/scripts/braker.pl \
      --genome=/home/genome.fa \
      --bam=/home/mergedRNA.sorted.bam \
      --workingdir=${wd} \
      --softmasking \
      --gff3 \
      --species=nameOfSpeciesModel \
      --GENEMARK_PATH=${ETP}/gmes \
      --threads 8 \
      --prot_seq=proteins_odb10_plants.fa,mastaFasta-protein.filtered.fa \
      --AUGUSTUS_CONFIG_PATH=/home/augustus_config/config \
      &> species1.log

When I ran BUSCO on the resulting genome annotation, these were the results:

    C:96.7%[S:76.0%,D:20.8%],F:0.7%,M:2.6%,n:1614      
    1561    Complete BUSCOs (C)                        
    1226    Complete and single-copy BUSCOs (S)        
    335     Complete and duplicated BUSCOs (D)         
    11      Fragmented BUSCOs (F)                      
    42      Missing BUSCOs (M)                         
    1614    Total BUSCO groups searched        

What could have caused this large increase in duplicated BUSCOs in the annotation (D rose from 4.0% for the assembly to 20.8% for the annotation)? From what I understand, it is not abnormal to see a small increase in duplication from the assembly to the annotation BUSCO scores, but this seems much too great and suggests I have gone wrong in at least one step.
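To make the comparison explicit, here is a minimal sketch that parses the two BUSCO summary lines quoted above and prints the change in each category. The function name `parse_busco` is only illustrative and is not part of BUSCO.

```python
import re

def parse_busco(summary):
    """Parse a BUSCO one-line summary into percentages plus the dataset size n."""
    pattern = (
        r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],"
        r"F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
    )
    match = re.search(pattern, summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {summary!r}")
    c, s, d, f, m = (float(x) for x in match.groups()[:5])
    return {"C": c, "S": s, "D": d, "F": f, "M": m, "n": int(match.group(6))}

# The two summary lines from this post.
assembly = parse_busco("C:98.7%[S:94.7%,D:4.0%],F:0.5%,M:0.8%,n:1614")
annotation = parse_busco("C:96.7%[S:76.0%,D:20.8%],F:0.7%,M:2.6%,n:1614")

for key in ("C", "S", "D", "F", "M"):
    delta = annotation[key] - assembly[key]
    print(f"{key}: {assembly[key]:5.1f}% -> {annotation[key]:5.1f}% ({delta:+.1f} points)")
```

The total of complete BUSCOs changes far less than the split between single-copy and duplicated ones.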

braker genomeannotation genome annotation braker3 • 139 views
4 hours ago
Panos ★ 1.9k

I would check whether the annotation contains alternative transcripts of the same gene. Before running BUSCO you should first "clean" your gene set so that you keep only one transcript per gene. For the purpose of BUSCO it doesn't really matter which one you keep (I usually keep the longest).
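A quick way to run that check is to count transcripts per gene from the FASTA headers. This is only a sketch: it assumes headers shaped like `>g123.t1` (gene ID followed by a `.t<number>` suffix), and the function name `transcripts_per_gene` is made up for illustration.

```python
import re
from collections import Counter

def transcripts_per_gene(fasta_text):
    """Count transcripts per gene from FASTA headers such as '>g123.t1'."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue  # skip an empty header line
        # Assumed naming: strip a trailing '.t<number>' to recover the gene ID.
        gene_id = re.sub(r"\.t\d+$", "", fields[0])
        counts[gene_id] += 1
    return counts

# Tiny in-memory example; for a real run, read the protein FASTA from disk instead.
example = """>g1.t1
MKT
>g1.t2
MKTA
>g2.t1
MSS
"""
counts = transcripts_per_gene(example)
multi = {gene: n for gene, n in counts.items() if n > 1}
print(f"{len(multi)} of {len(counts)} genes have more than one transcript")
```

If a sizeable share of genes has more than one transcript, that alone can inflate the duplicated BUSCO count when the protein set is scored.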


Thank you for the advice. Does this mean that I should clean the merged RNA file I have to check that there are not two copies of the same gene, or clean an output file after running braker?


I mean that you should clean the output file after running braker. When I last ran braker (a year ago), the protein sequences of the final gene set were written to a file named braker.aa. It's this file that you have to clean up before giving it as input to BUSCO. If you open that file and look at the FASTA headers, you can usually tell whether there are multiple transcripts for a gene: search for IDs that differ only in the trailing .t suffix. For example, if you find entries named g123.t1 and g123.t2, then these two are transcripts of the g123 gene.
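Here is a minimal sketch of that clean-up step, keeping the longest protein per gene. It assumes the `g123.t1` style of header described above; the helper names `read_fasta` and `keep_longest_isoform` are illustrative and not part of braker or BUSCO.

```python
import re

def read_fasta(text):
    """Yield (transcript_id, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            header, chunks = (line[1:].split() or [""])[0], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def keep_longest_isoform(text):
    """Return {gene_id: (transcript_id, sequence)} with the longest protein per gene."""
    best = {}
    for transcript_id, seq in read_fasta(text):
        # Assumed naming: the gene ID is the transcript ID minus a '.t<number>' suffix.
        gene_id = re.sub(r"\.t\d+$", "", transcript_id)
        # A strictly longer sequence replaces the current one; ties keep the first seen.
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (transcript_id, seq)
    return best

# Tiny in-memory example standing in for the contents of braker.aa.
example = """>g123.t1
MKTAYIAK
>g123.t2
MKTAYIAKQRQ
>g124.t1
MSSL
"""
for gene_id, (transcript_id, seq) in keep_longest_isoform(example).items():
    print(f">{transcript_id}\n{seq}")
```

For a real run, you would read the text from braker.aa, write the filtered records to a new FASTA file, and give that file to BUSCO in protein mode.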

4 hours ago
Dave Carlson ★ 2.2k

When you ran BUSCO, what lineage did you pick? I've seen cases where picking a lineage that is too broad leads to high apparent rates of duplication, presumably because it doesn't account for gene duplication events that occurred in a subset of taxa within that lineage. Try picking a lineage that includes your species but is more taxonomically exclusive. That might help. It's worth trying anyway.


Thanks for the tip, I am sure that it could be useful in the future. I am using the Embryophyta dataset for both busco runs at the moment, and this is the narrowest lineage including my species.
