I have a genome that I am annotating, and there are no reference assemblies or closely related species with genomes sequenced.
I am very pleased with my genome assembly. I used hifiasm and juicer and I used quast to assess the quality of my assembly. The busco scores for my assembly are as follows:
C:98.7%[S:94.7%,D:4.0%],F:0.5%,M:0.8%,n:1614
I softmasked my genome with EDTA and now I am using braker3 to annotate my genome. I have a lot of RNA seq data that I generated and aligned using STAR with these parameters:
STAR --runThreadN 32 \
--genomeDir "$GENOME_DIR" \
--readFilesIn "$READ1" "$READ2" \
--readFilesCommand gunzip -c \
--twopassMode Basic \
--outSAMtype BAM SortedByCoordinate \
--outSAMstrandField intronMotif \
--outFileNamePrefix "${OUTDIR}/${sample}_"
I also included protein information from the busco protein dataset and a fasta file of monocot proteins that I got off phytozome. Here is my braker command:
apptainer exec -B ${PWD}:${PWD} ${BRAKER_SIF} /opt/BRAKER/scripts/braker.pl \
--genome=/home/genome.fa \
--bam=/home/mergedRNA.sorted.bam \
--workingdir=${wd} \
--softmasking \
--gff3 \
--species=nameOfSpeciesModel \
--GENEMARK_PATH=${ETP}/gmes \
--threads 8 \
--prot_seq=proteins_odb10_plants.fa,mastaFasta-protein.filtered.fa \
--AUGUSTUS_CONFIG_PATH=/home/augustus_config/config \
&> species1.log
When I ran busco on my genome annotation, these were the results:
C:96.7%[S:76.0%,D:20.8%],F:0.7%,M:2.6%,n:1614
1561 Complete BUSCOs (C)
1226 Complete and single-copy BUSCOs (S)
335 Complete and duplicated BUSCOs (D)
11 Fragmented BUSCOs (F)
42 Missing BUSCOs (M)
1614 Total BUSCO groups searched
What could have caused this large increase in duplicated genes amongst the genome annotations? From what I understand it is not abnormal to see a small increase in duplication number from the assembly to annotation busco scores, but this seems much too great and suggests I have gone wrong in at least one step.
Thank you for the advice. Does this mean that I should clean the merged RNA file I have to check that there are not two copies of the same gene, or clean an output file after running braker?
I mean that you should clean the output file after running braker. When I last ran braker (a year ago), the final gene set (as protein sequences) was written to a file named braker.aa. It's this file that you have to clean up before giving it as input to BUSCO, because in protein mode every isoform is scored as a separate sequence and alternative transcripts of one gene show up as duplicates. If you open that file and look at the fasta headers, you can usually tell whether there are multiple transcripts for a gene: search for gene IDs that differ only in the last (t) bit. For example, if you find genes named g123.t1 and g123.t2, then these two are transcripts of the g123 gene.
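If it helps, here is a rough Python sketch of that clean-up: it keeps only the longest protein per gene. It assumes the headers follow the g123.t1 pattern described above (gene ID, then a .tN transcript suffix). The function names and the input/output paths on the command line are made up for illustration and are not part of braker or BUSCO. Check a few headers in your own braker.aa first, since the naming may differ between versions.

```python
import re
import sys


def read_fasta(lines):
    """Yield (id, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first word of the header as the transcript ID.
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(transcript_id):
    # Assumption: transcript IDs look like "g123.t1"; drop the ".tN" suffix.
    return re.sub(r"\.t\d+$", "", transcript_id)


def longest_isoform_per_gene(records):
    """Return {gene_id: (transcript_id, sequence)} keeping the longest protein."""
    best = {}
    for tid, seq in records:
        gid = gene_id(tid)
        # On ties, the first transcript seen is kept.
        if gid not in best or len(seq) > len(best[gid][1]):
            best[gid] = (tid, seq)
    return best


if __name__ == "__main__":
    # Hypothetical usage: python longest_isoform.py braker.aa braker.longest.aa
    in_path, out_path = sys.argv[1], sys.argv[2]
    with open(in_path) as fh:
        best = longest_isoform_per_gene(read_fasta(fh))
    with open(out_path, "w") as out:
        for tid, seq in best.values():
            out.write(f">{tid}\n{seq}\n")
```

You would then give the filtered file to BUSCO in protein mode instead of the original braker.aa. Keeping the longest isoform is just one common convention; if the duplicated fraction stays high afterwards, the cause is probably something other than alternative transcripts.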