Hello! I would like to run a BUSCO analysis on the gene (CDS) and protein (proteome) annotations of an assembly, but I am confused about which file types to use. The annotation for this organism (Phaeodactylum tricornutum [Phatr2]), which is available from JGI (Joint Genome Institute) PhycoCosm, lists data for "All models" and "Filtered Models". Based on their description, "All models" may include several redundant gene models for each locus, whereas "Filtered Models" contains only the best gene model available for each locus.
Under these different categories, files for genes, proteins and transcripts are available. Files under the "Genes" category are all GFF files, while FASTA files are available under "Proteins" and "Transcripts".
For example, from the filtered models dataset in Phatr2:
- Genes: Phatr2.all_proteins.FilteredModels2.gff3.gz.tar.gz (No FASTA files, only GFF annotation files).
- Proteins: Phatr2_chromosomes_geneModels_FilteredModels2_aa.fasta.gz (GFF annotation files are available as well)
- Transcripts: Phatr2_chromosomes_geneModels_FilteredModels2_nt.fasta.gz (No GFF annotation files, only FASTA files).
My questions:
- Why do some categories list GFF files (e.g. Genes) and others only FASTA files (e.g. Transcripts)?
- NCBI Genome lets the user download representative FASTA files for the genome, transcripts and proteins, which I have previously used for BUSCO runs. How do these files differ from those published by JGI? Do the NCBI files contain filtered or all gene models?
- As I am running BUSCO, will using the FASTA files for "All models" affect the run in any way? I.e., will redundant models lead BUSCO to report them as duplications? (I want to run BUSCO on the total set of genes, but JGI does not provide a FASTA file for me to do so).
- Are filtered gene models related to variants in any way?
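To make the redundancy question concrete, here is a minimal sketch (Python, standard library only) of how the number of sequences in an "All models" FASTA could be compared with the filtered one. The function names are my own, and any "All models" file path passed in is hypothetical. Counting records only hints at how many models exist per filtered gene on average; it says nothing about what BUSCO will actually report.

```python
import gzip


def count_fasta_records(path):
    """Count sequences in a FASTA file (plain or .gz) by counting '>' header lines."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def redundancy_ratio(all_models_path, filtered_path):
    """Average number of 'all' models per filtered model.

    A value well above 1 suggests that the 'all' set carries
    several alternative models for many loci.
    """
    filtered = count_fasta_records(filtered_path)
    if filtered == 0:
        raise ValueError("filtered FASTA contains no records")
    return count_fasta_records(all_models_path) / filtered
```

For instance, `redundancy_ratio(<all-models protein FASTA>, "Phatr2_chromosomes_geneModels_FilteredModels2_aa.fasta.gz")` would give the average number of models per filtered gene, assuming both files are protein FASTA files from the same annotation release.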
As I am new to the topic of genome annotation as a whole, I would really appreciate some help in understanding the nomenclature for these files.