I'm using STAR for mapping bacterial genomes.
I obtained the input data from antismash (GBK; n=1911) which I merged and converted into GFF3 and FASTA formats.
I'm retrieving an error when generating the index with the following code:
STAR --runThreadN 30 --runMode genomeGenerate --genomeDir /path_to_genome_dir --genomeFastaFiles /path_to_fasta.fna --sjdbGTFfile /path_to_GFF.gff --sjdbOverhang 149 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon CDS --alignIntronMax 1 --genomeSAindexNbases 12
(Genome length: 51.097.461 bp)
The error is the following:
Aug 10 11:16:18 ..... started STAR run Aug 10 11:16:18 ... starting to generate Genome files Aug 10 11:16:23 ... starting to sort Suffix Array. This may take a long time... Aug 10 11:16:26 ... sorting Suffix Array chunks and saving them to disk... Aug 10 11:16:58 ... loading chunks from disk, packing SA... Aug 10 11:17:02 ... finished generating suffix array Aug 10 11:17:02 ... generating Suffix Array index Aug 10 11:17:12 ... completed Suffix Array index Aug 10 11:17:12 ..... processing annotations GTF terminate called after throwing an instance of 'std::out_of_range' what(): vector::_M_range_check Abortado (`core' generado)
I found a previous question about the same error message, but it remains unanswered.
Reading the messages of an issue in the tool's github, I realised that the ninth column of the GFF does not contain "transcript_id" or "gene_id". There it appears:
STAR requires "transcript_id" in each "exon" line of the GTF file. Otherwise it would not know which transcript the exon belongs to, and won't be able to make junctions. I will replace this hard requirement with a warning in the next release. (12-Nov-2015)
In the manual (4-May-2021), it seems that this is solved with the parameter
In addition to the aforementioned options, for GFF3 formatted annotations you need to use --sjdbGTFtagExonParentTranscript Parent. In general, for --sjdbGTFfile files STAR only processes lines which have --sjdbGTFfeatureExon (=exon by default) in the 3rd field (column). The exons are assigned to the transcripts using parent-child relationship defined by the --sjdbGTFtagExonParentTranscript (=transcript id by default) GTF/GFF attribute.
However, it appears that this is not the issue, as "transcript id" and "exon" have been modified to "Parent" and "CDS", respectively.
But if it was, would it be enough to replace "ID" for "transcript_id"? Here a line of how it looks like:
ISL001_ctg1 GenBank CDS 8645 8890 . - 1 ID=ctg1_124; nRPS_PKS=Domain: PP-binding (3-72). E-value: 5.9e-16. Score: 50.6. Matches aSDomain: nrpspksdomains_ctg1_124_PP-binding.1, type: other; Name=ctg1_124; gene_functions=biosynthetic-additional (rule-based-clusters) PP-binding; gene_kind=biosynthetic-additional; sec_met_domain=PP-binding (E-value: 3.3e-15%2C bitscore: 50.6%2C seeds: 164%2C tool: rule-based-clusters); transl_table=11; translation=length.81
Another imprecision is that the ID refers to the name of the gene but not to the genome where it came from. The correct gene_id is ISL001_ctg1_124. Does the first column account for this, without needing any more details in the ninth to properly map?