My lab recently finished an experiment in which a community of three bacterial species was grown together and its RNA was then sequenced on an Illumina platform. From my perspective, this means that I have .fastq files containing reads from all three species. The plan is thus to separate the reads by species during alignment. I have high-quality reference genomes for each of the three species.
I have used STAR previously to align to a single bacterial genome, which worked great. However, while STAR can take multiple .fasta files for the reference, it can only accept a single .gtf annotation. I'm wondering what to do. I see two main possibilities:
The first is to run three separate alignments on each set of data, each against a different reference genome. The downside here is that many of the genes in these species are well (but not perfectly) conserved, so I think this is likely to produce many false positives, where reads are assigned to the wrong species. I guess this could be dealt with downstream, by finding all reads that mapped to more than one reference and 'giving' them to the species with the highest alignment confidence, but this seems messy.
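To make that downstream clean-up concrete, here is a minimal Python sketch of what I have in mind. It assumes a per-read alignment score has already been pulled out of each of the three alignments (for example from the AS tag in the SAM output); the function name, the species labels, and the choice to set ties aside as "ambiguous" are my own placeholders, not part of any existing tool.

```python
from collections import defaultdict


def assign_reads(scores_by_species):
    """Assign each read to the species whose alignment scored highest.

    scores_by_species: {species_name: {read_id: alignment_score}}
    Returns {read_id: species_name}, or "ambiguous" when the top score is tied.
    """
    # Regroup the scores so that each read lists every species it mapped to.
    hits_per_read = defaultdict(dict)
    for species, scores in scores_by_species.items():
        for read_id, score in scores.items():
            hits_per_read[read_id][species] = score

    assignment = {}
    for read_id, hits in hits_per_read.items():
        top_score = max(hits.values())
        winners = [sp for sp, score in hits.items() if score == top_score]
        # A tie means the read cannot be attributed with confidence.
        assignment[read_id] = winners[0] if len(winners) == 1 else "ambiguous"
    return assignment


# Hypothetical scores, for illustration only.
example = {
    "species_A": {"read1": 98, "read2": 70},
    "species_B": {"read1": 90, "read2": 70},
    "species_C": {"read3": 85},
}
result = assign_reads(example)
```

Here read1 would go to species_A, read2 would be set aside as ambiguous, and read3 would go to species_C.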
I think that the better option is to figure out a way to combine the .gtf files from my three reference species into a single .gtf. I have never done this before. It looks like cellranger (https://github.com/10XGenomics/cellranger) can do this, but I can't find reviews of the package. Is it as simple as running cat on the three files, then scrolling through the result to delete the headers of the second and third .gtf?
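For reference, this is roughly what I imagine the "cat and delete headers" approach looks like as a Python sketch. It is only my guess at a workable procedure: it drops the `#` comment/header lines from every file, concatenates the remaining records, and refuses to continue if two files use the same sequence name in column 1, since the records would then be indistinguishable. The function name and file names are hypothetical, and the sequence names would still need to match the FASTA headers of the combined reference.

```python
def merge_gtfs(gtf_paths, out_path):
    """Concatenate GTF files into one, dropping comment/header lines.

    Raises ValueError if the same sequence name (column 1) appears in
    more than one input file. Returns the number of records written.
    """
    seqname_owner = {}  # sequence name -> first file it was seen in
    records = []
    for path in gtf_paths:
        with open(path) as handle:
            for line in handle:
                # Skip header/comment lines and blank lines.
                if line.startswith("#") or not line.strip():
                    continue
                seqname = line.split("\t", 1)[0]
                owner = seqname_owner.setdefault(seqname, path)
                if owner != path:
                    raise ValueError(
                        f"sequence name {seqname!r} appears in both "
                        f"{owner} and {path}"
                    )
                records.append(line.rstrip("\n") + "\n")

    # Write only after every input has been checked.
    with open(out_path, "w") as out:
        out.writelines(records)
    return len(records)
```

It would be called as something like `merge_gtfs(["speciesA.gtf", "speciesB.gtf", "speciesC.gtf"], "combined.gtf")`, with the three reference .fasta files passed to the aligner alongside the combined annotation.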
Has anyone else successfully combined .gtf files for this purpose, i.e. to align community-derived RNA-Seq reads against multiple prokaryote reference genomes? I see there is a similar question here from five years ago, which has no answers: "Combining Gtf Files".