I have RNAseq data from a co-culture of two bacteria. I also have assembled, annotated genomes for each bacterium (A has 2 contigs; B has 3). I have run hisat2-build on a combined database (5 contigs, named A_1:2 and B_1:3) and have a .sam file of the aligned reads. I aim to run htseq-count on the .sam file. However, I have a separate genome feature file for each genome. It makes sense to combine them prior to running htseq-count, but I'm worried that the program will throw errors since the coordinates of the two genomes will overlap. Below are previews of the gffs (with carriage returns added for readability). Will htseq-count overcount or throw an error because of the overlapping coordinates? Or will it be okay since the first column of each gff file has a contig identifier with the taxon name (A_1 vs. B_1)?
head A_genome.gff
##gff-version 3
##sequence-region A_1 1 3350537
##sequence-region A_2 1 791720
A_1 Prodigal:002006 CDS 259 618 . - 0 ID=n_00001;inference=ab initio prediction:Prodigal:002006;locus_tag=n_00001;product=hypothetical protein
A_1 Prodigal:002006 CDS 725 1828 . - 0 ID=n_00002;eC_number=2.7.2.11;Name=proB_1;db_xref=COG:COG0263;gene=proB_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7B5;locus_tag=n_00002;product=Glutamate 5-kinase
head B_genome.gff
##gff-version 3
##sequence-region B_1 1 3532203
##sequence-region B_2 1 1915494
##sequence-region B_3 1 337275
B_1 Prodigal:002060 CDS 798 1475 . + 0 ID=n_00001;inference=ab initio prediction:Prodigal:002060;locus_tag=n_00001;product=hypothetical protein
B_1 Prodigal:002060 CDS 1901 2113 . - 0 ID=n_00002;inference=ab initio prediction:Prodigal:002060;locus_tag=n_00002;product=hypothetical protein
You may want to consider aligning after "binning" the reads with a tool like bbsplit.sh. See: BBSplit syntax for generating builds for the reference genome and how to call different builds.

Wonderful, thank you so much! I'll check that out.
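On the original plan of combining the two annotation files: the previews show that both files reuse the same IDs (n_00001, n_00002). htseq-count reports counts per value of the attribute chosen with --idattr, so identical IDs from A and B would be hard to tell apart in the output. Below is a minimal Python sketch of one way to merge the files while prefixing ID, locus_tag and Parent with a taxon label. The function names and the prefix scheme are hypothetical, and the sketch assumes tab-delimited GFF3 with nine columns; it is an illustration, not a tested pipeline step.

```python
import re

# Attributes whose values get the taxon prefix (an assumption for this sketch).
_ATTR_RE = re.compile(r"(^|;)(ID|locus_tag|Parent)=")


def prefix_gff_ids(lines, taxon):
    """Yield GFF3 lines with ID/locus_tag/Parent values prefixed by `taxon`."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            # Directives and comments pass through unchanged.
            yield line
            continue
        cols = line.split("\t")
        if len(cols) == 9:
            # Column 9 holds the semicolon-separated attributes.
            cols[8] = _ATTR_RE.sub(rf"\g<1>\g<2>={taxon}_", cols[8])
        yield "\t".join(cols)


def merge_gffs(named_inputs):
    """Merge several GFF3 inputs into one list of lines.

    `named_inputs` is an iterable of (taxon_label, iterable_of_lines).
    Only one ##gff-version header is kept; reading of each input stops
    at a ##FASTA directive if one is present.
    """
    merged = ["##gff-version 3"]
    for taxon, lines in named_inputs:
        for line in prefix_gff_ids(lines, taxon):
            if line.startswith("##FASTA"):
                break
            if not line or line.startswith("##gff-version"):
                continue
            merged.append(line)
    return merged
```

The merged lines could be written to a file such as combined.gff (a hypothetical name) and passed to htseq-count with options like --type CDS --idattr ID. The contig names in column 1 (A_1, B_1, ...) are left untouched so that they still match the reference names in the .sam file.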