I created a gff3 file from GenBank file containing multiple contigs (using Bioperl 'genbank2gff3.pl' tool) . In order to run HTSeq I removed the ##FASTA section (HTSeq doesn't likes). My gff3 file is missing the headers at the start of the file but HTSeq seems doesn' t mind.
I have a result from HTSeq but I'm not sure I did something wrong.
The result seems to be OK. I ran edgeR and there are meaningful things. I want to be sure there is nothing wrong since we will invest some money to validate edgeR results.
Your commands look fine and the GFF looks fine. One way to try verify things work as expected would be to construct a small GTF format version of the first few genes, and then run that to ensure the results are identical to those same genes in the GFF version.
Commenting here so it is noted: I forgot to mention this, but the bioperl script used is best described as a 'first-pass' for getting Genbank into GFF3 format; it's not perfect by any stretch of the imagination, though it tries. If your genome is available in the NCBI FTP downloads there are normally autogenerated GFF3 files NCBI provides that are actually quite good, but this may not be the case for your genome (appears to be WGS).
The htseq docs indicate the GFF version parsed is GFF v2 or GTF; the latter is much more reliable. I don't believe htseq is designed to work with GFF3 yet. I know from personal experience it parses GFF3, but I don't believe the logic used re: gene-transcript relations always works as expected with how GFF3 is designed (for instance, features with CDS-UTR instead of exon features).
Correct github address for the perl script mentioned above is here.
HTSeq's logic for gene-transcript relations is pretty simple -- any features of of the "featuretype" (by default exon) are considered members of the same "gene" (by default gene_id). In the OP's example commands, all CDSs that have the same ID will be considered as coming from the same gene, and exons and UTRs will be ignored.
The danger is in implying GFF3 can be handled by htseq-count. There is no mention of GFF3 support anywhere for GFF3 in their docs. Also, I can think of a least three instances where htseq-count would have problems:
As mentioned above, any time a parent feature has multiple child
features (CDS/UTR, instead of exon) will not work AFAIK,
unless htseq-count can use multiple feature types (I think it only
handles one feature type at the moment). Any reason to leave out
UTR? It's part of the transcript, and this could skew FPKM values
(not a big deal if comparing across samples, but a problem if
comparing within).
In many cases a feature can have multiple Parent features (an exon
present in multiple isoforms for instance), so using a 'Parent' tag
as the aggregator for feature types could fail unless htseq is
accounting for this somehow.
As you mention, in GTF exons or CDS are tied to both transcripts
(via transcript_id) and gene (via gene_id), so calculating
values for transcript and gene is easy if parsing only those
features. This doesn't always hold true for GFF3, where exons are
part of one or more transcript, and each transcript belongs to a
gene (multi-layer hierarchy). Getting gene-level counts would
involve finding out all exons present for each gene (which may mean getting all this from the transcript level info). Some GFF3 may
have custom tags for that, some won't, but it's definitely not consistent.
Good points. As a side note, the GFF3s I've used use the featuretype "exon" to include CDSs and UTRs, even though CDSs and UTRs are also annotated separately on separate lines. So for those files, using just exon actually does include all transcribed parts of the gene. But as you imply, GFF3 files are far from standardized, leading to the issues you describe. I guess the trick is to get to know your GFF3 file well to make sure problems like this won't happen!
Hi Chris, I used 'genbank2gff3.pl' script because I didn't find anything better to go from some genbank files to a single GFF3 file. I now HTSeq doesn't like GFF3. I assume I can get rid of the first 6 lines.
Does HTSeq even use the gene-transcript relation during its operation? Since HTSeq requires us to only specify the feature and name of the gene_id field we want to use to identify the feature, I always assumed that HTSeq strictly reports whatever reads fall within the chosen feature's coordinates according to the scoring schemes: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html. If it does support gene-transcript, it must optionally switch this on if the annotation contains the information, because it isn't necessary for basic operation, like @Chris said.
However, I'm not sure if any of this really matters for your case, since you are working with bacteria, which won't have multiple CDS per gene anyway. Instead of choosing the CDS as your feature of interest, why not choose gene? In every case I saw in your GFF3, the coordinates were the same for the CDS and gene features, and that way your IDs associated with your feature will be something like HPNK_00020 instead of HPNK_00020.p01. If someone thinks this is inaccurate, please correct me, but I can't really think of a reason why you even need the gene-transcript negotiation for a case like this.
For this case (genes == counts), it's likely okay. Any reads spanning genes (like you would find in operons) may be marked as ambiguous based on how you run htseq-count. It's pretty conservative when considering ambiguities.
Just to point out another option: I have heard good things about Rockhopper though I haven't used it myself. It's specifically designed for bacterial RNA-Seq data, and does counts for you as well. Also deals with operon structure, small RNAs, etc. Might be worth a look.
Github link not working
The link is now updated, sorry.
What is it about the results that makes you think you did something wrong?
The result seems to be OK. I ran edgeR and there are meaningful things. I want to be sure there is nothing wrong since we will invest some money to validate edgeR results.
Your commands look fine and the GFF looks fine. One way to try verify things work as expected would be to construct a small GTF format version of the first few genes, and then run that to ensure the results are identical to those same genes in the GFF version.
Commenting here so it is noted: I forgot to mention this, but the bioperl script used is best described as a 'first-pass' for getting Genbank into GFF3 format; it's not perfect by any stretch of the imagination, though it tries. If your genome is available in the NCBI FTP downloads there are normally autogenerated GFF3 files NCBI provides that are actually quite good, but this may not be the case for your genome (appears to be WGS).