Question

Is This A Valid Gff3 File For Htseq?

0

Entering edit mode

11.0 years ago

biotech ▴ 570

Hi there,

I created a gff3 file from GenBank file containing multiple contigs (using Bioperl 'genbank2gff3.pl' tool) . In order to run HTSeq I removed the ##FASTA section (HTSeq doesn't likes). My gff3 file is missing the headers at the start of the file but HTSeq seems doesn' t mind.

I have a result from HTSeq but I'm not sure I did something wrong.

GFF3 file is attached here: https://docs.google.com/file/d/0B8-ZAuZe8jldNUNhZkdVbV9YQXM/edit?usp=sharing

Thank you for your help.

Best, Bernardo

python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID R_2012_09_17_05_52_53_user_PGM-18-BB_Auto_PGM-18-BB_19_2.sam HPNK.gbf.gff > HPNK_2.counts

Link to genbank2gff3.pl script:

https://github.com/bioperl/bioperl-live/tree/master/scripts/Bio-DB-GFF https://github.com/bioperl/bioperl-live/blob/master/scripts/Bio-DB-GFF/bp_genbank2gff3.pl

gff3 rnaseq bacteria • 7.6k views

ADD COMMENT • link 11.0 years ago by biotech ▴ 570

0

Entering edit mode

Github link not working

ADD REPLY • link 11.0 years ago by PoGibas 5.1k

0

Entering edit mode

The link is now updated, sorry.

ADD REPLY • link 11.0 years ago by biotech ▴ 570

0

Entering edit mode

What is it about the results that makes you think you did something wrong?

ADD REPLY • link 11.0 years ago by Ryan Dale 5.0k

0

Entering edit mode

The result seems to be OK. I ran edgeR and there are meaningful things. I want to be sure there is nothing wrong since we will invest some money to validate edgeR results.

ADD REPLY • link 11.0 years ago by biotech ▴ 570

0

Entering edit mode

Your commands look fine and the GFF looks fine. One way to try verify things work as expected would be to construct a small GTF format version of the first few genes, and then run that to ensure the results are identical to those same genes in the GFF version.

ADD REPLY • link 11.0 years ago by Ryan Dale 5.0k

0

Entering edit mode

Commenting here so it is noted: I forgot to mention this, but the bioperl script used is best described as a 'first-pass' for getting Genbank into GFF3 format; it's not perfect by any stretch of the imagination, though it tries. If your genome is available in the NCBI FTP downloads there are normally autogenerated GFF3 files NCBI provides that are actually quite good, but this may not be the case for your genome (appears to be WGS).

ADD REPLY • link 11.0 years ago by Chris Fields ★ 2.2k

score 0 · Answer 1 · 2013-10-18

0

Entering edit mode

11.0 years ago

Chris Fields ★ 2.2k

The htseq docs indicate the GFF version parsed is GFF v2 or GTF; the latter is much more reliable. I don't believe htseq is designed to work with GFF3 yet. I know from personal experience it parses GFF3, but I don't believe the logic used re: gene-transcript relations always works as expected with how GFF3 is designed (for instance, features with CDS-UTR instead of exon features).

Correct github address for the perl script mentioned above is here.

ADD COMMENT • link 11.0 years ago by Chris Fields ★ 2.2k

0

Entering edit mode

HTSeq's logic for gene-transcript relations is pretty simple -- any features of of the "featuretype" (by default exon) are considered members of the same "gene" (by default gene_id). In the OP's example commands, all CDSs that have the same ID will be considered as coming from the same gene, and exons and UTRs will be ignored.

ADD REPLY • link 11.0 years ago by Ryan Dale 5.0k

1

Entering edit mode

The danger is in implying GFF3 can be handled by htseq-count. There is no mention of GFF3 support anywhere for GFF3 in their docs. Also, I can think of a least three instances where htseq-count would have problems:

As mentioned above, any time a parent feature has multiple child features (CDS/UTR, instead of exon) will not work AFAIK, unless htseq-count can use multiple feature types (I think it only handles one feature type at the moment). Any reason to leave out UTR? It's part of the transcript, and this could skew FPKM values (not a big deal if comparing across samples, but a problem if comparing within).
In many cases a feature can have multiple Parent features (an exon present in multiple isoforms for instance), so using a 'Parent' tag as the aggregator for feature types could fail unless htseq is accounting for this somehow.
As you mention, in GTF exons or CDS are tied to both transcripts (via transcript_id) and gene (via gene_id), so calculating values for transcript and gene is easy if parsing only those features. This doesn't always hold true for GFF3, where exons are part of one or more transcript, and each transcript belongs to a gene (multi-layer hierarchy). Getting gene-level counts would involve finding out all exons present for each gene (which may mean getting all this from the transcript level info). Some GFF3 may have custom tags for that, some won't, but it's definitely not consistent.

ADD REPLY • link 11.0 years ago by Chris Fields ★ 2.2k

0

Entering edit mode

Good points. As a side note, the GFF3s I've used use the featuretype "exon" to include CDSs and UTRs, even though CDSs and UTRs are also annotated separately on separate lines. So for those files, using just exon actually does include all transcribed parts of the gene. But as you imply, GFF3 files are far from standardized, leading to the issues you describe. I guess the trick is to get to know your GFF3 file well to make sure problems like this won't happen!

ADD REPLY • link 11.0 years ago by Ryan Dale 5.0k

0

Entering edit mode

Hi Chris, I used 'genbank2gff3.pl' script because I didn't find anything better to go from some genbank files to a single GFF3 file. I now HTSeq doesn't like GFF3. I assume I can get rid of the first 6 lines.

ADD REPLY • link 11.0 years ago by biotech ▴ 570

score 0 · Answer 2 · 2013-10-18

0

Entering edit mode

11.0 years ago

JacobS ▴ 990

Does HTSeq even use the gene-transcript relation during its operation? Since HTSeq requires us to only specify the feature and name of the gene_id field we want to use to identify the feature, I always assumed that HTSeq strictly reports whatever reads fall within the chosen feature's coordinates according to the scoring schemes: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html. If it does support gene-transcript, it must optionally switch this on if the annotation contains the information, because it isn't necessary for basic operation, like @Chris said.

However, I'm not sure if any of this really matters for your case, since you are working with bacteria, which won't have multiple CDS per gene anyway. Instead of choosing the CDS as your feature of interest, why not choose gene? In every case I saw in your GFF3, the coordinates were the same for the CDS and gene features, and that way your IDs associated with your feature will be something like HPNK_00020 instead of HPNK_00020.p01. If someone thinks this is inaccurate, please correct me, but I can't really think of a reason why you even need the gene-transcript negotiation for a case like this.

ADD COMMENT • link 11.0 years ago by JacobS ▴ 990

0

Entering edit mode

Thanks jacobS, using 'gene' makes sense.

ADD REPLY • link 11.0 years ago by biotech ▴ 570

0

Entering edit mode

No problem, hope it helps. And yes, I think you can assume that your output is just fine given the commands and situation you explained.

ADD REPLY • link 11.0 years ago by JacobS ▴ 990

0

Entering edit mode

Hi jacobS,

Because of choosing CDS, i was loosing expression information about the features not having CDS, like tRNAs, so the way to go, at least for me, is:

python -m HTSeq.scripts.count -m intersection-nonempty -t gene -i locus_tag IonXpressRNA_014_R_2013_03_14_10_59_00_user_PGM-67-BB_pool_Auto_user_PGM-67-BB_pool_68.sam HS327.gff > IonXpressRNA_014.counts

insted of:

python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID IonXpressRNA_014_R_2013_03_14_10_59_00_user_PGM-67-BB_pool_Auto_user_PGM-67-BB_pool_68.sam HS327.gff > IonXpressRNA_014.counts

ADD REPLY • link 11.0 years ago by biotech ▴ 570

score 0 · Answer 3 · 2013-10-18

0

Entering edit mode

11.0 years ago

biotech ▴ 570

So, can I assume my HTSeq output is fine with this 'custom' GFF3 file?

ADD COMMENT • link 11.0 years ago by biotech ▴ 570

0

Entering edit mode

For this case (genes == counts), it's likely okay. Any reads spanning genes (like you would find in operons) may be marked as ambiguous based on how you run htseq-count. It's pretty conservative when considering ambiguities.

Just to point out another option: I have heard good things about Rockhopper though I haven't used it myself. It's specifically designed for bacterial RNA-Seq data, and does counts for you as well. Also deals with operon structure, small RNAs, etc. Might be worth a look.

ADD REPLY • link 11.0 years ago by Chris Fields ★ 2.2k