Question: Is This A Valid Gff3 File For Htseq?
0
gravatar for biotech
6.8 years ago by
biotech540
United States
biotech540 wrote:

Hi there,

I created a gff3 file from GenBank file containing multiple contigs (using Bioperl 'genbank2gff3.pl' tool) . In order to run HTSeq I removed the ##FASTA section (HTSeq doesn't likes). My gff3 file is missing the headers at the start of the file but HTSeq seems doesn' t mind.

I have a result from HTSeq but I'm not sure I did something wrong.

GFF3 file is attached here: https://docs.google.com/file/d/0B8-ZAuZe8jldNUNhZkdVbV9YQXM/edit?usp=sharing

Thank you for your help.

Best, Bernardo

python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID R_2012_09_17_05_52_53_user_PGM-18-BB_Auto_PGM-18-BB_19_2.sam HPNK.gbf.gff > HPNK_2.counts

Link to genbank2gff3.pl script:

https://github.com/bioperl/bioperl-live/tree/master/scripts/Bio-DB-GFF https://github.com/bioperl/bioperl-live/blob/master/scripts/Bio-DB-GFF/bp_genbank2gff3.pl

gff3 rnaseq bacteria • 5.4k views
ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by biotech540

Github link not working

ADD REPLYlink written 6.8 years ago by PoGibas4.8k

The link is now updated, sorry.

ADD REPLYlink written 6.8 years ago by biotech540

What is it about the results that makes you think you did something wrong?

ADD REPLYlink written 6.8 years ago by Ryan Dale4.9k

The result seems to be OK. I ran edgeR and there are meaningful things. I want to be sure there is nothing wrong since we will invest some money to validate edgeR results.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by biotech540

Your commands look fine and the GFF looks fine. One way to try verify things work as expected would be to construct a small GTF format version of the first few genes, and then run that to ensure the results are identical to those same genes in the GFF version.

ADD REPLYlink written 6.8 years ago by Ryan Dale4.9k

Commenting here so it is noted: I forgot to mention this, but the bioperl script used is best described as a 'first-pass' for getting Genbank into GFF3 format; it's not perfect by any stretch of the imagination, though it tries. If your genome is available in the NCBI FTP downloads there are normally autogenerated GFF3 files NCBI provides that are actually quite good, but this may not be the case for your genome (appears to be WGS).

ADD REPLYlink written 6.8 years ago by Chris Fields2.1k
0
gravatar for Chris Fields
6.8 years ago by
Chris Fields2.1k
University of Illinois Urbana-Champaign
Chris Fields2.1k wrote:

The htseq docs indicate the GFF version parsed is GFF v2 or GTF; the latter is much more reliable. I don't believe htseq is designed to work with GFF3 yet. I know from personal experience it parses GFF3, but I don't believe the logic used re: gene-transcript relations always works as expected with how GFF3 is designed (for instance, features with CDS-UTR instead of exon features).

Correct github address for the perl script mentioned above is here.

ADD COMMENTlink written 6.8 years ago by Chris Fields2.1k

HTSeq's logic for gene-transcript relations is pretty simple -- any features of of the "featuretype" (by default exon) are considered members of the same "gene" (by default gene_id). In the OP's example commands, all CDSs that have the same ID will be considered as coming from the same gene, and exons and UTRs will be ignored.

ADD REPLYlink written 6.8 years ago by Ryan Dale4.9k
1

The danger is in implying GFF3 can be handled by htseq-count. There is no mention of GFF3 support anywhere for GFF3 in their docs. Also, I can think of a least three instances where htseq-count would have problems:

  1. As mentioned above, any time a parent feature has multiple child features (CDS/UTR, instead of exon) will not work AFAIK, unless htseq-count can use multiple feature types (I think it only handles one feature type at the moment). Any reason to leave out UTR? It's part of the transcript, and this could skew FPKM values (not a big deal if comparing across samples, but a problem if comparing within).
  2. In many cases a feature can have multiple Parent features (an exon present in multiple isoforms for instance), so using a 'Parent' tag as the aggregator for feature types could fail unless htseq is accounting for this somehow.
  3. As you mention, in GTF exons or CDS are tied to both transcripts (via transcript_id) and gene (via gene_id), so calculating values for transcript and gene is easy if parsing only those features. This doesn't always hold true for GFF3, where exons are part of one or more transcript, and each transcript belongs to a gene (multi-layer hierarchy). Getting gene-level counts would involve finding out all exons present for each gene (which may mean getting all this from the transcript level info). Some GFF3 may have custom tags for that, some won't, but it's definitely not consistent.
ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Chris Fields2.1k

Good points. As a side note, the GFF3s I've used use the featuretype "exon" to include CDSs and UTRs, even though CDSs and UTRs are also annotated separately on separate lines. So for those files, using just exon actually does include all transcribed parts of the gene. But as you imply, GFF3 files are far from standardized, leading to the issues you describe. I guess the trick is to get to know your GFF3 file well to make sure problems like this won't happen!

ADD REPLYlink written 6.8 years ago by Ryan Dale4.9k

Hi Chris, I used 'genbank2gff3.pl' script because I didn't find anything better to go from some genbank files to a single GFF3 file. I now HTSeq doesn't like GFF3. I assume I can get rid of the first 6 lines.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by biotech540
0
gravatar for JacobS
6.8 years ago by
JacobS920
Cleveland, Ohio
JacobS920 wrote:

Does HTSeq even use the gene-transcript relation during its operation? Since HTSeq requires us to only specify the feature and name of the gene_id field we want to use to identify the feature, I always assumed that HTSeq strictly reports whatever reads fall within the chosen feature's coordinates according to the scoring schemes: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html. If it does support gene-transcript, it must optionally switch this on if the annotation contains the information, because it isn't necessary for basic operation, like @Chris said.

However, I'm not sure if any of this really matters for your case, since you are working with bacteria, which won't have multiple CDS per gene anyway. Instead of choosing the CDS as your feature of interest, why not choose gene? In every case I saw in your GFF3, the coordinates were the same for the CDS and gene features, and that way your IDs associated with your feature will be something like HPNK_00020 instead of HPNK_00020.p01. If someone thinks this is inaccurate, please correct me, but I can't really think of a reason why you even need the gene-transcript negotiation for a case like this.

ADD COMMENTlink written 6.8 years ago by JacobS920

Thanks jacobS, using 'gene' makes sense.

ADD REPLYlink written 6.8 years ago by biotech540

No problem, hope it helps. And yes, I think you can assume that your output is just fine given the commands and situation you explained.

ADD REPLYlink written 6.8 years ago by JacobS920

Hi jacobS,

Because of choosing CDS, i was loosing expression information about the features not having CDS, like tRNAs, so the way to go, at least for me, is:

python -m HTSeq.scripts.count -m intersection-nonempty -t gene -i locus_tag IonXpressRNA_014_R_2013_03_14_10_59_00_user_PGM-67-BB_pool_Auto_user_PGM-67-BB_pool_68.sam HS327.gff > IonXpressRNA_014.counts

insted of:

python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID IonXpressRNA_014_R_2013_03_14_10_59_00_user_PGM-67-BB_pool_Auto_user_PGM-67-BB_pool_68.sam HS327.gff > IonXpressRNA_014.counts

ADD REPLYlink written 6.7 years ago by biotech540
0
gravatar for biotech
6.8 years ago by
biotech540
United States
biotech540 wrote:

So, can I assume my HTSeq output is fine with this 'custom' GFF3 file?

ADD COMMENTlink written 6.8 years ago by biotech540

For this case (genes == counts), it's likely okay. Any reads spanning genes (like you would find in operons) may be marked as ambiguous based on how you run htseq-count. It's pretty conservative when considering ambiguities.

Just to point out another option: I have heard good things about Rockhopper though I haven't used it myself. It's specifically designed for bacterial RNA-Seq data, and does counts for you as well. Also deals with operon structure, small RNAs, etc. Might be worth a look.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Chris Fields2.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1806 users visited in the last hour