start - start position of the feature in positive 1-based integer
coordinates
always less than or equal to end end - end position of the feature in positive 1-based integer coordinates
But this kind of weird features will not be of much use when counting...
Although it is interesting to find out how this annotation was produced resulting in this, they are quite useless for counting reads since it's an empty interval and as such no read can map 'in' that feature. So for this yes I would suggest to a) check how many cases like this exist and if not too many b) remove those
When did you download the data? Ensembl Genomes (and its Plants division) had a new release yesterday (release 32). The release 32 gff3 data seems to be different from release 31 gff3 from the Ensembl Genomes FTP site. If the issue persists, please send an email to the Ensembl Genomes helpdesk. What is the gene you looked at as an example? PGSC?
So, your gene is minus strand, you can have a look at it here
The CDS is distributed over 7 exons see details here
The end of the 6th exon is ...CTT TA that is part of the CDS (in blue ).
It misses one nucleotide to finish the last codon. This last nucleotide is on the 7th exon and is G.....
TAG = stop codon = end of your CDS.
That is completely in line with what says the gff !
You can easily check what is at the position 33567472-33567472 of the chromosome 11
It doesn't correspond to a position between 2 nucleotides but correspond well to one nucleotide !
It looks correct for me.
This piece of the CDS is size 1 ! (Length = end-start+1)
It's just a case where your CDS start on the last nucleotide of one exon, and the next of the CDS is on the next exon. So, if you remove this feature, you will break the ORF of your CDS.
That is what I am afraid off - that I am going to remove some informations that are there for some reasons. So, to parse this gff file, would you suggest to add 1 to start position? or at end position? Or maybe according to plus or minus strand? The HTseq perform simple subtraction, and since those values are the same it ends up with 0 (error "Cannot subset to zero-lenght interval" so I need to modify this ...
For zero-length features, such as insertion sites, start equals end
and the implied site is to the right of the indicated base in the
direction of the landmark.
But I can't find another reference or something about how it's exactly defined.
Ok. So here is what I think should be done in this situation: merging cds to trancript - that will remove all 0 values. Somebody disagree?
I think gffread program will do that ...I hope so ...
I would like to add all cds and present it as start:end of transcript. So I will have transcript 1 of gene 1 with start: end, transcript2 of gene 1 with start: end... Without dividing the data into cds. In that case I would still have the information about last nucleotide, and no 0 values. I don't see any other solution.
But doing that you modify the information ... not the whole transcript is coding ! I don't know what you want to compute exactly but you will consider untranslated region as translated.
The problem you encountered is a problem only when the start and end are the same (your case) or for any location ? I guess HTSeq don't handle the location correctly ... they omit one nucleotide by location defined. So, is it an error in their implementation or they are not handle base-1 location ?
If the problem affect all location you just have to add +1 to each END position in positive strand, and -1 to each START position in negative strand feature.
According to ftp://ftp.ensemblgenomes.org/pub/plants/release-32/gff3/solanum_tuberosum/README
But this kind of weird features will not be of much use when counting...
Thanks, so you are suggesting to remove those annotations?
Although it is interesting to find out how this annotation was produced resulting in this, they are quite useless for counting reads since it's an empty interval and as such no read can map 'in' that feature. So for this yes I would suggest to a) check how many cases like this exist and if not too many b) remove those
Thanks, will do as you say :)
When did you download the data? Ensembl Genomes (and its Plants division) had a new release yesterday (release 32). The release 32 gff3 data seems to be different from release 31 gff3 from the Ensembl Genomes FTP site. If the issue persists, please send an email to the Ensembl Genomes helpdesk. What is the gene you looked at as an example? PGSC?
It is release 32. The name of file is Solanum_tuberosum.SolTub_3.0.32.gff3. ID=gene=PGSC0003DMG400022057