Question: Gtf Format End Coord Included Or Excluded
gravatar for vodka
7.5 years ago by
vodka80 wrote:

Hello, I've been using htseq-count (and the related python library to develop my own tools) to obtain read counts over known genes many times but only yesterday I decided (due to a modification in the default value in the library about the openness or closedness of the Ensembl gtf coords) to perform an exaustive check...I have to say that the results puzzled me. Following the suggestion given in the manual ( I checked the divisibility by 3 of the end-begin+1 intervals reported for CDF and stop_codon feature but the results were not homogeneus as I expected:

$ wget
$ zcat Homo_sapiens.GRCh37.70.gtf.gz | awk '$3 == "CDS" {P=($5-$4+1)/3; print  $1,$3,P==int(P)}' | grep -c "0$"
$ zcat Homo_sapiens.GRCh37.70.gtf.gz | awk '$3 == "CDS" {P=($5-$4+1)/3; print  $1,$3,P==int(P)}' | grep -c "1$"

I checked by hand a few examples and there are protein coding genes in both categories. For stop-codons the results are a little more in agreement with the docs: 796 zeroes and 82902 ones. What is really buggering me is that I am not sure what CDS and stop_codon features are...reading the docs in a hurry some times ago I did not realize that the whole modulus 3 question is strange as referred to genomic coordinates. Reading here: it seems that CDS refers to exons and thus I don't expect any divisibility by 3 a priori, while for stop_codon I do (or not, if they can be spliced as reported in the last link).

My final question is: which is the coordinates standard (1- or 0- based, closed or open intervals) for gtf found on Ensembl and on the illumina project gtf linked in the tophat page? I do realize that a single base should not be that much relevant but if there isn't any agreement on this basic things I suspect that bigger problems will arise.

gtf annotation coordinates • 2.0k views
ADD COMMENTlink modified 7.5 years ago by Devon Ryan96k • written 7.5 years ago by vodka80
gravatar for Devon Ryan
7.5 years ago by
Devon Ryan96k
Freiburg, Germany
Devon Ryan96k wrote:

You have a few questions, so I'll try to get to all of them in turn:

1) The heterogeneity you observed is due to splicing, which can happen in the middle of (not just between) codons. That is also the case for stop codons, though it seems to occur less frequently. That sort of thing is normal. In genomic, as opposed to transcript, coordinates, there's no guarantee that length%3==0.

2) CDS features just denote coding sequences (these may be entire exons or only a portion of them), i.e., the part that codes for the protein sequence. stop_codon denotes the TGA/TAG/TAA sequence that denotes the end-point of translation. Anything following that would be 3' UTR (likewise, anything in an exon before the start_codon feature is 5" UTR).

3) GTF files are 1-based with closed intervals. That is to say that if column 4 is A and column 5 is B, then the interval is [A,B].

ADD COMMENTlink written 7.5 years ago by Devon Ryan96k

Thank you very much! Everything is clear now, I think that htseq-counts docs should be changed because right now they are misleading as long as they declare that the "type" of a gtf could be derived analyzing %3 on CDS.

ADD REPLYlink written 7.5 years ago by vodka80
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2043 users visited in the last hour