Question: Calculate CDS phase in gff3 format
2
17 months ago by
Germany, Mannheim, UMM
marongiu.luigi450 wrote:

Dear all,

I am trying to convert a genbank file into genome feature format version 3. Using the GFF3 online validator the file I have prepared are OK but there are issues with some of the phases of the CDS. I am summarizing the issues in this table:

``````entry   phase   start   end     delta   Delta third
1           0   76014   76437   423         141.00
2           2   76532   76794   262          87.33
3           0   76901   77296   395         131.67
4           0   82169   82217   48           16.00
5           2   82326   83644   1,318       439.33
6           0   83960   85309   1,349       449.67
7           1   86174   88442   2,268       756.00
8           0   88544   89010   466         155.33
9           1   89700   90945   1,245       415.00
10          0   91042   91496   454         151.33

entry   phase   Minus 1 Minus 1 third   Minus 2     Minus 2 third
1       0         422   140.67          421         140.33
2       2         261   87.00           260          86.67
3       0         394   131.33          393         131.00
4       0          47   15.67            46          15.33
5       2        1317   439.00         1316         438.67
6       0        1348   449.33         1347         449.00
8       1        2267   755.67         2266         755.33
9       0         465   155.00          464         154.67
10      1        1244   414.67         1243         414.33
11      0         453   151.00          452         150.67
``````

`entry` is a given CDS feature.

`phase` is the suggested phase from the validator

`start` is the start position of the CDS and `end` its end position

`delta` is the difference `end - start`

`delta third` is `delta/3`

`Minus 1` is `delta -1` and `Minus 1 third` is `(delta -1)/3`

`Minus 2` is `delta -2` and `Minus 2 third` is `(delta -2)/3`

Taking the entry #1, its length delta is divisible by 3 so it makes sense that the validator accepted a phase of 0. Same thing for feat. 4, 6, 8, 10.

The second entry it is not divisible by 3 directly so it is understandable that the validator has flagged it out. Shortening the length of the feature by 1 nucleotide (Minus 1), the feature is now divisible by 3, thus I expected to change the phase to 1, not 2. Same thing for feat. 5.

The third feature is also not divisible by 3, yet the validator did not flag it. It is divisible by 3 after removing two nucleotides, thus I thought the phase should have been 2.

Features 7 and 9 are divisible by 3, yet are flagged with a phase of 1.

It is pretty confusing for me and I haven't found many tutorials online on the subject.

How do I calculate the phase of the CDS based on the start-end positions?

Thanks

modified 17 months ago by Carambakaracho2.0k • written 17 months ago by marongiu.luigi450

I voted the comment to the answer; anyway, I now upvoted and checked the answer itself. Thanks.

Cheers - votes on answers are supposed to indicate the relevance of these answer to future users, it's a bit more prominent and has reason (not only human narcissism) ;-)

3
17 months ago by
Carambakaracho2.0k
Germany/Cologne
Carambakaracho2.0k wrote:

The actual gff3 or original genbank feature might have been useful. Typically this problem occurs only in organisms with spliced proteins, for the second and later exons of a transcript.

From your example, I guess the first three CDS entries correspond to three exons of the same gene (given the 5 kb gap). To complete the last, incomplete codon from the previous exon, you require the next codon to start phase number of nucleotides after the indicated position.

For this to make sense, please note that gff uses 1-based coordinates, so your feature length is `end-start+1`. Hence your first feature is 424 nucleotides long and requires a phasing of 2 for the second feature.

Thanks. The files are a bit big. I am pasting here the most relevant lines. This is the GFF3

``````NC_009333   Genbank CDS 76014   76437   .   +   0   Parent=HHV8GK18_gp55;Name=K8.1;Note=envelope glycoprotein;Dbxref=4961469
*NC_009333  Genbank CDS 76532   76794   .   +   0   Parent=HHV8GK18_gp55;Name=K8.1;Note=envelope glycoprotein;Dbxref=4961469
NC_009333   Genbank CDS 76901   77296   .   -   0   Parent=HHV8GK18_gp56;Name=ORF52;Dbxref=4961496
NC_009333   Genbank CDS 77432   77764   .   -   0   Parent=HHV8GK18_gp57;Name=ORF53;Note=envelope glycoprotein gN;Dbxref=4961468
``````

The second line with the asterisk needs a phase of 2. And it is actually an interrupted CDS. If phase 2 was required for all the spliced CDS, then it would be easy to fix the problem.

This is the related GB:

``````gene            76014..76818
/gene="K8.1"
/locus_tag="HHV8GK18_gp55"
/db_xref="GeneID:4961469"
CDS             join(76014..76437,76532..76794)
/gene="K8.1"
/locus_tag="HHV8GK18_gp55"
/note="envelope glycoprotein; ORF51"
/codon_start=1
/product="K8.1"
/protein_id="YP_001129404.1"
/db_xref="GeneID:4961469"
/translation="MSSTQIRTEIPVALLILCLCLVACHANCPTYRSHLGFWQEGWSG
QVYQDWLGRMNCSYENMTALEAVSLNGTRLAAGSPSSEYPNVSVSVEDTSASGSGEDA
IDESGSGEEERPVTSHVTFMTQSVQATTELTDALISAFSGSYSSGEPSRTTRIRVSPV
AENGRNSGASNRVPFSATTTTTRGRDAHYNAEIRTHLYILWAVGLLLGLVLILYLCVP
RCRRKKPYIV"
``````
1

The phase of a CDS feature depends on the associated upstream CDS feature. If there is the length/3 of the previous CDS feature leaves a remainder of 1, your CDS feature requires a phase of 2 (the last base of the previous and the the first two bases of this form a codon triplett). Standalone features and the first features always have a 0 phase value. Genbank translates hence to

``````NC_009333   Genbank CDS 76014   76437   .   +   0   Parent=HHV8GK18_gp55;Name=K8.1;
NC_009333   Genbank CDS 76532   76794   .   +   2   Parent=HHV8GK18_gp55;Name=K8.1;
``````

However your calculation at the top has to be `end-start+1` and gives hence

``````entry   phase   start   end     delta   Delta third delta%3
1           0   76014   76437   424         141.33    1
2           2   76532   76794   264          87.67     2
``````

It becomes a bit trickier for a (hypothetical) third CDS, you'd have to take into account the phase in order to compute the correct modulo of 3 for this CDS feature.

I added the instruction `(end - start +1) % 3` and a carryover of the last module for the split CDS and looks like the validation is OK. So far so good. Thank you.

I got an exception! The GB file states:

``````CDS             join(119458..120319,120411..120986,121098..121151,
121261..121323,121462..121572,121702..121887,
121986..122144,122290..123056,123263..124735)
``````

I am using the code:

``````n=0
for i in \$CDS_list ; do
start <- \$i[start] ; end <- \$i[end] # here I simplified the code becuase it takes too many lines
span=\$((\$end - \$start +1))
pre_mod=\$((\$span % 3))
mod=\$((\$pre_mod + \$n))
if [[ \$mod == 0 ]] ; then
c8=0
n=0
elif [[ \$mod == 1 ]] ; then
c8=1
n=1
elif [[ \$mod == 2 ]] ; then
c8=2
n=2
else
c8=0
n=0
fi
done
``````

where \$CDS_list is the list of CDS positions `119458..120319 120411..120986` etc, \$end are the positions 120319, 120986 etc, \$start are the positions 119458, 120411 etc. Essentially I am carrying over the module from the previous calculation.

For this CDS, the phase comes out (last column):

``````CDS 119458  120319  .   +   1
CDS 120411  120986  .   +   1 ***
CDS 121098  121151  .   +   1
CDS 121261  121323  .   +   1
CDS 121462  121572  .   +   1
CDS 121702  121887  .   +   1
CDS 121986  122144  .   +   1
CDS 122290  123056  .   +   0
CDS 123263  124735  .   +   0
``````

the GFF validator remarks:

`````` CDS feature on line 110 in file has the wrong phase 1 (should be 0)
``````

Line 110 is highlighted.

Since this `phase` concept is a bit hard for me, would you have suggestions on its calculation? What did I got wrong?

Thank you.