Question

CDS phase 0,1,2 in GFF format

8 months ago

Pierre Lindenbaum 161k

This question has been asked before (see "Calculate CDS phase in gff3 format", "Negative value in "phase" line of a gff3 file. What does it mean?", etc.), but I still don't get it.

So let's use an existing GFF3 file: https://github.com/samtools/bcftools/blob/develop/test/csq/ENST00000580206/short.gff

This GFF3 is accepted by 'bcftools csq'.

This is a positive strand gene with only one transcript:

18  ensembl mRNA    20068   124566  .   +   .   ID=transcript:ENST00000358984;

Here are the CDS ordered from 5' to 3'

18  ensembl CDS 20248   20468   .   +   0   ID=CDS:ENSP00000351875;Parent=transcript:ENST00000358984;protein_id=ENSP00000351875
18  ensembl CDS 24394   24508   .   +   1   ID=CDS:ENSP00000351875;Parent=transcript:ENST00000358984;protein_id=ENSP00000351875
18  ensembl CDS 24667   24840   .   +   0   ID=CDS:ENSP00000351875;Parent=transcript:ENST00000358984;protein_id=ENSP00000351875
(...)

So the first base of the first CDS is 18:20248; it is the first base of the coding sequence. Its 0-based position in the coding sequence is 0 and the phase is 0%3==0. OK.

The length of the first CDS is 20468-20248+1=221.

The 0-based index in the coding sequence of the first base of the second CDS (chr18:24394-24508) will therefore be 221. I would expect the phase there to be 221%3=2, but it is 1 in the GFF.

Where am I wrong? Is it a +/-1 shift?

I wrote an awk script for the whole GFF but it is obviously wrong:

grep CDS short.gff | cut -f1-8 | awk 'BEGIN{P=0;} {L=int($5)-int($4)+1;printf("%s p=%d phase=%d L=%d\n",$0,P,P%3,L);P+=L;}'
18  ensembl CDS 20248   20468   .   +   0 p=0 phase=0 L=221
18  ensembl CDS 24394   24508   .   +   1 p=221 phase=2 L=115
18  ensembl CDS 24667   24840   .   +   0 p=336 phase=0 L=174
18  ensembl CDS 26727   26833   .   +   0 p=510 phase=0 L=107
18  ensembl CDS 29643   29780   .   +   1 p=617 phase=2 L=138
(...)

Where am I wrong?

(And a future question about negative-strand CDS: is the phase defined relative to CDS.start or CDS.end?)

gff3 gff cds phasing phase • 726 views


Ah https://genomic.social/@scottcain/110923216702965764

Basically, phase represents how many bases need to be skipped for a given CDS region so that translation will be in frame. Looked at another way, it’s how many bases at the beginning of the CDS region belong to the last codon of the previous CDS region.
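
To put numbers on that definition: if P coding bases precede a CDS segment, the segment starts P%3 bases into a codon, so (3 - P%3) % 3 bases have to be skipped to reach the next codon start. A minimal shell sketch of that arithmetic, using the CDS lengths from the question (the function name `phase_from_offset` is made up for illustration):

```shell
# Hypothetical helper: phase of a CDS segment, given the number of
# coding bases that precede it in the transcript (5' to 3').
phase_from_offset() {
    echo $(( (3 - $1 % 3) % 3 ))
}

phase_from_offset 0     # first CDS: nothing before it
phase_from_offset 221   # second CDS: 221 coding bases before it
phase_from_offset 336   # third CDS: 221 + 115 coding bases before it
```

The three calls print 0, 1 and 0, which are the phases given in column 8 of the GFF for the first three CDS lines.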

This is different from https://www.ensembl.org/info/website/upload/gff.html

One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '

8 months ago by Pierre Lindenbaum 161k

score 1 · Accepted Answer · 2023-08-20

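OK, got it: the phase is the number of bases to skip at the start of the CDS to reach the first base of the next codon. With P the number of coding bases before the CDS, that is 0 if P%3==0, else 3-P%3.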

    $ grep CDS short.gff | cut -f1-8 | awk 'BEGIN{P=0;} {L=int($5)-int($4)+1;printf("%s p=%d phase=%d L=%d\n",$0,P,(P%3==0?0:3-P%3),L);P+=L;}'
    18  ensembl CDS 20248   20468   .   +   0 p=0 phase=0 L=221
    18  ensembl CDS 24394   24508   .   +   1 p=221 phase=1 L=115
    18  ensembl CDS 24667   24840   .   +   0 p=336 phase=0 L=174
    18  ensembl CDS 26727   26833   .   +   0 p=510 phase=0 L=107
    18  ensembl CDS 29643   29780   .   +   1 p=617 phase=1 L=138
    (...)
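
For the negative-strand part of the question, a hedged sketch: it assumes (following the GFF3 specification's statement that phase is counted from the end of the feature on the reverse strand) that '-' CDS segments must be walked from the highest coordinate to the lowest. It has not been validated against a real negative-strand transcript here. `cds_phase` is a hypothetical name, and it expects the tab-separated lines of a single transcript on stdin.

```shell
# Hypothetical helper: recompute the phase of every CDS of ONE transcript.
# stdin : tab-separated GFF3 lines
# stdout: start, end, strand, phase from the file, recomputed phase
cds_phase() {
    awk -F '\t' '$3 == "CDS"' |
    sort -t "$(printf '\t')" -k4,4n |
    awk -F '\t' '
        { n++; lo[n] = $4; hi[n] = $5; strand = $7; given[n] = $8 }
        END {
            P = 0   # coding bases already consumed, walking from 5-prime to 3-prime
            for (k = 1; k <= n; k++) {
                # "+" strand: ascending coordinates; "-" strand: descending (assumption)
                i = (strand == "-") ? n - k + 1 : k
                phase[i] = (3 - P % 3) % 3
                P += hi[i] - lo[i] + 1
            }
            for (i = 1; i <= n; i++)
                printf("%s\t%s\t%s\t%s\t%d\n", lo[i], hi[i], strand, given[i], phase[i])
        }'
}
```

Usage would be something like `grep ENST00000358984 short.gff | cds_phase`; the last two columns should agree wherever the file's phase is correct.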
