This question has been asked before (Calculate CDS phase in gff3 format; Negative value in "phase" line of a gff3 file. What does it mean?; etc.) but I still don't get it.
So let's use an existing GFF3 file: https://github.com/samtools/bcftools/blob/develop/test/csq/ENST00000580206/short.gff
This GFF3 file is valid input for 'bcftools csq'.
This is a positive strand gene with only one transcript:
18 ensembl mRNA 20068 124566 . + . ID=transcript:ENST00000358984;
Here are the CDS features, ordered from 5' to 3':
18 ensembl CDS 20248 20468 . + 0 ID=CDS:ENSP00000351875;Parent=transcript:ENST00000358984;protein_id=ENSP00000351875
18 ensembl CDS 24394 24508 . + 1 ID=CDS:ENSP00000351875;Parent=transcript:ENST00000358984;protein_id=ENSP00000351875
18 ensembl CDS 24667 24840 . + 0 ID=CDS:ENSP00000351875;Parent=transcript:ENST00000358984;protein_id=ENSP00000351875
(...)
So the first base of the first CDS is 18:20248; it's the first base of the cDNA. Its 0-based position in the cDNA is 0 and the phase is 0%3==0. OK.
The length of the first CDS is 20468-20248+1=221
The 0-based cDNA index of the first base of the second CDS (chr18:24394-24508) will therefore be 221. The phase of the cDNA at 221 is 221%3=2, but it's 1 in the GFF.
Where am I wrong? Is it a +/-1 shift?
I wrote an awk script for the whole GFF, but it is obviously wrong:
grep CDS short.gff | cut -f1-8 | awk 'BEGIN{P=0;} {L=int($5)-int($4)+1;printf("%s p=%d phase=%d L=%d\n",$0,P,P%3,L);P+=L;}'
18 ensembl CDS 20248 20468 . + 0 p=0 phase=0 L=221
18 ensembl CDS 24394 24508 . + 1 p=221 phase=2 L=115
18 ensembl CDS 24667 24840 . + 0 p=336 phase=0 L=174
18 ensembl CDS 26727 26833 . + 0 p=510 phase=0 L=107
18 ensembl CDS 29643 29780 . + 1 p=617 phase=2 L=138
(...)
Where am I wrong?
(And a follow-up question about negative-strand CDS: is the phase defined relative to CDS.start or CDS.end?)
Ah https://genomic.social/@scottcain/110923216702965764
This is different from https://www.ensembl.org/info/website/upload/gff.html
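A minimal Python sketch, assuming the GFF3 specification's definition of phase: the number of bases to skip from the 5' end of the CDS feature to reach the first base of the next codon, rather than the codon position of the feature's first base. Under that reading the phase is (3 - P%3)%3, where P is the number of coding bases before the feature, and on the negative strand the 5' end is CDS.end. The function name `gff3_phases` is hypothetical, and the sketch assumes the first coding base of the transcript starts a codon.

```python
# (start, end) of the first five CDS rows listed in the question.
CDS = [(20248, 20468), (24394, 24508), (24667, 24840),
       (26727, 26833), (29643, 29780)]


def gff3_phases(cds, strand="+"):
    """Return [(start, end, phase)] in the same order as the input.

    Assumption: the first coding base of the transcript starts a codon,
    so the 5'-most CDS has phase 0.
    """
    # Walk the segments 5' to 3': ascending on '+', descending on '-'.
    ordered = sorted(cds, key=lambda se: se[0], reverse=(strand == "-"))
    consumed = 0  # coding bases seen before the current segment
    phases = {}
    for start, end in ordered:
        # Bases to skip to reach the next codon start, not consumed % 3.
        phases[(start, end)] = (3 - consumed % 3) % 3
        consumed += end - start + 1
    return [(s, e, phases[(s, e)]) for s, e in cds]


for start, end, phase in gff3_phases(CDS, "+"):
    print(start, end, phase)
```

On these five rows the sketch gives phases 0, 1, 0, 0, 1, which agrees with column 8 of the file, whereas P%3 gives 0, 2, 0, 0, 2.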