Frame of CDS on reverse strand: I don't get it or is there a problem in file?
6.4 years ago

Hi everyone,

I am currently working on a GFF file from PlasmoDB (the Plasmodium reference database), and I'm using a personal Perl script to extract information. Once my script had finished, I observed that almost every gene on the reverse strand does not match its amino-acid sequence. At first I suspected a coding mistake, so I rewrote the whole script with someone else to validate each step, but we got the same result. It could also have been a misunderstanding of the frame definition on the reverse strand, so I checked the definition, and it was what I had understood in the first place.

Finally I went through the GFF file and saw that, for most of the CDS I looked at, the frame doesn't correspond to reality. For example:

02 EuPathDB CDS 389427 389618 . - 2 gene_id "PF3D7_0209400.1-p1"

02 EuPathDB CDS 389742 390404 . - 1 gene_id "PF3D7_0209400.1-p1"

02 EuPathDB CDS 390504 390611 . - 0 gene_id "PF3D7_0209400.1-p1"

We can see that the size of the first CDS (so the last one in the file) is 108, i.e. 36 codons with no nucleotide left over. This should give a frame of 0 for the following CDS, and that's not the case.
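As a quick check of the arithmetic above, here is a minimal Python sketch that computes the length of each CDS from the example and the remainder after division by 3. The names `cds_segments` and `cds_length` are made up for this illustration, and the coordinates are assumed to be 1-based and inclusive, as is usual in GFF.

```python
# CDS segments copied from the example above: (start, end), 1-based inclusive.
cds_segments = [(389427, 389618), (389742, 390404), (390504, 390611)]


def cds_length(start, end):
    """Length of a feature whose start and end coordinates are both inclusive."""
    return end - start + 1


# On the reverse strand the transcript is read from the highest coordinate
# downwards, so the segment with the largest coordinates is the first CDS.
for start, end in sorted(cds_segments, reverse=True):
    length = cds_length(start, end)
    print(start, end, "length:", length, "remainder mod 3:", length % 3)
```

Each of the three segments has a length that is a multiple of 3 (108, 663 and 192 bases).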

Does anyone have an idea? I guess I'm missing something, but I can't figure out what it is...

Thanks

gff frame reverse strand • 3.0k views

It is hard to tell without seeing your script. If you want advice on the script, you should post it, and yes, it is likely that you made a mistake (the most common one is forgetting either the reverse or the complement step before translating negative-strand genes). For that reason you should use existing code as much as possible.
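To illustrate the reverse/complement step mentioned above, here is a minimal plain-Python sketch, not the poster's Perl script. The function names and the toy chromosome are invented for this example; segments are assumed to be 1-based inclusive. A tested library routine would normally be preferable to hand-written code like this.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_cds(chrom_seq, segments, strand):
    """Join CDS segments (1-based inclusive) into the coding sequence.

    Segments are concatenated in genomic order first; for a minus-strand
    gene the joined sequence is then reverse-complemented as a whole.
    """
    parts = [chrom_seq[start - 1:end] for start, end in sorted(segments)]
    joined = "".join(parts)
    return reverse_complement(joined) if strand == "-" else joined


# Toy chromosome: two minus-strand exons (3-8 and 13-15) around a 4-base intron.
toy_chrom = "GGTTAGGGACGTCATGG"
print(spliced_cds(toy_chrom, [(3, 8), (13, 15)], "-"))  # ATGCCCTAA
```

Skipping either the complement or the reversal would give a sequence that no longer starts with ATG in this toy case.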

"We can see that the size of the first CDS (so the last one in the file) is 108, i.e. 36 codons with no nucleotide left over. This should give a frame of 0 for the following CDS, and that's not the case."

No, check your coordinates again (there is 100bp between the first two CDS)!


The problem isn't the script but the understanding of the frame. The distance between the two CDS doesn't matter, because that region will be spliced out.

What I first understood: the frame should be either the reading phase or the position of the nucleotide corresponding to the first base of the first codon of the CDS. This means that a frame equal to '1' corresponds to a codon shared with the previous CDS, which contributes 2 nucleotides, while this CDS contributes the last one.

In the example, the size of the first CDS equals 36 codons, so how could the frame of the following one be '1' ?

And positions are inclusive, so there are 99 bases between the first and second CDS.
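The gap arithmetic can be checked with a one-line helper. This is only an illustrative sketch; `gap_length` is a made-up name, and both coordinates are assumed to be inclusive.

```python
def gap_length(upstream_end, downstream_start):
    """Number of bases strictly between two features with inclusive coordinates."""
    return downstream_start - upstream_end - 1


# Second CDS ends at 390404, first CDS starts at 390504.
print(gap_length(390404, 390504))  # 99
```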


Maybe you are referring to the phase? More information here: http://gmod.org/wiki/GFF This can get quite complicated to sort out but if your script gets the AA sequence wrong I would assume your program is at fault. I would resort to well tested libraries like BioPerl/BioPython anyway.

6.4 years ago

I made a mistake: I was talking about the phase, not the frame (in the GTF documentation, I mostly found "frame" used with the same definition as "phase" in GFF files, my bad).

"the phase indicates where the feature begins with reference to the reading frame"

So it appears I have been talking about the phase since the beginning. But this doesn't change my problem.

"a phase of "1" indicates that the next codon begins at the second base of this region[...]For reverse strand features, phase is counted from the end field."

This is what I understood in the first place. The problem isn't getting the peptide sequence: I get it right from the FASTA file and the coordinates given in the GFF, using BioPerl or any script.
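The quoted rule can be written down as a small Python sketch: the phase is the number of bases to skip from the 5' end of the feature, which is `start` on the forward strand and `end` on the reverse strand. The function name `first_codon_base` is hypothetical and exists only for this illustration.

```python
def first_codon_base(start, end, strand, phase):
    """Genomic coordinate of the first base of the first complete codon.

    Follows the quoted definition: the phase is counted from `start` on the
    forward strand and from `end` on the reverse strand.
    """
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    if strand == "+":
        return start + phase
    if strand == "-":
        return end - phase
    raise ValueError("strand must be '+' or '-'")


# Second CDS of the example, read with phase 0 and with phase 1.
print(first_codon_base(389742, 390404, "-", 0))  # 390404
print(first_codon_base(389742, 390404, "-", 1))  # 390403
```

With phase 1, the codon would begin one base further in from the `end` coordinate, which only makes sense if the previous CDS had left an incomplete codon.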

The problem is understanding this GFF. As an example, if you take the very same CDS from the NCBI GFF and compare it with the PlasmoDB GFF, you can see that the phase isn't the same, and I can't explain or understand the difference.

NCBI:

CDS 390504  390611  .   -   0   ID=cds228
CDS 389742  390404  .   -   0   ID=cds228

PlasmoDB:

CDS 389742  390404  .   -   1   gene_id "PF3D7_0209400.1-p1";
CDS 390504  390611  .   -   0   gene_id "PF3D7_0209400.1-p1";

OK, the ordering isn't the same: NCBI lists the first CDS first, while PlasmoDB lists them in forward-strand order. But this shouldn't affect the phase in any way, since it is calculated only from the size and phase of the previous CDS.

How can this difference be explained? Are there multiple ways to calculate the phase?

And even if one GFF is indicating the phase and the other one the frame, "the frame [...] is simply start modulo 3", so in either case we should read a 0 for the second CDS (389742 modulo 3 is 0).

(When I calculate the phase manually, I get the same results as the NCBI file.)
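For reference, the manual calculation can be sketched in Python as below. This is an illustration under stated assumptions, not the code used by NCBI or PlasmoDB: the function name `expected_phases` is made up, coordinates are 1-based inclusive, and the first CDS of the transcript is assumed to begin with a complete codon (phase 0).

```python
def expected_phases(segments, strand):
    """Return (start, end, phase) for each CDS in transcription order.

    The phase of each CDS is derived from the length and phase of the one
    before it; the first CDS is assumed to have phase 0.
    """
    # Transcription order: ascending coordinates on '+', descending on '-'.
    ordered = sorted(segments, reverse=(strand == "-"))
    result = []
    phase = 0
    for start, end in ordered:
        result.append((start, end, phase))
        length = end - start + 1
        # Bases left over after the last complete codon of this CDS.
        leftover = (length - phase) % 3
        # The next CDS must supply the bases that complete that codon.
        phase = (3 - leftover) % 3
    return result


segments = [(389427, 389618), (389742, 390404), (390504, 390611)]
for start, end, phase in expected_phases(segments, "-"):
    print(start, end, phase)
```

For the three segments of the example this gives phase 0 for every CDS, because each length is a multiple of 3.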


It looks like an error in the PlasmoDB GFF file. This happens more often than people think, even in serious databases.


That is what I think too, but I wanted to be sure I wasn't missing anything before reporting an error. Anyway, I took an older release of the same GFF, and there the phase was correct, so it appears that something may have changed in the EuPathDB annotation pipeline, leading to these differences.

I have sent a report, and I will conclude this post once I get their answer.

Thanks all for your messages
