Hi everyone,
I am currently working on a GFF file from PlasmoDB (the Plasmodium reference database), and I'm using a personal Perl script to extract information from it. Once the script had run, I observed that almost no gene on the reverse strand matches its amino-acid sequence. At first I suspected a coding mistake, so I rewrote the whole script with someone else to validate each step, but we got the same result. It could also have been a misunderstanding of how the frame is defined on the reverse strand, so I checked the definition, and it was what I had understood in the first place.
Finally I went through the GFF file itself and saw that, for most of the CDS I looked at, the frame does not match reality. For example:
02 EuPathDB CDS 389427 389618 . - 2 gene_id "PF3D7_0209400.1-p1"
02 EuPathDB CDS 389742 390404 . - 1 gene_id "PF3D7_0209400.1-p1"
02 EuPathDB CDS 390504 390611 . - 0 gene_id "PF3D7_0209400.1-p1"
We can see that the first CDS in translation order (the last one in the file) is 108 bases long, i.e. 36 codons with no nucleotide left over. The following CDS should therefore have a frame of 0, but the file gives 1.
Does anyone have an idea? I guess I am missing something, but I can't figure out what it is...
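The arithmetic above can be checked with a short sketch (Python here rather than Perl, purely for illustration). It assumes the usual GFF3 meaning of the column, i.e. the number of bases to skip at the start of a CDS to reach the next complete codon, and it assumes the first CDS in translation order starts on a complete codon. The function name `expected_phases` and the tuple layout are made up for this example.

```python
# CDS rows copied from the example: (start, end, strand, frame_in_file).
cds_rows = [
    (389427, 389618, "-", 2),
    (389742, 390404, "-", 1),
    (390504, 390611, "-", 0),
]

def expected_phases(rows):
    """Return (start, end, length, frame_in_file, expected) per CDS, in translation order.

    Assumption: the first translated CDS begins with a complete codon.
    """
    strand = rows[0][2]
    # On the reverse strand the CDS with the highest coordinates is translated first.
    ordered = sorted(rows, key=lambda r: r[0], reverse=(strand == "-"))
    result = []
    coding_bases_so_far = 0
    for start, end, _, frame_in_file in ordered:
        length = end - start + 1  # coordinates are 1-based and inclusive
        # Bases still needed to finish the codon left open by the previous CDS.
        expected = (3 - coding_bases_so_far % 3) % 3
        result.append((start, end, length, frame_in_file, expected))
        coding_bases_so_far += length
    return result

for start, end, length, in_file, expected in expected_phases(cds_rows):
    print(start, end, length, "file:", in_file, "expected:", expected)
```

Under those assumptions all three CDS lengths (108, 663 and 192) are multiples of 3, so the sketch expects 0 for every row, whereas the file lists 0, 1 and 2.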
Thanks
It is hard to tell without seeing your script. If you want advice on it, you should post it, and yes, it is likely that you made a mistake (the most common one is to forget either the reverse or the complement step before translating genes on the negative strand). For that reason you should use existing code as much as possible.
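To make the reverse/complement point concrete, here is a minimal sketch in plain Python (not the poster's Perl script, and not a replacement for a tested library). It assumes the standard genetic code, exon sequences given in ascending genomic order on the forward strand, and a CDS that starts on a complete codon. The helper names `reverse_complement`, `translate` and `translate_cds` are invented for this example.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons from position 0; a trailing partial codon is ignored."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    # Codons containing anything other than A, C, G, T are reported as 'X'.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))

def translate_cds(segments, strand):
    """Join forward-strand CDS segments (ascending genomic order) and translate them."""
    spliced = "".join(segments)
    if strand == "-":
        # Both steps are needed: complementing alone or reversing alone gives nonsense.
        spliced = reverse_complement(spliced)
    return translate(spliced)
```

For instance, `translate_cds(["TTATTT", "CAT"], "-")` splices the two pieces to `TTATTTCAT`, reverse-complements that to `ATGAAATAA`, and returns `MK*`.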
No, check your coordinates again (there are 100 bp between the first two CDS)!
The problem isn't the script but the understanding of the frame. The distance between the two CDS doesn't matter, because that region is spliced out.
What I first understood: the frame is either the reading phase or the position of the nucleotide that is the first base of the first complete codon of the CDS. A frame of '1' therefore means that a codon is split across the boundary, with the previous CDS contributing 2 nucleotides and this CDS contributing the last one.
In the example, the first CDS is exactly 36 codons long, so how could the frame of the following one be '1'?
Also, positions are inclusive, so there are 99 bases between the first and second CDS.
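The two pieces of arithmetic in this post can be written down as a small sketch. It assumes 1-based inclusive coordinates and the GFF3 reading of the column (bases to skip before the first complete codon). The function names are hypothetical.

```python
def feature_length(start, end):
    """Length of a feature whose start and end are both included."""
    return end - start + 1

def gap_between(upstream_end, downstream_start):
    """Number of bases strictly between two features (the intron length here)."""
    return downstream_start - upstream_end - 1

def bases_from_previous_cds(phase):
    """Bases of a split codon contributed by the previous CDS, for phase 0, 1 or 2."""
    return (3 - phase) % 3

print(feature_length(390504, 390611))  # 108 bases, i.e. 36 codons
print(gap_between(390404, 390504))     # 99 bases between the two CDS
print(bases_from_previous_cds(1))      # 2 bases come from the previous CDS
```

With a 108-base first CDS there is no open codon to complete, which is why a value of 1 on the next CDS looks inconsistent under this reading.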
Maybe you are referring to the phase? More information here: http://gmod.org/wiki/GFF This can get quite complicated to sort out, but if your script gets the AA sequence wrong I would assume your program is at fault. In any case, I would rely on well-tested libraries such as BioPerl or BioPython.