There is indeed not always a single way of encoding the same structure in a GFF file, but I have worked a lot with GFF, Tripal, GBrowse and BioPerl imports. Sometimes one also needs a slightly different structure for different applications, so it is often more a matter of adapting the GFF so that it works best with the desired application.
Whenever a feature already has an ID, it should not be changed, but the Name attribute can be used to add a human-readable name to it. An ID is required by many, if not most, applications in order to map features to their parent feature and thereby rebuild gene models (e.g. CDS -> mRNA -> gene).
- exon: If a exon is present in more than one isoform, in my gff3 I have two lines of the same exon (one for transcript1 and the other for
transcript2). The ID for these two lines should be the same?
In general, different entities should have different IDs (exception below), but see: Using Multiple Parent Values In Gff3 Format? If the exons are fully identical, they can be encoded in a single line that lists both transcripts as parents, separated by a comma (Parent=transcript1,transcript2).
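As a rough illustration of the single-line encoding, here is a minimal Python sketch that collapses exon lines with identical coordinates into one line with a comma-separated Parent attribute. The function names and the example lines are made up for this post, the merged line simply keeps the attributes of the first occurrence, and it assumes plain tab-separated GFF3 without escaped characters in column 9.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value (order preserved)."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value
    return attrs


def merge_shared_exons(lines):
    """Collapse feature lines whose first eight columns are identical.

    The surviving line keeps the attributes of the first occurrence and
    collects all parents in one comma-separated Parent attribute.
    """
    merged = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        key = tuple(cols[:8])  # seqid, source, type, start, end, score, strand, phase
        attrs = parse_attributes(cols[8])
        if key not in merged:
            merged[key] = (attrs, [])
        parents = merged[key][1]
        for parent in attrs.get("Parent", "").split(","):
            if parent and parent not in parents:
                parents.append(parent)

    out = []
    for key, (attrs, parents) in merged.items():
        if parents:
            attrs["Parent"] = ",".join(parents)
        column9 = ";".join(f"{k}={v}" for k, v in attrs.items())
        out.append("\t".join(key + (column9,)))
    return out


# Hypothetical example: the same exon listed once per transcript.
example = [
    "chr1\tsrc\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=transcript1",
    "chr1\tsrc\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=transcript2",
    "chr1\tsrc\texon\t300\t400\t.\t+\t.\tID=exon2;Parent=transcript2",
]
for merged_line in merge_shared_exons(example):
    print(merged_line)
```

Whether the target application accepts multiple parents on one line is something to check before converting a file this way.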
- CDS: Please correct me if I'm wrong. All CDS belonging to a given protein must share the ID, isn't it?
Yes, the CDS lines can be seen as the pieces that together form one coding sequence per transcript, so they are a single entity.
GBrowse, for example, can only reconstruct the correct coding sequence from its segments if they all have the same ID.
Whether or not you use IDs, the important thing is to annotate the correct coding sequence. In GBrowse, for example, if UTRs and exons are annotated, the CDS can be inferred, so CDS lines are not needed in the import.
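To make the "one ID, several lines" idea concrete, here is a small Python sketch that groups CDS lines by their ID and returns the segments of each coding sequence in coordinate order. It is only an illustration of the grouping, not how GBrowse or BioPerl do it internally; the function name and example lines are hypothetical.

```python
from collections import defaultdict


def cds_segments_by_id(lines):
    """Group CDS lines by ID and return {ID: sorted list of (start, end)}."""
    groups = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        cds_id = attrs.get("ID")
        if cds_id is None:
            continue  # without an ID the segments cannot be tied together here
        groups[cds_id].append((int(cols[3]), int(cols[4])))
    return {cds_id: sorted(segments) for cds_id, segments in groups.items()}


# Hypothetical example: two segments of one coding sequence, one of another.
example = [
    "chr1\tsrc\tCDS\t300\t400\t.\t+\t1\tID=cds1;Parent=transcript1",
    "chr1\tsrc\tCDS\t120\t200\t.\t+\t0\tID=cds1;Parent=transcript1",
    "chr1\tsrc\tCDS\t150\t200\t.\t+\t0\tID=cds2;Parent=transcript2",
]
print(cds_segments_by_id(example))
```

If each CDS segment had its own ID instead, every group would contain a single segment and the full coding sequence could not be reassembled from the ID alone.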
Finally, in the ID can I put a "random" number, like exon1, exon2,
Yes, that is probably how most files encode it. Ensembl IDs are an example: they simply assign a running number with a prefix ("EMLSAE000000000001") to make stable IDs.
or it is better to mention the parent feature, like,
transcript1.exon1, transcript1.exon2 .
That is more readable, but not required; some cDNA aligners (like GMAP) use a similar format. The downside is that such IDs are less suited as stable IDs. Imagine the parent assignment changes: either you keep an ID that implies a relation that is no longer true and live with the misleading IDs, or you have to change the ID and all references to it, which again makes these IDs "un-stable". That is possibly why Ensembl uses running numbers.
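A running-number scheme is easy to produce yourself. The following Python sketch hands out IDs made of a fixed prefix plus a zero-padded counter; the function name is made up, and the prefix and padding width are only chosen to mimic the example ID above, not taken from any Ensembl specification.

```python
import itertools


def make_id_generator(prefix, width=12):
    """Return a function that yields prefix + zero-padded running number."""
    counter = itertools.count(1)

    def next_id():
        return f"{prefix}{next(counter):0{width}d}"

    return next_id


# Hypothetical usage: the ID carries no information about the parent,
# so re-assigning an exon to another transcript only changes its Parent attribute.
next_exon_id = make_id_generator("EMLSAE")
first = next_exon_id()
second = next_exon_id()
print(first, second)
```

The parent relation then lives only in the Parent attribute, so it can change without touching the ID.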