I am trying to extract information from a GFF file using the gffutils package (https://daler.github.io/gffutils/) in Python. After creating and loading the local database from the GFF file, I want to extract the CDS entries (start and stop positions) for every gene. However, I noticed that some genes have multiple CDS entries, and two or more of them share the same start position, the same stop position, or both.
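For reference, the extraction step looks roughly like the sketch below. This is a minimal illustration rather than my exact code: the helper name `cds_spans_by_gene` and the database path are made up, and it assumes the GFF file records parent/child relationships from gene down to CDS so that `db.children` can find them.

```python
def cds_spans_by_gene(db):
    """Map each gene ID to a list of (cds_id, start, end) tuples.

    `db` is expected to behave like a gffutils.FeatureDB, i.e. to provide
    features_of_type() and children().
    """
    spans = {}
    for gene in db.features_of_type("gene"):
        # children() walks the parent/child relations stored in the database;
        # with no level given it also reaches CDS features below the mRNA.
        spans[gene.id] = [
            (cds.id, cds.start, cds.end)
            for cds in db.children(gene, featuretype="CDS", order_by="start")
        ]
    return spans


# Intended usage (needs the third-party gffutils package and an existing
# database file; "annotation.db" is a hypothetical path):
#
#   import gffutils
#   db = gffutils.FeatureDB("annotation.db")
#   for gene_id, cds_list in cds_spans_by_gene(db).items():
#       print(gene_id, cds_list)
```

Genes with more than one tuple in the resulting list are the cases I am asking about.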
Example: a gene (X) has two CDS entries, CDS_1 and CDS_2, with these positions: CDS_1: start 75221, stop 76890 (transcript ID ABC1.1); CDS_2: start 75221, stop 76908 (transcript ID ABC1.2).
CDS_2 extends 18 bp further than CDS_1, so my inclination is to discard CDS_1 and use only CDS_2 for the mRNA. Is that a correct assumption? Also, how can I delete such potentially obsolete or duplicate entries? I read through the gffutils documentation but could not find a way to do this.
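To make the rule I have in mind concrete, here is a small plain-Python sketch of it, independent of gffutils. The function name `drop_contained_spans` is hypothetical, and it assumes each CDS can be treated as a single interval with inclusive coordinates, which may not hold for transcripts whose CDS is split across several exons. Whether this filtering is biologically appropriate is exactly what I am unsure about.

```python
def drop_contained_spans(spans):
    """Keep only spans that are not strictly contained in another span.

    `spans` is a list of (label, start, end) tuples with inclusive
    coordinates. Exact duplicates are collapsed to their first occurrence.
    """
    kept = []
    seen = set()
    for label, start, end in spans:
        if (start, end) in seen:
            continue  # same coordinates as a span already handled
        seen.add((start, end))
        strictly_contained = any(
            o_start <= start and end <= o_end and (o_start, o_end) != (start, end)
            for _, o_start, o_end in spans
        )
        if not strictly_contained:
            kept.append((label, start, end))
    return kept


# The two CDS entries of gene X from the example above.
example = [("CDS_1", 75221, 76890), ("CDS_2", 75221, 76908)]
print(drop_contained_spans(example))  # [('CDS_2', 75221, 76908)]
```

With this rule CDS_1 is dropped because it lies entirely within CDS_2, while CDS entries that only partially overlap would both be kept.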
Thanks in advance!