Question: Duplicated CDS start/stop entries in GFF-file
gravatar for manaswwm
10 months ago by
manaswwm130 wrote:


I am trying to extract information from a gff-file using the gffutils package ( in python. After creating and loading the "local database" from the gff-file, I want to extract CDS entries for every gene entry (start and stop positions of the CDS), however, I noticed that some genes have multiple entries for CDS, wherein two or more CDS entries either have the same start and/or stop position.

Example : A gene (X) has two CDS entries (CDS_1, CDS_2) and the start and stop positions of these two CDSs are - CDS_1 - start : 75221, stop : 76890 (transcript id - ABC1.1) ; CDS_2 - start : 75221, stop : 76908 (transcript id - ABC1.2)

Now, CDS_2 has 18bp more "coverage" than CDS_1, so my logic asks me to delete CDS_1 and only consider CDS_2 for mRNA. Am I correct in assuming so? Also, how do you delete such potentially obsolete/duplicate(?) entries? I tried reading through the documentation of this package, but I could not find any solution.

Thanks in advance!

gffutils python gff-file • 520 views
ADD COMMENTlink modified 9 months ago • written 10 months ago by manaswwm130
gravatar for Juke34
10 months ago by
Juke344.7k wrote:

It is not duplicate entries they are just isoforms...
You can use from AGAT if you wish to filter the annotation and keep only le longest isoform per locus.
Then you can extract the start codons: --gff file.gff -f file.fasta --eo --up -3
and the stop codons: --gff file.gff -f file.fasta --eo --do "-3"

ADD COMMENTlink modified 10 months ago • written 10 months ago by Juke344.7k

@Juke-34 many thanks!

I guess this should help. Just out of curiosity - I see this package is for Perl? Do you have any suggestions for python? I could always just trim the file of the short isoforms (thanks for correcting me) and load it in python (I suppose?) and then perform the next tasks, but would be super if I could do this in python too!

ADD REPLYlink written 10 months ago by manaswwm130

it is available as conda package so you don't care it is in perl ;)

ADD REPLYlink written 10 months ago by Juke344.7k
gravatar for manaswwm
9 months ago by
manaswwm130 wrote:

@Juke-34 - Many thanks for your suggestion. However, I figured a way out to overcome this problem using the following link (, scroll to "Longest protein for this gene") and some addition of code. In my opinion, following should do the trick -

transcript = {}
for item in db.children(gene, featuretype="mRNA"):
    if db[gene].strand == item.strand:
        transcripts[]= sum(len(sub_item) for sub_item in db.children(item, featuretype="CDS"))

Above, I access all possible mRNA transcripts for a given gene, and I store the "sum" (read concatenate) of all CDS for every transcript in transcript dictionary

trans = (sorted(transcripts.items(), key=lambda x: x[1], reverse=True)[0])[0]
trans_length = (sorted(transcripts.items(), key=lambda x: x[1], reverse=True)[0])[1]

I sort the items in the transcript dictionary in an descending order of the length of the transcript, and access the 0th entry (aka the 1st entry, which will contain the longest transcript). Here, trans contains the id of the longest transcript (which I use later for accessing all CDS in that transcript) and trans_length contains the length of this transcript.

for item in db.children(trans_element, featuretype="CDS"):

Finally, you can access the CDS (or any other feature of interest, like exons, UTRs etc) like above.

@genomax - thanks for suggesting to attach code, apologies for not doing so in the first place

ADD COMMENTlink modified 9 months ago • written 9 months ago by manaswwm130

Unless you post a complete solution of how you did this this answer should move to a comment. some addition of code is not enough for a future visitor to have a working solution.

ADD REPLYlink written 9 months ago by genomax90k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1008 users visited in the last hour