Duplicated CDS start/stop entries in GFF-file
2
0
Entering edit mode
2.0 years ago
manaswwm ▴ 170

Hello,

I am trying to extract information from a gff-file using the gffutils package (https://daler.github.io/gffutils/) in python. After creating and loading the "local database" from the gff-file, I want to extract CDS entries for every gene entry (start and stop positions of the CDS), however, I noticed that some genes have multiple entries for CDS, wherein two or more CDS entries either have the same start and/or stop position.

Example : A gene (X) has two CDS entries (CDS_1, CDS_2) and the start and stop positions of these two CDSs are - CDS_1 - start : 75221, stop : 76890 (transcript id - ABC1.1) ; CDS_2 - start : 75221, stop : 76908 (transcript id - ABC1.2)

Now, CDS_2 has 18bp more "coverage" than CDS_1, so my logic asks me to delete CDS_1 and only consider CDS_2 for mRNA. Am I correct in assuming so? Also, how do you delete such potentially obsolete/duplicate(?) entries? I tried reading through the documentation of this package, but I could not find any solution.

python gffutils gff-file • 1.2k views
1
Entering edit mode
2.0 years ago
Juke34 ★ 6.4k

It is not duplicate entries they are just isoforms...
You can use agat_sp_keep_longest_isoform.pl from AGAT if you wish to filter the annotation and keep only le longest isoform per locus.
Then you can extract the start codons:
agat_sp_extract_sequences.pl --gff file.gff -f file.fasta --eo --up -3
and the stop codons:
agat_sp_extract_sequences.pl --gff file.gff -f file.fasta --eo --do "-3"

0
Entering edit mode

@Juke-34 many thanks!

I guess this should help. Just out of curiosity - I see this package is for Perl? Do you have any suggestions for python? I could always just trim the file of the short isoforms (thanks for correcting me) and load it in python (I suppose?) and then perform the next tasks, but would be super if I could do this in python too!

1
Entering edit mode

it is available as conda package so you don't care it is in perl ;)

0
Entering edit mode
2.0 years ago
manaswwm ▴ 170

@Juke-34 - Many thanks for your suggestion. However, I figured a way out to overcome this problem using the following link (https://libraries.io/github/YeoLab/gffutils, scroll to "Longest protein for this gene") and some addition of code. In my opinion, following should do the trick -

transcript = {}
for item in db.children(gene, featuretype="mRNA"):
if db[gene].strand == item.strand:
transcripts[item.id]= sum(len(sub_item) for sub_item in db.children(item, featuretype="CDS"))


Above, I access all possible mRNA transcripts for a given gene, and I store the "sum" (read concatenate) of all CDS for every transcript in transcript dictionary

trans = (sorted(transcripts.items(), key=lambda x: x[1], reverse=True)[0])[0]
trans_length = (sorted(transcripts.items(), key=lambda x: x[1], reverse=True)[0])[1]


I sort the items in the transcript dictionary in an descending order of the length of the transcript, and access the 0th entry (aka the 1st entry, which will contain the longest transcript). Here, trans contains the id of the longest transcript (which I use later for accessing all CDS in that transcript) and trans_length contains the length of this transcript.

for item in db.children(trans_element, featuretype="CDS"):
print(item.start)


Finally, you can access the CDS (or any other feature of interest, like exons, UTRs etc) like above.

@genomax - thanks for suggesting to attach code, apologies for not doing so in the first place

0
Entering edit mode

Unless you post a complete solution of how you did this this answer should move to a comment. some addition of code is not enough for a future visitor to have a working solution.