Question: Duplicated CDS start/stop entries in GFF-file
0
gravatar for manaswwm
8 days ago by
manaswwm30
manaswwm30 wrote:

Hello,

I am trying to extract information from a gff-file using the gffutils package (https://daler.github.io/gffutils/) in python. After creating and loading the "local database" from the gff-file, I want to extract CDS entries for every gene entry (start and stop positions of the CDS), however, I noticed that some genes have multiple entries for CDS, wherein two or more CDS entries either have the same start and/or stop position.

Example : A gene (X) has two CDS entries (CDS_1, CDS_2) and the start and stop positions of these two CDSs are - CDS_1 - start : 75221, stop : 76890 (transcript id - ABC1.1) ; CDS_2 - start : 75221, stop : 76908 (transcript id - ABC1.2)

Now, CDS_2 has 18bp more "coverage" than CDS_1, so my logic asks me to delete CDS_1 and only consider CDS_2 for mRNA. Am I correct in assuming so? Also, how do you delete such potentially obsolete/duplicate(?) entries? I tried reading through the documentation of this package, but I could not find any solution.

Thanks in advance!

gffutils python gff-file • 83 views
ADD COMMENTlink modified 5 days ago • written 8 days ago by manaswwm30
1
gravatar for Juke-34
8 days ago by
Juke-343.1k
Sweden
Juke-343.1k wrote:

It is not duplicate entries they are just isoforms...
You can use agat_sp_keep_longest_isoform.pl from AGAT if you wish to filter the annotation and keep only le longest isoform per locus.
Then you can extract the start codons:
agat_sp_extract_sequences.pl --gff file.gff -f file.fasta --eo --up -3
and the stop codons:
agat_sp_extract_sequences.pl --gff file.gff -f file.fasta --eo --do "-3"

ADD COMMENTlink modified 8 days ago • written 8 days ago by Juke-343.1k

@Juke-34 many thanks!

I guess this should help. Just out of curiosity - I see this package is for Perl? Do you have any suggestions for python? I could always just trim the file of the short isoforms (thanks for correcting me) and load it in python (I suppose?) and then perform the next tasks, but would be super if I could do this in python too!

ADD REPLYlink written 8 days ago by manaswwm30
1

it is available as conda package so you don't care it is in perl ;)

ADD REPLYlink written 8 days ago by Juke-343.1k
0
gravatar for manaswwm
5 days ago by
manaswwm30
manaswwm30 wrote:

@Juke-34 - Many thanks for your suggestion. However, I figured a way out to overcome this problem using the following link (https://libraries.io/github/YeoLab/gffutils, scroll to "Longest protein for this gene") and some addition of code. In my opinion, following should do the trick -

transcript = {}
for item in db.children(gene, featuretype="mRNA"):
    if db[gene].strand == item.strand:
        transcripts[item.id]= sum(len(sub_item) for sub_item in db.children(item, featuretype="CDS"))

Above, I access all possible mRNA transcripts for a given gene, and I store the "sum" (read concatenate) of all CDS for every transcript in transcript dictionary

trans = (sorted(transcripts.items(), key=lambda x: x[1], reverse=True)[0])[0]
trans_length = (sorted(transcripts.items(), key=lambda x: x[1], reverse=True)[0])[1]

I sort the items in the transcript dictionary in an descending order of the length of the transcript, and access the 0th entry (aka the 1st entry, which will contain the longest transcript). Here, trans contains the id of the longest transcript (which I use later for accessing all CDS in that transcript) and trans_length contains the length of this transcript.

for item in db.children(trans_element, featuretype="CDS"):
    print(item.start)

Finally, you can access the CDS (or any other feature of interest, like exons, UTRs etc) like above.

@genomax - thanks for suggesting to attach code, apologies for not doing so in the first place

ADD COMMENTlink modified 5 days ago • written 5 days ago by manaswwm30

Unless you post a complete solution of how you did this this answer should move to a comment. some addition of code is not enough for a future visitor to have a working solution.

ADD REPLYlink written 5 days ago by genomax75k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 856 users visited in the last hour