Question

How to add the start and stop position information to a gff file?

0

7.2 years ago

I0110 ▴ 140

Hi,

I am using the new tomato ITAG3.0 annotation, but the GFF file does not contain rows for the start and stop codon positions. Is there a way to add them with R or Python? In other words, can the existing range information ("gene", "mRNA", "CDS", "exon") be used to generate a GFF file that includes the start and stop codon positions?

Thanks! Larry

genome R Assembly Python • 5.0k views

ADD COMMENT • link updated 4.6 years ago by Juke34 8.5k • written 7.2 years ago by I0110 ▴ 140

1

Hi, did you solve the problem? Would you mind sharing the solution, please? Thanks.

ADD REPLY • link 4.8 years ago by pigeon0411 ▴ 10

score 3 · Answer 1 · 2019-09-09

I have a perl script for that purpose called

~~gff3_sp_add_start_and_stop.pl available at the GAAS repository.~~

~~gff3_sp_add_start_and_stop.pl --gff infile.gff --fasta infile.fasta -o output.gff~~

agat_sp_add_start_and_stop.pl, part of the GFF toolkit AGAT:

agat_sp_add_start_and_stop.pl --gff infile.gff --fasta infile.fasta -o output.gff

You can specify which codon table to use (1 by default). The script also handles start or stop codons that are split across several exons.

score 1 · Answer 2 · 2018-04-25

1

6.0 years ago

brendanmwee ▴ 10

14 months later... I am currently dealing with the same issue. The approach I have come up with so far is simply to impute the first and last codons of the exon as the start and stop codons. This doesn't work very well, but it allows my pipeline to progress.

The other ideas I have are to use the exon interval to find the AUG nearest to the exon and write start and stop codon entries into the GTF, or to take the entries in my current GTF that match CCDS entries and look up their start and stop codons in the reference GTF.

I will post whatever we come up with in the end.

ADD COMMENT • link 6.0 years ago by brendanmwee ▴ 10

0

What about this: ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.2_release/ITAG3.2_gene_models.gff

It has gene, mRNA, exon, CDS tags.

For an older one: ftp://ftp.solgenomics.net/tomato_genome/annotation/beta_release/ITAG2.90_updated-ITAG2.40/ITAG2.90_gene_models.gff

ADD REPLY • link 6.0 years ago by cpad0112 21k

score 1 · Answer 3 · 2019-08-29

It's a bit old, but I use aegean/CanonGFF and genometools to add the features that are typically meant to be inferred, e.g. UTRs, introns, start codons, and stop codons.

It does rely on the GFF adhering to the standards/conventions though, so you'll need gene>mRNA>CDS/exon gene structures, and the CDS should contain the stop codon.

Typically I would do something like this.

gt gff3 -tidy -sort -retainids my.gff3 | canon-gff3 -i - > my_with_stops.gff3
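If it is unclear whether a file meets the gene>mRNA>CDS/exon requirement, a rough check along these lines may help before running the tools. It is only a Python sketch: the function name `check_hierarchy` is made up for illustration, it looks only at the ID and Parent attributes, and it is no substitute for a real GFF3 validator.

```python
def check_hierarchy(gff_lines):
    """Return a list of problems found in the gene > mRNA > CDS/exon structure."""
    expected = {"mRNA": "gene", "CDS": "mRNA", "exon": "mRNA"}
    types = {}      # feature ID -> feature type
    children = []   # (type, ID, Parent attribute) for features that need a parent
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(kv.strip().split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if "ID" in attrs:
            types[attrs["ID"]] = cols[2]
        if cols[2] in expected:
            children.append((cols[2], attrs.get("ID", "?"), attrs.get("Parent", "")))

    problems = []
    for ftype, fid, parent_attr in children:
        if not parent_attr:
            problems.append(f"{ftype} {fid}: no Parent attribute")
            continue
        # A feature may list several parents, separated by commas.
        for parent in parent_attr.split(","):
            found = types.get(parent, "missing")
            if found != expected[ftype]:
                problems.append(
                    f"{ftype} {fid}: Parent {parent} is {found}, expected {expected[ftype]}"
                )
    return problems
```

With a real file this would be called as `check_hierarchy(open("my.gff3"))`; an empty list means the check found nothing to complain about.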

score 0 · Answer 4 · 2017-02-13

0

7.2 years ago

Michael 54k

That is implicit: the start codon should be at the 5' end of the first CDS, and the stop codon at the 3' end of the last CDS. Mind the strand and, where relevant, the phase. Genome browsers normally don't require the start and stop codon information. There is also possibly a reason not to annotate the start and stop codons: current gene models are notoriously error-prone and often based on automatic prediction only, so stating an exact start codon, for example, implies a possibly undue confidence. Experimental techniques such as ribosome profiling have often delivered surprising results with respect to unexpected translation initiation sites (I'll find you a citation for that...).

(If CDS are not annotated, you have to subtract the 5'/3' UTRs from the terminal exons.)
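A minimal Python sketch of the rule described above, in case it helps. It assumes GFF3 input in which every CDS row carries a Parent attribute naming its mRNA, that the stop codon is included in the CDS, and that neither codon is split across two CDS rows. The function names are made up for illustration.

```python
from collections import defaultdict


def cds_by_transcript(gff_lines):
    """Group CDS rows by Parent: {parent: {"seqid", "strand", "cds": [(start, end), ...]}}."""
    groups = defaultdict(lambda: {"seqid": None, "strand": None, "cds": []})
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.strip().split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                group = groups[parent]
                group["seqid"], group["strand"] = cols[0], cols[6]
                group["cds"].append((int(cols[3]), int(cols[4])))
    return groups


def terminal_codons(cds, strand):
    """Return (start_codon, stop_codon) as 1-based inclusive (start, end) tuples.

    Assumes the stop codon lies inside the CDS and that neither codon is split.
    """
    lo = min(start for start, _ in cds)
    hi = max(end for _, end in cds)
    left, right = (lo, lo + 2), (hi - 2, hi)
    # On the minus strand the 5' end of the CDS is the high genomic coordinate.
    return (left, right) if strand == "+" else (right, left)


example = [
    "chr1\tsrc\tCDS\t101\t199\t.\t+\t0\tID=c1;Parent=mRNA1",
    "chr1\tsrc\tCDS\t301\t399\t.\t+\t0\tID=c2;Parent=mRNA1",
    "chr1\tsrc\tCDS\t501\t599\t.\t-\t0\tID=c3;Parent=mRNA2",
]
for name, group in cds_by_transcript(example).items():
    print(name, group["strand"], terminal_codons(group["cds"], group["strand"]))
```

For the plus-strand transcript this gives a start codon at 101-103 and a stop codon at 397-399; for the minus-strand one the two swap ends, so the start codon is 597-599 and the stop codon 501-503.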

ADD COMMENT • link 7.2 years ago by Michael 54k

0

I know these facts, but is there a way to do that automatically with R or Python? Thanks a lot!

ADD REPLY • link 7.2 years ago by I0110 ▴ 140

0

Sure there is, you just need to write a little script e.g. based on GRanges in R ;)

ADD REPLY • link 7.2 years ago by Michael 54k

0

A little hint will be much appreciated. Thanks!

ADD REPLY • link 7.2 years ago by I0110 ▴ 140

0

It's a few lines in perl but I don't have time now. If you don't have a solution by tomorrow, send me a gentle reminder.

ADD REPLY • link 7.2 years ago by Michael 54k

0

Thanks! I tried to write it in R, but it is more difficult than I thought. The problem is start and stop codons that fall on an exon-exon junction. In these rare cases, the start or stop codon position needs to be split into two ranges.

ADD REPLY • link 7.2 years ago by I0110 ▴ 140

0

Why not use the CDS annotation instead?

ADD REPLY • link 7.2 years ago by Michael 54k

0

Did you mean using the CDS to annotate the start and stop codon positions? I think we would still encounter the same issue. For example, if I take the first three nucleotides of the first CDS, the codon could still span an exon-exon junction, so I cannot simply create a range from the first CDS row of the gene. I would have to use information from two CDS rows.
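One way to handle the junction case is to walk along the CDS rows in transcript order and collect bases until three have been gathered, so that a split codon simply comes back as two (or more) ranges. The Python sketch below illustrates that idea only. `codon_ranges` and `gff3_rows` are hypothetical helper names, the CDS is assumed to include the stop codon, and the output rows leave score and phase as "." and carry only a Parent attribute.

```python
def codon_ranges(cds, strand, which="start", n=3):
    """Genomic ranges covering the first ("start") or last ("stop") n coding bases.

    cds is a list of 1-based inclusive (start, end) tuples for one transcript.
    Returns a list of (start, end) tuples in ascending order; more than one
    tuple means the codon is split across CDS rows.
    """
    if which not in ("start", "stop"):
        raise ValueError("which must be 'start' or 'stop'")
    # The start codon sits at the low-coordinate end on '+' and at the
    # high-coordinate end on '-'; the stop codon is the other way round.
    from_left = (strand == "+") == (which == "start")
    segments = sorted(cds) if from_left else sorted(cds, reverse=True)
    ranges, needed = [], n
    for start, end in segments:
        take = min(needed, end - start + 1)
        ranges.append((start, start + take - 1) if from_left else (end - take + 1, end))
        needed -= take
        if needed == 0:
            break
    if needed:
        raise ValueError("CDS is shorter than the requested number of bases")
    return sorted(ranges)


def gff3_rows(seqid, source, strand, parent, feature_type, ranges):
    """Format ranges as tab-separated GFF3 rows (score and phase left as '.')."""
    return [
        "\t".join([seqid, source, feature_type, str(s), str(e), ".", strand, ".",
                   f"Parent={parent}"])
        for s, e in ranges
    ]


# A plus-strand transcript whose first CDS row is only 2 nt long,
# so the start codon spans the junction.
cds = [(100, 101), (201, 300)]
for row in gff3_rows("chr1", "inferred", "+", "mRNA1", "start_codon",
                     codon_ranges(cds, "+", "start")):
    print(row)
```

Here the start codon comes out as two rows, 100-101 and 201-201, while `codon_ranges(cds, "+", "stop")` returns the single range 298-300.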

ADD REPLY • link 7.2 years ago by I0110 ▴ 140