Question: How to add the start and stop position information to a gff file?
I0110 (United States) wrote 2.6 years ago:

Hi,

I am using the new tomato ITAG3.0 annotation, but the GFF file does not contain rows for the start and stop codon positions. Is there a way to fix that with R or Python? In other words, is there a way to use the existing range information (the "gene", "mRNA", "CDS", and "exon" rows) to generate a GFF file with the start and stop codon positions?

Thanks! Larry

python assembly R genome • 1.4k views
modified 6 days ago by Juke-34 • written 2.6 years ago by I0110

Hi, did you solve the problem? Would you mind sharing the solution, please? Thanks.

written 7 weeks ago by pigeon041110
Juke-34 (Sweden) wrote 6 days ago:

I have a Perl script for that purpose called gff3_sp_add_start_and_stop.pl:

gff3_sp_add_start_and_stop.pl --gff infile.gff --fasta infile.fasta -o output.gff

You can specify the codon table to use (1 by default). It handles start or stop codons that are split across several exons. The script is available in the GAAS repository.

brendanmwee wrote 16 months ago:

14 months later... I am currently dealing with this same issue. The approach I have come up with so far is simply to impute the start and stop codons as the first and last codons of the exon. This doesn't work very well, but it allows my pipeline to progress.

The other ideas I have are to use the exon interval to find the AUG nearest to the exon and write start and stop codon entries into the GTF, or to take the entries in my current GTF that match CCDS entries and look up their start and stop codons in the reference GTF.
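
As a rough illustration of the nearest-AUG idea, here is a minimal Python sketch. It assumes the chromosome sequence is already loaded as a plain DNA string (so AUG appears as ATG), that the gene is on the forward strand, and that positions are 0-based. `nearest_atg` is a hypothetical helper name, not part of any library. The nearest ATG is only a heuristic guess and need not be the real translation start.

```python
def nearest_atg(seq, pos):
    """Return the 0-based index of the ATG closest to pos, or None if absent.

    Assumption: forward strand only; for a minus-strand gene the sequence
    would have to be reverse-complemented first.
    """
    seq = seq.upper()
    hits = []
    i = seq.find("ATG")
    while i != -1:
        hits.append(i)
        i = seq.find("ATG", i + 1)  # step by one so overlapping hits are kept
    if not hits:
        return None
    # Ties are broken in favour of the upstream (smaller) index.
    return min(hits, key=lambda h: (abs(h - pos), h))


# Example call with a made-up sequence and exon start position.
candidate = nearest_atg("CCATGCCCCATGCC", 8)
```

The returned index would still need converting to 1-based GFF coordinates before writing a start_codon row.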

I will post whatever we come up with in the end


What about this: ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.2_release/ITAG3.2_gene_models.gff

It has gene, mRNA, exon, and CDS tags.

For an older one: ftp://ftp.solgenomics.net/tomato_genome/annotation/beta_release/ITAG2.90_updated-ITAG2.40/ITAG2.90_gene_models.gff

modified 16 months ago • written 16 months ago by cpad011211k
darcy.ab.jones wrote 18 days ago:

It's a bit old, but I use aegean/CanonGFF and genometools to add the features that are typically meant to be inferred, e.g. UTRs, introns, and start and stop codons.

It does rely on the GFF adhering to the standards/conventions though, so you'll need gene>mRNA>CDS/exon gene structures, and the CDS should contain the stop codon.

Typically I would do something like this:

gt gff3 -tidy -sort -retainids my.gff3 | canon-gff3 -i - > my_with_stops.gff3
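
Before running tools that rely on these conventions, a quick pre-check of the gene>mRNA>CDS/exon hierarchy can save time. The following Python sketch is an illustration only, not what genometools or canon-gff3 do internally. `parse_attrs` and `check_hierarchy` are hypothetical helpers, and the check assumes tab-separated GFF3 with ID and Parent attributes in column 9.

```python
def parse_attrs(field):
    """Turn a GFF3 attribute string such as 'ID=x;Parent=y' into a dict."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def check_hierarchy(lines):
    """Return (line_number, feature_type, expected_parent_type) for each violation."""
    types = {}     # feature ID -> feature type
    features = []  # (feature type, list of parent IDs, line number)
    for n, line in enumerate(lines, 1):
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature row
        attrs = parse_attrs(cols[8])
        if "ID" in attrs:
            types[attrs["ID"]] = cols[2]
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        features.append((cols[2], parents, n))

    # Second pass, so that children listed before their parents are still resolved.
    expected = {"mRNA": "gene", "CDS": "mRNA", "exon": "mRNA"}
    problems = []
    for ftype, parents, n in features:
        want = expected.get(ftype)
        if want is None:
            continue
        if not parents or any(types.get(p) != want for p in parents):
            problems.append((n, ftype, want))
    return problems
```

An empty list means every mRNA has a gene parent and every CDS and exon has an mRNA parent; it says nothing about whether the CDS contains the stop codon.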
Michael Dondrup (Bergen, Norway) wrote 2.6 years ago:

That is implicit: the start codon should be at the first 5' CDS position, and the stop codon at the 3' position of the last CDS. Mind the strand and, where applicable, the phase. Genome browsers normally don't require the start and stop codon information. Also, there may be a reason not to annotate the start and stop codons. Current gene models are notoriously error-prone and often based on automatic prediction only, so stating an exact start codon, for example, implies possibly undue confidence. Experimental techniques such as ribosome profiling have often delivered surprising results with respect to unexpected translation initiation sites (I'll find you a citation for that...).

(If CDS are not annotated, you have to subtract the 5'/3' UTRs from the terminal exons.)
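
The subtraction in the parenthetical case can be sketched as plain interval arithmetic. This is a minimal Python illustration under assumed inputs: exon and UTR ranges for a single transcript, given as 1-based inclusive (start, end) tuples as in GFF. `subtract_utrs` is a hypothetical helper name.

```python
def subtract_utrs(exons, utrs):
    """Return the parts of the exons that are not covered by any UTR range."""
    coding = []
    for exon in sorted(exons):
        pieces = [exon]
        for us, ue in utrs:
            remaining = []
            for s, e in pieces:
                if ue < s or us > e:   # UTR does not touch this piece
                    remaining.append((s, e))
                    continue
                if s < us:             # part to the left of the UTR survives
                    remaining.append((s, us - 1))
                if ue < e:             # part to the right of the UTR survives
                    remaining.append((ue + 1, e))
            pieces = remaining
        coding.extend(pieces)
    return sorted(coding)


# Made-up transcript: two exons, a 5' UTR and a 3' UTR.
cds = subtract_utrs([(100, 200), (300, 400)], [(100, 149), (381, 400)])
```

Because the arithmetic is purely positional, it works the same way on either strand; only the labelling of which UTR is 5' and which is 3' depends on the strand.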

modified 2.6 years ago • written 2.6 years ago by Michael Dondrup

I know these facts, but is there a way to do that automatically with R or Python? Thanks a lot!

modified 2.6 years ago • written 2.6 years ago by I0110

Sure there is, you just need to write a little script, e.g. based on GRanges in R ;)

written 2.6 years ago by Michael Dondrup

A little hint will be much appreciated. Thanks!

written 2.6 years ago by I0110

It's a few lines in Perl, but I don't have time now. If you don't have a solution by tomorrow, send me a gentle reminder.

written 2.6 years ago by Michael Dondrup

Thanks! I tried to write it in R, but it is more difficult than I thought. The problem I have is with start and stop codons that span an exon-exon junction. In these rare cases, the start or stop codon needs to be split into two ranges.

modified 2.6 years ago • written 2.6 years ago by I0110

Why not use the CDS annotation instead?

written 2.6 years ago by Michael Dondrup

Did you mean using the CDS to annotate the start and stop codon positions? I think we would still encounter the same issue. For example, if I use the first three nucleotides of the first CDS, the codon could still span an exon-exon junction, so I cannot simply create a range from the first CDS row of the gene. I would have to use information from two CDS rows.
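
One way around the junction problem is to collect three bases from the appropriate end of the sorted CDS segments and emit one row per piece. What follows is a minimal Python sketch under these assumptions: coordinates are 1-based inclusive as in GFF, the CDS is complete and includes the stop codon, and all segments belong to one transcript. `codon_ranges` and `gff_rows` are hypothetical helpers, not part of any package, and the seqid, source, and Parent values in the example are made up.

```python
def codon_ranges(cds, strand):
    """Return (start_codon, stop_codon), each a list of (start, end) ranges.

    A codon that spans a junction comes back as two (or three) ranges.
    """
    segs = sorted(cds)

    def take(ordered, from_low_end):
        need, out = 3, []
        for s, e in ordered:
            k = min(need, e - s + 1)  # bases this segment can contribute
            out.append((s, s + k - 1) if from_low_end else (e - k + 1, e))
            need -= k
            if need == 0:
                break
        return sorted(out)

    low = take(segs, True)          # three bases at the lowest coordinates
    high = take(segs[::-1], False)  # three bases at the highest coordinates
    # On the minus strand the start codon sits at the high-coordinate end.
    return (low, high) if strand == "+" else (high, low)


def gff_rows(seqid, source, strand, parent, cds):
    """Format start_codon and stop_codon rows for one transcript."""
    start, stop = codon_ranges(cds, strand)
    rows = []
    for ftype, ranges in (("start_codon", start), ("stop_codon", stop)):
        for s, e in ranges:
            rows.append("\t".join(
                [seqid, source, ftype, str(s), str(e), ".", strand, ".",
                 "Parent=" + parent]))
    return rows


# Made-up transcript whose first CDS segment is only two bases long.
for row in gff_rows("chr1", "example", "+", "mRNA1", [(100, 101), (200, 300)]):
    print(row)
```

A start codon split across two CDS rows then simply yields two start_codon rows with the same Parent, so no special case is needed for the junction.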

written 2.6 years ago by I0110