strange coordinates for Augustus-predicted gene
6.8 years ago
Ann ★ 2.3k

I'm writing a parser for Augustus GFF and have run into a weird case.

Has anyone else seen this?

At least one gene is listed with start greater than end, but only for the gene and transcript features. Everything else belonging to the gene has start less than end.

# start gene g34050
scaffold424|size244204    AUGUSTUS    gene    243273    179256    .    -    .    ID=g34050
scaffold424|size244204    AUGUSTUS    transcript    243273    179256    0.57    -    .    ID=g34050.t1;Parent=g34050
scaffold424|size244204    AUGUSTUS    stop_codon    243273    243275    .    -    0    Parent=g34050.t1
scaffold424|size244204    AUGUSTUS    intron    243687    244108    0.59    -    .    Parent=g34050.t1
scaffold424|size244204    AUGUSTUS    CDS    243273    243686    0.91    -    0    ID=g34050.t1.cds;Parent=g34050.t1
scaffold424|size244204    AUGUSTUS    CDS    244109    244204    0.6    -    0    ID=g34050.t1.cds;Parent=g34050.t1
scaffold424|size244204    AUGUSTUS    start_codon    244202    244204    .    -    0    Parent=g34050.t1
# end gene g34050
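
For anyone who wants to check their own file for the same thing, here is a minimal Python sketch. It is not part of Augustus; the function names (`parse_gff_line`, `find_inverted`, `span_from_children`) are made up for illustration. It flags features whose start is greater than their end, and it computes the span covered by a gene's child features. The second part assumes the child features are the ones with correct coordinates, which is only a guess based on the example above.

```python
# Example block copied from the post above (columns separated by whitespace).
example = """\
# start gene g34050
scaffold424|size244204    AUGUSTUS    gene    243273    179256    .    -    .    ID=g34050
scaffold424|size244204    AUGUSTUS    transcript    243273    179256    0.57    -    .    ID=g34050.t1;Parent=g34050
scaffold424|size244204    AUGUSTUS    stop_codon    243273    243275    .    -    0    Parent=g34050.t1
scaffold424|size244204    AUGUSTUS    intron    243687    244108    0.59    -    .    Parent=g34050.t1
scaffold424|size244204    AUGUSTUS    CDS    243273    243686    0.91    -    0    ID=g34050.t1.cds;Parent=g34050.t1
scaffold424|size244204    AUGUSTUS    CDS    244109    244204    0.6    -    0    ID=g34050.t1.cds;Parent=g34050.t1
scaffold424|size244204    AUGUSTUS    start_codon    244202    244204    .    -    0    Parent=g34050.t1
# end gene g34050
""".splitlines()


def parse_gff_line(line):
    """Split one GFF line into a dict; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Assumption: no whitespace inside the first eight columns, so splitting
    # on any whitespace works for tab-separated files and pasted copies alike.
    fields = line.split(None, 8)
    if len(fields) < 9:
        return None
    return {
        "seqid": fields[0],
        "type": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attributes": fields[8],
    }


def find_inverted(lines):
    """Return the parsed features whose start is greater than their end."""
    features = (parse_gff_line(line) for line in lines)
    return [f for f in features if f is not None and f["start"] > f["end"]]


def span_from_children(lines, parent_types=("gene", "transcript")):
    """Return (min start, max end) over all features except the parent types.

    Assumption: `lines` holds exactly one gene block, and the child
    features carry trustworthy coordinates.
    """
    children = [
        f for f in map(parse_gff_line, lines)
        if f is not None and f["type"] not in parent_types
    ]
    return min(f["start"] for f in children), max(f["end"] for f in children)


for feature in find_inverted(example):
    print(feature["type"], feature["start"], feature["end"])
print(span_from_children(example))
```

On the block above this reports the gene and transcript lines as inverted, while the child features together cover 243273 to 244204. Whether that is the span Augustus intended is a question for the Augustus developers.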

genefinder augustus • 1.7k views

OK, I will do that. I have written to the person who made the file, and he should be getting back to me very soon.

6.8 years ago

augustus-web@uni-greifswald.de


I have a quick followup question about minus strand genes:

Here is a line of data:

scaffold3201|size9483    AUGUSTUS    transcript    9115    0    0.86    -    .    ID=g68389.t1;Parent=g68389


According to this, the feature begins at base 9115 and ends at base 0. However, GFF is one-based, so the start and end positions of a feature are supposed to be positive integers and 0 is not a valid coordinate. This transcript is transcribed from the minus strand, though, so the coordinate convention may be different.

By contrast, BED format uses interbase coordinates, where end is never less than start, no matter whether the gene is on the plus or minus strand.

Interbase (and bed files) define blocks, or ranges, of genomic sequence using start and end coordinate pairs (s,e) where s (start) indicates the index of the first base and e (end) indicates the index of the first base not included in the range. In addition, e >= s and the length of a range (the number of bases it covers) is always e - s.

(This is from a class I taught on bioinformatics programming many years ago :-)

I want to convert this line of data to BED format. What should be the correct coordinates for start and end in BED?
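
For a well-formed GFF feature the conversion itself is simple, and a minimal Python sketch is below. It assumes GFF start and end are 1-based and inclusive with start <= end, and that BED is 0-based and half-open as described above. The function name `gff_to_bed_interval` is hypothetical. The sketch does not answer the question for the line above: it rejects any interval with a coordinate below 1 or with end less than start, rather than guessing what was intended.

```python
def gff_to_bed_interval(start, end):
    """Convert a 1-based inclusive GFF interval to a 0-based half-open BED interval.

    Strand does not enter into it: both formats give coordinates on the
    forward strand and record the strand in a separate column.
    """
    if start < 1 or end < start:
        # Assumption: refuse malformed input instead of silently swapping
        # or clamping the coordinates.
        raise ValueError(f"not a valid GFF interval: start={start}, end={end}")
    # Only the start shifts down by one; the inclusive 1-based end has the
    # same value as the exclusive 0-based end.
    return start - 1, end


# The stop codon from the first post: GFF 243273..243275 covers 3 bases.
bed_start, bed_end = gff_to_bed_interval(243273, 243275)
print(bed_start, bed_end, bed_end - bed_start)
```

With this, the BED length e - s matches the number of bases the GFF feature covers (end - start + 1).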


Sorry, I just realized Augustus GFF does not report exons - just CDSs.