Question

changing ID in an existing GFF3 file

4.4 years ago

Ric ▴ 430

I have an annotation in GFF3 format, but I no longer have the amino acid and CDS sequences. Is there a tool that can retrieve those sequences from a genome in FASTA format and a GFF3 file?

Thank you in advance

gene annotation • 3.8k views

ADD COMMENT • link updated 4.2 years ago by lieven.sterck 15k • written 4.4 years ago by Ric ▴ 430

Could anyone please revert the question and title to the previous version?

ADD REPLY • link 4.2 years ago by Ric ▴ 430

Hi Ric

We were able to trace back the original post title; however, we cannot recover the original post content. Perhaps you are best placed to re-create it?

ADD REPLY • link 4.2 years ago by lieven.sterck 15k

score 0 · Answer 1 · 2019-12-05

4.4 years ago

Juke34 8.5k

You can try agat_sp_manage_IDs.pl from AGAT.

ADD COMMENT • link 4.4 years ago by Juke34 8.5k

I installed AGAT. Could you please show me how to change the IDs from g65212 to AT1G01010.1, AT1G01020.1, AT1G01030.1 with agat_sp_manage_IDs.pl?

ADD REPLY • link 4.3 years ago by Ric ▴ 430

Did you check the script's help output?

ADD REPLY • link 4.3 years ago by Juke34 8.5k

I looked at the help but I did not understand it.

ADD REPLY • link 4.3 years ago by Ric ▴ 430

I agree, it should be improved :)

agat_sp_manage_IDs.pl --gff yourfile.gff --prefix AT1 -o result.gff

For the first gene the ID will be AT1G1, for the second AT1G2, and so on.
For the first mRNA the ID will be AT1M1, for the second AT1M2, and so on.
I hope that is what you want.

ADD REPLY • link 4.3 years ago by Juke34 8.5k

I updated my question, which might better explain how I would like to change the IDs.

ADD REPLY • link 4.3 years ago by Ric ▴ 430

AGAT does not currently number IDs that way, but it could be updated in a future version to follow this convention if it is widely used (for example, by several large databases).

ADD REPLY • link 4.3 years ago by Juke34 8.5k

For example, the Arabidopsis group does it this way: ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff . Can AGAT do anything close to it?
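
To make the requested convention concrete, here is a minimal Python sketch that generates identifiers shaped like the AT1G01010.1, AT1G01020.1, AT1G01030.1 examples given above. It is not part of AGAT. The function name `tair_style_ids`, the starting number 1010 and the step of 10 are assumptions read off those three examples, and the digit after the species prefix is assumed to be the chromosome number.

```python
def tair_style_ids(genes, species="AT", start=1010, step=10):
    """Build TAIR-like locus and transcript IDs (hypothetical helper).

    genes: iterable of (chromosome_number, n_transcripts) tuples, already
    sorted by position along each chromosome.
    """
    next_number = {}  # one running locus number per chromosome
    out = []
    for chrom, n_transcripts in genes:
        number = next_number.get(chrom, start)
        next_number[chrom] = number + step
        locus = f"{species}{chrom}G{number:05d}"
        # Transcripts are numbered .1, .2, ... under their locus.
        transcripts = [f"{locus}.{i}" for i in range(1, n_transcripts + 1)]
        out.append((locus, transcripts))
    return out


# Two genes on chromosome 1 (the second with two isoforms), one on chromosome 2.
result = tair_style_ids([(1, 1), (1, 2), (2, 1)])
for locus, transcripts in result:
    print(locus, transcripts)
```

Leaving gaps of 10 between loci keeps room for genes added in later annotation releases without renumbering the existing ones.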

ADD REPLY • link 4.3 years ago by Ric ▴ 430

I updated my question (update 2). However, why does the first gene have the ID AT1G00000068467 while its mRNA has the ID AT1M00000076570? Should the gene ID not start at 1 because of --nb 1?

ADD REPLY • link 4.2 years ago by Ric ▴ 430

With or without --nb 1, numbering should start at 1. What do you get if you grep for '0000001' in the file? You should find something starting at 1...

ADD REPLY • link 4.2 years ago by Juke34 8.5k

I found NbV1Ch04 AUGUSTUS gene 61731467 61732149 0.15 + . ID=AT1G00000010000. Why are there 0000 after the 1, and why is NbV1Ch04 the first one rather than NbV1Ch04?

ADD REPLY • link 4.2 years ago by Ric ▴ 430

That shouldn't be the only result. This is the 10,000th gene, but if you grep for "AT1G00000000001" you will find the first one. I just checked and it works for me. The only problem is that the order in which the IDs are assigned is not the order in which the results are printed at the end, so the first line in the output does not necessarily carry the first number. I will open an issue in the repo to improve that for the next release.
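
Searching for the substring '0000001' also matches AT1G00000010000, which is why the 10,000th gene turned up. As a minimal Python sketch (the helper name `find_by_id` is hypothetical and not part of AGAT), an exact comparison against the ID attribute in column 9 avoids such partial hits and reports where in the file the feature sits:

```python
def find_by_id(gff_lines, wanted_id):
    """Return (line_number, line) of the feature whose ID equals wanted_id."""
    for n, line in enumerate(gff_lines, start=1):
        if line.startswith("#"):
            continue  # skip directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        if attrs.get("ID") == wanted_id:
            return n, line
    return None


lines = [
    "NbV1Ch04\tAUGUSTUS\tgene\t61731467\t61732149\t0.15\t+\t.\tID=AT1G00000010000",
    "NbV1Ch19\tAUGUSTUS\tgene\t97401\t99254\t0.03\t-\t.\tID=AT1G00000000001",
]
hit = find_by_id(lines, "AT1G00000000001")
print(hit)
```

For a real file, the lines could be passed as an open file handle, for example `find_by_id(open("result.gff"), "AT1G00000000001")`.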

ADD REPLY • link 4.2 years ago by Juke34 8.5k

I found it in my last chromosome:

NbV1Ch19        AUGUSTUS        gene    97401   99254   0.03    -       .       ID=AT1G00000000001
NbV1Ch19        AUGUSTUS        mRNA    97401   99254   0.03    -       .       ID=AT1M00000000001;Parent=AT1G00000000001
NbV1Ch19        AUGUSTUS        exon    97401   99007   .       -       .       ID=AT1E00000000001;Parent=AT1M00000000001
NbV1Ch19        AUGUSTUS        exon    99101   99254   .       -       .       ID=AT1E00000000002;Parent=AT1M00000000001
NbV1Ch19        AUGUSTUS        CDS     98823   99007   0.36    -       2       ID=AT1C00000000001;Parent=AT1M00000000001
NbV1Ch19        AUGUSTUS        CDS     99101   99230   0.68    -       0       ID=AT1C00000000002;Parent=AT1M00000000001
NbV1Ch19        AUGUSTUS        five_prime_utr  99231   99254   0.25    -       .       ID=AT1F00000000001;Parent=AT1M00000000001
NbV1Ch19        AUGUSTUS        intron  99008   99100   0.69    -       .       ID=AT1I00000000001;Parent=AT1M00000000001
NbV1Ch19        AUGUSTUS        start_codon     99228   99230   .       -       0       ID=AT1S00000000001;Parent=AT1M00000000001
NbV1Ch19        AUGUSTUS        stop_codon      98823   98825   .       -       0       ID=AT1ST00000000001;Parent=AT1M00000000001
NbV1Ch19        AUGUSTUS        three_prime_utr 97401   98822   0.05    -       .       ID=AT1T00000000001;Parent=AT1M00000000001

Why is there such a big difference in IDs between a gene and its sub-features?

NbV1Ch01        AUGUSTUS        gene    97932   99714   0.06    -       .       ID=AT1G00000068467
NbV1Ch01        AUGUSTUS        mRNA    97932   99714   0.06    -       .       ID=AT1M00000076570;Parent=AT1G00000068467
NbV1Ch01        AUGUSTUS        exon    97932   98571   .       -       .       ID=AT1E00000339808;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        exon    98679   98844   .       -       .       ID=AT1E00000339809;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        exon    99134   99325   .       -       .       ID=AT1E00000339810;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        exon    99417   99714   .       -       .       ID=AT1E00000339811;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        CDS     98177   98571   1       -       2       ID=AT1C00000294005;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        CDS     98679   98844   1       -       0       ID=AT1C00000294006;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        CDS     99134   99325   1       -       0       ID=AT1C00000294007;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        CDS     99417   99668   0.65    -       0       ID=AT1C00000294008;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        five_prime_utr  99669   99714   0.14    -       .       ID=AT1F00000101217;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        intron  98572   98678   1       -       .       ID=AT1I00000123933;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        intron  98845   99133   1       -       .       ID=AT1I00000123934;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        intron  99326   99416   1       -       .       ID=AT1I00000123935;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        start_codon     99666   99668   .       -       0       ID=AT1S00000057436;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        stop_codon      98177   98179   .       -       0       ID=AT1ST00000057445;Parent=AT1M00000076570
NbV1Ch01        AUGUSTUS        three_prime_utr 97932   98176   0.44    -       .       ID=AT1T00000096168;Parent=AT1M00000076570

ADD REPLY • link 4.2 years ago by Ric ▴ 430

Let's say each gene has 10 exons. When you reach the 150th gene, its first exon will be numbered 1491 and its last exon 1500. So the number simply reflects how many features of that type have been encountered before.

ADD REPLY • link 4.2 years ago by Juke34 8.5k

What confused me in the output I pasted in my previous comment is that the gene ID is AT1G00000068467, the mRNA ID is AT1M00000076570 and the first exon ID is AT1E00000339808. Why is the mRNA ID not AT1G00000068468 and the first exon ID not AT1G00000068469?

ADD REPLY • link 4.2 years ago by Ric ▴ 430

Because each feature type (3rd column) is numbered independently. Here is an example:

NbV1Ch01        AUGUSTUS        gene    97932   99714   0.06    -       .       ID=gene1
NbV1Ch01        AUGUSTUS        mRNA    97932   99714   0.06    -       .       ID=mRNA1
NbV1Ch01        AUGUSTUS        exon    97932   98571   .       -       .       ID=exon1
NbV1Ch01        AUGUSTUS        exon    98679   98844   .       -       .       ID=exon2
NbV1Ch01        AUGUSTUS        exon    99134   99325   .       -       .       ID=exon3
NbV1Ch01        AUGUSTUS        exon    99417   99714   .       -       .       ID=exon4
NbV1Ch01        AUGUSTUS        CDS     98177   98571   1       -       2       ID=cds1
NbV1Ch01        AUGUSTUS        CDS     98679   98844   1       -       0       ID=cds2
NbV1Ch01        AUGUSTUS        CDS     99134   99325   1       -       0       ID=cds3
NbV1Ch01        AUGUSTUS        CDS     99417   99668   0.65    -       0       ID=cds4
NbV1Ch01        AUGUSTUS        mRNA    97935   99711   0.06    -       .       ID=mRNA2
NbV1Ch01        AUGUSTUS        exon    97935   98571   .       -       .       ID=exon5
NbV1Ch01        AUGUSTUS        exon    98679   98844   .       -       .       ID=exon6
NbV1Ch01        AUGUSTUS        exon    99134   99325   .       -       .       ID=exon7
NbV1Ch01        AUGUSTUS        exon    99417   99711   .       -       .       ID=exon8
NbV1Ch01        AUGUSTUS        CDS     98177   98571   1       -       2       ID=cds5
NbV1Ch01        AUGUSTUS        CDS     98679   98844   1       -       0       ID=cds6
NbV1Ch01        AUGUSTUS        CDS     99134   99325   1       -       0       ID=cds7
NbV1Ch01        AUGUSTUS        gene    109665  112554  0.04    -       .       ID=gene2
NbV1Ch01        AUGUSTUS        mRNA    109665  112554  0.04    -       .       ID=mRNA3
NbV1Ch01        AUGUSTUS        exon    109665  110489  .       -       .       ID=exon9
NbV1Ch01        AUGUSTUS        exon    110608  111042  .       -       .       ID=exon10
NbV1Ch01        AUGUSTUS        exon    111592  111844  .       -       .       ID=exon11
NbV1Ch01        AUGUSTUS        exon    112128  112554  .       -       .       ID=exon12
NbV1Ch01        AUGUSTUS        CDS     109839  110489  0.69    -       0       ID=cds8
NbV1Ch01        AUGUSTUS        CDS     110608  111042  0.21    -       0       ID=cds9
NbV1Ch01        AUGUSTUS        CDS     111592  111844  0.23    -       1       ID=cds10
NbV1Ch01        AUGUSTUS        CDS     112128  112450  0.95    -       0       ID=cds11
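
The per-type numbering shown above can be sketched in a few lines of Python. This is only an illustration of the idea and not AGAT's actual code: the function name `number_by_type` is hypothetical, and the letter codes and the 11-digit zero padding are taken from the IDs pasted earlier in this thread.

```python
from collections import defaultdict

# Letter codes as they appear in the IDs pasted earlier in this thread.
TYPE_CODES = {
    "gene": "G", "mRNA": "M", "exon": "E", "CDS": "C",
    "five_prime_utr": "F", "three_prime_utr": "T", "intron": "I",
    "start_codon": "S", "stop_codon": "ST",
}


def number_by_type(feature_types, prefix="AT1", width=11):
    """Assign IDs using one running counter per feature type (column 3)."""
    counters = defaultdict(int)
    ids = []
    for ftype in feature_types:
        counters[ftype] += 1  # each type counts on its own
        ids.append(f"{prefix}{TYPE_CODES[ftype]}{counters[ftype]:0{width}d}")
    return ids


# Two genes, each with one mRNA and two exons.
types = ["gene", "mRNA", "exon", "exon", "gene", "mRNA", "exon", "exon"]
ids = number_by_type(types)
for ftype, new_id in zip(types, ids):
    print(ftype, new_id)
```

Because the exon counter never resets, the second gene's exons continue at 3 and 4, which is why exon numbers run far ahead of gene numbers in a large annotation.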

ADD REPLY • link 4.2 years ago by Juke34 8.5k

I understand, but would it not be less confusing if the numbering of sub-features depended on their parent feature?

ADD REPLY • link 4.2 years ago by Ric ▴ 430

That sounds reasonable. Noted; I will modify the code for a future release. Could you check whether this is what you expect? https://github.com/NBISweden/AGAT/issues/16
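
For comparison, a parent-dependent scheme could look like the Python sketch below. It is purely hypothetical: it is neither AGAT's implementation nor necessarily what the linked issue settles on, and the function name `number_by_parent` and the ID layout (gene, then `.N` for each mRNA, then `.typeN` for each child) are assumptions made for illustration. It also assumes features arrive in hierarchical order: a gene, then each of its mRNAs, each followed by its own exon and CDS lines.

```python
def number_by_parent(feature_types, prefix="AT1G", width=5):
    """Derive every child ID from its parent's ID (hypothetical scheme)."""
    ids = []
    gene_n = 0
    mrna_n = 0
    gene_id = None
    mrna_id = None
    child_n = {}  # per-mRNA counters, keyed by feature type
    for ftype in feature_types:
        if ftype == "gene":
            gene_n += 1
            mrna_n = 0  # transcript numbering restarts for each gene
            gene_id = f"{prefix}{gene_n:0{width}d}"
            ids.append(gene_id)
        elif ftype == "mRNA":
            mrna_n += 1
            child_n = {}  # child numbering restarts for each mRNA
            mrna_id = f"{gene_id}.{mrna_n}"
            ids.append(mrna_id)
        else:
            child_n[ftype] = child_n.get(ftype, 0) + 1
            ids.append(f"{mrna_id}.{ftype}{child_n[ftype]}")
    return ids


types = ["gene", "mRNA", "exon", "exon", "CDS", "mRNA", "exon",
         "gene", "mRNA", "exon"]
ids = number_by_parent(types)
for ftype, new_id in zip(types, ids):
    print(ftype, new_id)
```

With this layout the gene a sub-feature belongs to can be read directly from its ID, at the cost of longer identifiers.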

ADD REPLY • link 4.2 years ago by Juke34 8.5k

Could you not simply replace NbV1Ch01 with chr1 (and so on for the others) using sed or a similar tool? Changing AUGUSTUS to TAIR10 could be done in a similar way, but does that make sense? It is just an identifier anyway.
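
As a rough Python equivalent of that sed approach, the sketch below renames only the first column (the sequence ID) and leaves everything else untouched. The function name `rename_seqids` and the NbV1Ch01 to chr1 mapping are illustrative assumptions. Directive lines such as `##sequence-region` are passed through unchanged, so they would need separate handling if the file contains them.

```python
def rename_seqids(gff_lines, mapping):
    """Replace the seqid (column 1) of each feature line using mapping."""
    out = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            out.append(line)  # keep directives, comments and blank lines
            continue
        cols = line.split("\t")
        cols[0] = mapping.get(cols[0], cols[0])  # unknown seqids stay as-is
        out.append("\t".join(cols))
    return out


lines = [
    "##gff-version 3",
    "NbV1Ch01\tAUGUSTUS\tgene\t97932\t99714\t0.06\t-\t.\tID=g1",
]
renamed = rename_seqids(lines, {"NbV1Ch01": "chr1"})
print("\n".join(renamed))
```

Restricting the replacement to column 1 avoids accidentally rewriting the same string if it also occurs inside an attribute value.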

ADD REPLY • link 4.3 years ago by GenoMax 141k

I am sorry for the confusion, but I would like to change the IDs in the last column.

ADD REPLY • link 4.3 years ago by Ric ▴ 430