Sort gff3 on chromosome, position and then featuretype (gene, mRNA, exon, CDS)
1
0
Entering edit mode
11 months ago
William ★ 4.8k

Is it possible to sort a gff3 on chromosome, position and then featuretype (gene, mRNA, exon, CDS).

The order of the featuretypes is important when converting a gff file to a gtf file with gffread.

If the mRNA record is before the gene record, then the gene_id and gene_name are not set in the last column for the exons and CDS in the gtf file.

Bedtools sort only sort on chromosome and position, not on featuretype.

The goal is to convert any gff3 file (that has genes, mRNA, exon and CDS records) into a GTF file that has exon and CDS records. The exon and CDS records in the GTF should have the transcript_id, gene_id and gene_name set.

When converting from gff3 to gtf via gffread, this requires the gene to be before the mRNA record.

Chr_01  BGI_GENES       exon    2972   3128   .       -       .       transcript_id "M0001760.1"; gene_id "G0001760"; gene_name "G0001760";

This because of a downstream requirement for the gene_id and gene_name in the GTF.

gff3 • 708 views
ADD COMMENT
0
Entering edit mode

Note, depending on the source of your GFF3 file, it is possible that the 'mRNA' feature type is expanded to use terms such as transcript, miRNA, lncRNA, etc. Things can get a bit tricky with such a list. A hacky way of doing this would be to use something like awk to append an alphabet to make sorting easy; something like wgene, xmRNA, yexon and zCDS. Then, you just do an alphasort followed by another step of sed to remove the extra prefix.

ADD REPLY
2
Entering edit mode
11 months ago
Juke34 ★ 5.3k

Try agat_sp_gxf_to_gff3.pl agat_convert_sp_gxf2gxf.pl from AGAT.
see here for an example of result: https://github.com/NBISweden/AGAT/wiki/Topological-sorting-of-gff-features

If your goal is to create a GTF you can use directly agat_convert_sp_gff2gtf.pl

ADD COMMENT
0
Entering edit mode

Thank you, I'll give both options in AGAT a try.

ADD REPLY
0
Entering edit mode

agat_sp_gxf_to_gff3.pl did fix the sorting issue, and also added explicit UTR records (which are already implied by the exon to CDS differences). This seems to have fixed the downstream issue also that depends on the correct sorting. Strange thing is that after running agat_sp_gxf_to_gff3.pl agat_sp_compare_two_annotations.pl finds that 10 out of the 20K genes are different now. This is of course a very low relative number, still is there a way to output the 10 genes that are found to be different?

ADD REPLY
1
Entering edit mode

How different are they? Usually it is just the boundaries that changes. They might be originally wrong in the sense they are shorter or longer than the defined sub-feature locations (I regularly see cases like that. The script fix them). If it is the case, it is mentioned in Check8 We fixed 10 wrong level1 position cases. You can add verbosity with -v to get more information. Otherwise you will have to wait a next release. I had other requests for a log file summarising the modifications performed by AGAT. I'm working on it.

ADD REPLY
0
Entering edit mode

There is no information under Check8. I'll wait for the next release for the extra logging. I also ran the gff3 file fixed by AGAT trough our internal gff3 validator, it complains about duplicate UTR ids. After removing the UTR records (which are implied by the exon and CDS differences), the validator finds the file created by AGAT to be valid.

Some genes seem to have multiple (partially) non coding exons defined, these should have unique IDs.

Duplicate Attribute ID found: 'nbis-five_prime_utr-9488' at line 4
Duplicate Attribute ID found: 'nbis-three_prime_utr-3150' at line 18
Duplicate Attribute ID found: 'nbis-five_prime_utr-10655' at line 21
Duplicate Attribute ID found: 'nbis-five_prime_utr-14430' at line 24
Duplicate Attribute ID found: 'nbis-three_prime_utr-14681' at line 28

Chr_01  GENES       five_prime_UTR  3385   3403   .       -       .       ID=nbis-five_prime_utr-9488;Parent=00020.1
Chr_01  GENES       five_prime_UTR  3922   4089   .       -       .       ID=nbis-five_prime_utr-9488;Parent=00020.1
ADD REPLY
1
Entering edit mode

From the gff3 specification

In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines.

It is the case here so it is not an error. For discontinuous features the AGAT parser accepts both as input (shared ID or uniq ID by chunk). But when the AGAT parser has to create discontinuous features it sets a shared ID.

agat_sp_manage_IDs.pl set uniq ID for discontinuous features by default. It make me think I could add an option to get also shared ID in that script.

ADD REPLY
0
Entering edit mode

It is in the master branch. To try it you need to git clone and make install AGAT in the agat conda env you already have. I have renamed agat_sp_gxf_to_gff3.pl into agat_convert_sp_gxf2gxf.pl

ADD REPLY

Login before adding your answer.

Traffic: 2150 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6