Detect discrepancies in GFF files
0
0
Entering edit mode
15 months ago
nlehmann ▴ 130

Hello,

Is there a way to detect discrepancies in GFF files such as:

1- A gene has 3 transcripts, 1 is oriented on the forward strand, 2 are oriented on the reverse strand. I tried with Gffread, but it just affects the value of the first transcript met.

2- A transcript that's on chromosome #2 is associated to a gene that's on chromosome #1.

I've seen these situations on an official annotation from UCSC, so I wonder if there's a way to automatically detect and correct them. I've seen them using mikado compare, because it crashes when finding these situations. Maybe AGAT could handle that (I didn't find but I think it's worth asking) ?

Thanks for the help !

EDIT: AGAT actually detects discrepancies in agat_sp_compare_two_annotations.pl. It outputs "We fixed 1 wrong level2 location cases" for the first case (transcripts oriented both ways). But can we output the corrected files ?

RNA-Seq Gffread AGAT UCSC • 493 views
2
Entering edit mode

agat_sp_compare_two_annotations.pl is to detect fusion/split of genes between two annotation builds. All scripts with the _sp_ prefix pass through the the GFF AGAT parser (the others with _sq_ use the Bioperl parser). It is why you see fixes when using agat_sp_compare_two_annotations.pl. But if your goal is just to check the sanity of your file, as mentioned by @Michael Dondrup, agat_convert_sp_gxf2gxf.pl is the script to use (you will get a log file resuming the problem detected by AGAT and modifications made).

"We fixed 1 wrong level2 location cases" means the mRNA boundaries is wrong according to the sub features attached to it. (e.g one exon is over the mRNA boundary. So AGAT extends the mRNA to include the exon.)

I don't think cases described in 1) might be detected (btw never seen that before... if you could share one example it would be interesting).
For the case2, if the Parent/ID relationship is properly defined, or if they collectively share the same locus_tag/gene_id attribute, AGAT will not detect such case neither.

The solution would be to remove all Parent/ID relationship and locus_tag/gene_id attributes, and let AGAT reshape everything (using --merge_loci option to merge mRNA with overlap in their CDS under a uniq gene umbrella.). But this solution will wipe real cases of overlapping genes if any (Overlapping CDS of different genes is quite rare in eukaryote). If you don't use the --merge_loci option, each isoform will have its own gene feature as parent.

You should use GFF3toolkit it is much appropriate. I guess your errors are labelled by these error codes Ema0007 and Ema0009.

0
Entering edit mode

Thanks a lot for your replies. I will try with GFF3toolkit and agat_convert_sp_gxf2gxf.pl then.

Here is how to reproduce the errors:

> wget https://hgdownload.soe.ucsc.edu/goldenPath/galGal6/bigZips/genes/galGal6.ncbiRefSeq.gtf.gz
> gffread -E --keep-genes galGal6.ncbiRefSeq.gtf -o- > galGal6.ncbiRefSeq.gff3
> grep 'CCL4' galGal6.ncbiRefSeq.gff3  #case 1
chr19   ncbiRefSeq.2020-04-01   gene    569498  578185  .   -   .   ID=CCL4;Name=CCL4
chr19   ncbiRefSeq.2020-04-01   transcript  569498  570362  .   -   .   ID=NM_001030360.2;Parent=CCL4
chr19   ncbiRefSeq.2020-04-01   transcript  576493  578185  .   +   .   ID=XM_015295666.2;Parent=CCL4
chr19   ncbiRefSeq.2020-04-01   transcript  576608  578177  .   +   .   ID=NM_204720.1;Parent=CCL4
> grep 'RAP1GAP2' galGal6.ncbiRefSeq.gff3  #case 2
chr21   ncbiRefSeq.2020-04-01   gene    6594099 9716788 .   +   .   ID=RAP1GAP2;Name=RAP1GAP2
chr21   ncbiRefSeq.2020-04-01   transcript  6594099 6601942 .   +   .   ID=XM_025142578.1;Parent=RAP1GAP2
chr19   ncbiRefSeq.2020-04-01   transcript  9664803 9716788 .   +   .   ID=XM_015295816.2;Parent=RAP1GAP2
chr19   ncbiRefSeq.2020-04-01   transcript  9664803 9716788 .   +   .   ID=XM_015295815.2;Parent=RAP1GAP2
chr19   ncbiRefSeq.2020-04-01   transcript  9669557 9716788 .   +   .   ID=XM_015295821.2;Parent=RAP1GAP2
chr19   ncbiRefSeq.2020-04-01   transcript  9669696 9716788 .   +   .   ID=XM_015295820.2;Parent=RAP1GAP2
chr19   ncbiRefSeq.2020-04-01   transcript  9674710 9716788 .   +   .   ID=XM_015295818.1;Parent=RAP1GAP2
chr19   ncbiRefSeq.2020-04-01   transcript  9674805 9716788 .   +   .   ID=NM_001031478.2;Parent=RAP1GAP2
chr19   ncbiRefSeq.2020-04-01   transcript  9679150 9716788 .   +   .   ID=XM_015295817.2;Parent=RAP1GAP2
chr19   ncbiRefSeq.2020-04-01   transcript  9692479 9716788 .   +   .   ID=XM_015295819.2;Parent=RAP1GAP2


Case1 also happens with galGal6.refGene.gtf.gz - UCSC downloads but not case2 - there might be other examples though. I think the main issue comes from the geneID being identical to geneName.

1
Entering edit mode

Which genome are we talking about? These deviations appear quite grave. In case these are errors, I would prefere to have them fixed upstream because even though we cannot think of any reason right now, there might well be one (fusion transcripts, trans-splicing, aligned experimental transcript sequence, circular transcripts, etc.).

If you need a quick fix, it looks like the AGAT script agat_convert_sp_gxf2gxf.pl could be the script that outputs a 'corrected' version. However, I would double-check if the output conforms to your expectations. I haven't used AGAT before, btw, but it looks very promising.

1
Entering edit mode

We're talking about chicken genome annotation from UCSC with NCBI RefSeq annotation. To get it: wget https://hgdownload.soe.ucsc.edu/goldenPath/galGal6/bigZips/genes/galGal6.ncbiRefSeq.gtf.gz. A simple grep of CCL4 (case #1 about transcripts orientations) or RAP1GAP2 (case #2 about chromosomes number) will show you the discrepancies. It's even clearer when converting GTF to GFF3 first.

0
Entering edit mode

It looks like there are two copies of CCL4 gene oriented towards (and near) each other. Go this link and then search with CCL4 in search assembly box.

There is only one location for RAP1GAP2 on chr19. So I am not sure where the other annotation is coming from

0
Entering edit mode

Yes, there are 2 close by CCL4 genes, but does it really make sense ? I guess that using a better geneID instead of geneName would prevent (some of) these errors. Do you know of a tool to directly integrate geneID from Ensembl to these GTF annotations files ?

0
Entering edit mode

CCL4 copies are not overlapping. Since Chicken genome has been through a few builds the sequence in that region must be stable, if the two copies are being annotated as they are. Ensembl also has the two copies.

Annotation for RAPGAP2 on chr21 is based on NCBI's automated gnomon eukaryotic annotation pipeline. It may not have strong experimental support. So take that into account. Not sure why UCSC has included it in their RefSeq category.

NC_006108.5     Gnomon  gene    6594099 6601942 .       +       .       ID=gene-RAP1GAP2-2;Dbxref=CGNC:55278,GeneID:769475;Name=RAP1GAP2;gbkey=Gene;gene=RAP1GAP2;gene_biotype=protein_coding
NC_006108.5     Gnomon  mRNA    6594099 6601942 .       +       .       ID=rna-XM_025142578.1;Parent=gene-RAP1GAP2-2;Dbxref=GeneID:769475,Genbank:XM_025142578.1,CGNC:55278;Name=XM_025142578.1;gbkey=mRNA;gene=RAP1GAP2;model_evidence=Supporting evidence includes similarity to: 1 EST%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 9 samples with support for all annotated introns;product=RAP1 GTPase activating protein 2;transcript_id=XM_025142578.1