Error running gffcompare
7.0 years ago

I am trying to run the following gffcompare command:

gffcompare -r ref.gff -G -o merged stringtie-merged.gtf

ref.gff - downloaded from NCBI

stringtie-merged.gtf - obtained from the stringtie --merge command

Error encountered : GFF Error: overlapping duplicate transcript feature (ID=gene29892)

When I grep "gene29892" in both ref.gff and stringtie-merged.gtf, I get the following:

From ref.gff

NC_007957.1 RefSeq  gene    74631   74744   .   +   .   ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=1/2

NC_007957.1 RefSeq  gene    146276  147073  .   +   .   ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=2/2

NC_007957.1 RefSeq  CDS 74631   74744   .   +   0   ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11

NC_007957.1 RefSeq  CDS 146276  146507  .   +   0   ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11

NC_007957.1 RefSeq  CDS 147048  147073  .   +   2   ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11

NC_007957.1 RefSeq  exon    146276  146507  .   +   .   ID=id318095;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=1;gbkey=exon;gene=rps12

NC_007957.1 RefSeq  exon    147048  147073  .   +   .   ID=id318096;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=2;gbkey=exon;gene=rps12

From stringtie-merged.gtf

NC_007957.1 StringTie   transcript  74631   147073  1000    +   .   gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892"; 

NC_007957.1 StringTie   exon    74631   74744   1000    +   .   gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892"; 

NC_007957.1 StringTie   exon    146276  146507  1000    +   .   gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "2"; gene_name "rps12"; ref_gene_id "gene29892"; 

NC_007957.1 StringTie   exon    147048  147073  1000    +   .   gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "3"; gene_name "rps12"; ref_gene_id "gene29892"; 

NC_007957.1 StringTie   transcript  146276  147073  1000    +   .   gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892"; 

NC_007957.1 StringTie   exon    146276  147073  1000    +   .   gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892";

What could possibly be wrong? I cannot see any duplicate values. Some of the start and stop coordinates are identical, but the tags/labels are different.

gffcompare gff stringtie RNA-Seq • 4.0k views
7.0 years ago

How come you don't see that you have a duplicate gene line in ref.gff:

NC_007957.1 RefSeq  gene    74631   74744   .   +   .   ID=gene29892;Dbxref=GeneID:4025012 ...

NC_007957.1 RefSeq  gene    146276  147073  .   +   .   ID=gene29892;Dbxref=GeneID:4025012 ...

Look at the ID: it is the same on both lines, and both are "gene" features on the same scaffold, only at different positions.

In addition, your first file is in GFF3 format and the second one is in GTF format; you should have them in the same format to compare them. GTF attributes always start with "gene_id" and "transcript_id", while GFF3 is less strict and usually has "ID", "Name", "Parent", and other attributes.

The program doesn't allow you to compare them because you have two "gene" lines in the first file that have the same gene ID (ID=gene29892).
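To check a reference file for this kind of clash before running gffcompare, a short script can list every ID that appears on more than one "gene" row. This is only a minimal sketch: the function name `find_duplicate_ids` is made up for illustration, the rows are assumed to be tab-delimited as in a real GFF3 file, and the sample rows are shortened copies of the two lines quoted above.

```python
from collections import defaultdict


def find_duplicate_ids(lines, feature_type="gene"):
    """Return {ID: [(start, end), ...]} for IDs used by more than one row of feature_type."""
    seen = defaultdict(list)
    for line in lines:
        # Skip blank lines and GFF3 comments/directives.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs:
            seen[attrs["ID"]].append((int(cols[3]), int(cols[4])))
    return {fid: spans for fid, spans in seen.items() if len(spans) > 1}


# Shortened versions of the two "gene" rows from ref.gff (attributes trimmed).
sample = [
    "NC_007957.1\tRefSeq\tgene\t74631\t74744\t.\t+\t.\tID=gene29892;Name=rps12;part=1/2",
    "NC_007957.1\tRefSeq\tgene\t146276\t147073\t.\t+\t.\tID=gene29892;Name=rps12;part=2/2",
]
print(find_duplicate_ids(sample))
# {'gene29892': [(74631, 74744), (146276, 147073)]}
```

On a real file, the same function can be called with an open file handle, for example `find_duplicate_ids(open("ref.gff"))`.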


But then they are indeed 2 different genes on the same scaffold. By "duplicates" I assumed that the "ID", start and stop should all be the same. Anyway, thanks for the answer. Any suggestions on how to bypass this issue? The GFF3 file was obtained from NCBI.

Additionally, from the ReadMe.md file, it looks like either GFF or GTF can be used without any issue.


By "duplicates" I assume that the "ID", start and stop should be same

For a program that reads a GFF/GFF3/GTF file, duplicate means "having the same ID", because the ID is the key used in the dictionary. Also, in the definition of the 9th GFF field, you'll find that the "ID" must always be unique, while the "Name" need not be.
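The dictionary point can be illustrated with a toy loader. This is not gffcompare's actual code, only a hypothetical Python sketch (the name `index_by_id` and the record layout are assumptions) of why a parser keyed on the ID rejects a second record with the same key even when the coordinates differ.

```python
def index_by_id(records):
    """Build {ID: record}, raising ValueError when an ID is seen twice."""
    index = {}
    for rec in records:
        if rec["ID"] in index:
            # Coordinates are never compared: the key alone decides.
            raise ValueError(f"duplicate feature ID={rec['ID']}")
        index[rec["ID"]] = rec
    return index


# The two "gene" rows from ref.gff, reduced to ID and coordinates.
records = [
    {"ID": "gene29892", "start": 74631, "end": 74744},
    {"ID": "gene29892", "start": 146276, "end": 147073},
]

try:
    index_by_id(records)
    message = None
except ValueError as err:
    message = str(err)

print(message)
# duplicate feature ID=gene29892
```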

This is most likely a mistake by whoever made that GFF file. Not everything you find on NCBI is gold, so it's always better to be careful!
