Question

Cuffmerge: GFF Error: duplicate/invalid 'transcript'

9.2 years ago

pld 5.1k

I am getting the following error using cuffmerge (2.2.1):

[Mon Apr 18 07:07:41 2016] Beginning transcriptome assembly merge
-------------------------------------------

[Mon Apr 18 07:07:41 2016] Preparing output location cuffmerge/
[Mon Apr 18 07:07:57 2016] Converting GTF files to SAM
[07:07:57] Loading reference annotation.
GFF Error: duplicate/invalid 'transcript' feature ID=id102945
        [FAILED]
Error: could not execute gtf_to_sam

The reference GFF came from NCBI. Here's what I get if I grep for "ID=id102945":

NW_015493306.1  Gnomon  C_gene_segment  47653   72466   .       -       .      ID=id102945;Parent=gene4402;Dbxref=GeneID:107506276;gbkey=C_region;gene=LOC107506276;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns
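A grep for a single ID only shows the lines for that one ID. As a rough sketch (not something used in this thread), the Python below lists every `ID` that appears on more than one line of a GFF3 file, together with the feature types involved. GFF3 itself allows a discontinuous feature to repeat an ID across lines, so a hit here is only a candidate for what the cufflinks parser is rejecting. The function names and the file name in the usage comment are hypothetical.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_repeated_ids(lines):
    """Return {ID: [(line_number, feature_type), ...]} for IDs seen on 2+ lines."""
    seen = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # Skip directives, comments, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        feature_id = parse_attributes(columns[8]).get("ID")
        if feature_id is not None:
            seen[feature_id].append((number, columns[2]))
    return {fid: hits for fid, hits in seen.items() if len(hits) > 1}


# Usage (hypothetical file name):
# with open("reference.gff") as handle:
#     for fid, hits in find_repeated_ids(handle).items():
#         print(fid, hits)
```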

Any suggestions? I can't seem to find anything. I didn't have any issues when running tophat or cufflinks using the same GFF.

For what it is worth, I am unable to validate the file using the GFF validator at genometools.org (with the Seq Ontology option selected):

Validation unsuccessful!

GenomeTools error: the child feature with type 'V_gene_segment' on line 17186 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/_9407.gff3.gz" is not part-of parent feature with type 'gene' given on line 17185 (according to type checker 'OBO file /home/satta/genometools_for_web/gtdata/obo_files/so.obo')

EDIT: Still no luck; the "filtered" GTF causes tophat to throw errors. I found a different GTF for the same genome (NCBI has several), but I am still getting an error about the same entry:

[Fri Apr 22 07:49:38 2016] Converting GTF files to SAM
[07:49:38] Loading reference annotation.
GFF Error: duplicate/invalid 'transcript' feature ID=id102945
        [FAILED]
Error: could not execute gtf_to_sam

The entry in question (new GFF):

NW_015493310.1  Gnomon  exon    2165474 2165578 .       +       .       ID=id102945;Parent=rna8992;Dbxref=GeneID:107506309,Genbank:XM_016136996.1;gbkey=mRNA;gene=REV3L;product=REV3 like%2C DNA directed polymerase zeta catalytic subunit%2C transcript variant X3;transcript_id=XM_016136996.1

What I am having trouble understanding is that this error happens whether I supply cuffmerge with a reference sequence, a reference GFF, both, or neither. So I'm assuming it is a problem with the output of cufflinks.

RNA-Seq cuffmerge cufflinks gtf_to_sam • 7.2k views

ADD COMMENT • link updated 6.7 years ago by 2469296049 • 0 • written 9.2 years ago by pld 5.1k

score 0 · Answer 1 · 2016-04-18

9.2 years ago

andrew.j.skelton73 6.6k

If I remember right, it might have something to do with the feature type ("C_gene_segment"). As far as I can remember, Cufflinks only looks at "exon" features and a few others such as UTRs, and it is specific about this. Cufflinks and Tophat should basically ignore any feature type they aren't looking for (i.e., anything other than exon, 5'UTR, etc.). Cuffmerge, in my experience, is more problematic with this sort of thing, so it might be an idea to grep out anything in your GFF that isn't an exon or UTR feature. See if that does the trick.
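A minimal sketch of that filtering step in Python, assuming a tab-separated GFF/GTF with the feature type in column 3. The function name is hypothetical, and the UTR type names in the default set are an assumption (they are the Sequence Ontology terms); check which names your file actually uses before relying on them.

```python
def keep_feature_types(lines, wanted=("exon", "five_prime_UTR", "three_prime_UTR")):
    """Yield header/comment lines plus records whose feature type is in `wanted`.

    The default type names are an assumption; adjust them to match the file.
    """
    wanted = set(wanted)
    for line in lines:
        if line.startswith("#"):
            # Keep directives such as ##gff-version untouched.
            yield line
            continue
        columns = line.split("\t")
        if len(columns) >= 3 and columns[2] in wanted:
            yield line


# Usage (hypothetical file names):
# with open("reference.gff") as src, open("filtered.gff", "w") as dst:
#     dst.writelines(keep_feature_types(src))
```

Note that this also drops gene and mRNA lines, so any `Parent=` attribute on the surviving records points at a feature that is no longer in the file.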

ADD COMMENT • link 9.2 years ago by andrew.j.skelton73 6.6k

I just tried running it again, but without supplying a reference GFF or genomic FASTA. I get the same error, so it looks like the problem lies with the GTF files generated by cufflinks:

[Mon Apr 18 07:22:20 2016] Preparing output location cuffmerge-noref/
Warning: no reference GTF provided!
[Mon Apr 18 07:24:19 2016] Converting GTF files to SAM
[07:24:19] Loading reference annotation.
GFF Error: duplicate/invalid 'transcript' feature ID=id102945
        [FAILED]
Error: could not execute gtf_to_sam

However, when I grep the transcripts.gtf files from my samples for that ID, nothing turns up.

I'll give filtering a try, really hoping that I don't have to start over...

EDIT: I've tried filtering a few times, but without any success. After doing some more hunting, I noticed that cufflinks has a utility called gffread, which is able to parse/filter/validate GFF files. I've generated a "fixed" GFF file that passes validation, but cuffmerge is still failing. I'll start over from scratch with this new GFF; hopefully that will work.

ADD REPLY • link 9.2 years ago by pld 5.1k

Hi Joe, did you manage to make it work eventually? I came to the same conclusion as you: the NCBI GTF file isn't the problem, but the cufflinks GTF files are. I removed the "duplicate/invalid" feature ID from them, but the same error message popped up with another problematic feature. gffread transcripts.gtf gives the same error message: GFF Error: duplicate/invalid 'transcript' feature ID=rna67847

ADD REPLY • link 9.1 years ago by Kate ▴ 20

Yes, let me dig up what I ended up doing. You can't simply remove the problematic features since they might be the parent of something else. You sort of have to rebuild it and filter out those features as you go.

I think it is a mixture of both the NCBI file and the cufflinks GFF/GTF parsing. What is really problematic is that the documentation claims that all cuff* tools use gffread to parse GFF/GTF files, yet they don't all throw the same errors (or any errors) when parsing.
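The "rebuild and filter as you go" idea can be sketched roughly as follows. This is not the script that was actually used, only an illustration assuming GFF3-style `ID=` and `Parent=` attributes: it drops the named features together with everything that descends from them, so no record is left pointing at a missing parent. All function names are hypothetical.

```python
from collections import defaultdict


def record_attributes(line):
    """Return the attribute dict of a GFF3 record, or None for non-record lines."""
    if line.startswith("#") or not line.strip():
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return None
    attrs = {}
    for field in columns[8].split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def drop_with_descendants(lines, bad_ids):
    """Return the lines with `bad_ids` and all of their descendants removed."""
    lines = list(lines)

    # Pass 1: map each parent ID to the IDs of its direct children.
    children = defaultdict(set)
    for line in lines:
        attrs = record_attributes(line)
        if attrs and "ID" in attrs and "Parent" in attrs:
            for parent in attrs["Parent"].split(","):
                children[parent].add(attrs["ID"])

    # Expand the starting set to every descendant ID.
    doomed = set()
    stack = list(bad_ids)
    while stack:
        current = stack.pop()
        if current not in doomed:
            doomed.add(current)
            stack.extend(children[current])

    # Pass 2: keep a record only if neither it nor any parent is doomed.
    # Simplification: a record with several parents is dropped if any one is doomed.
    kept = []
    for line in lines:
        attrs = record_attributes(line)
        if attrs is not None:
            parents = attrs["Parent"].split(",") if "Parent" in attrs else []
            if attrs.get("ID") in doomed or any(p in doomed for p in parents):
                continue
        kept.append(line)
    return kept
```

Cufflinks output is GTF rather than GFF3, so the same idea there would key on `transcript_id` and `gene_id` instead of `ID` and `Parent`.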

ADD REPLY • link 9.1 years ago by pld 5.1k

score 0 · Answer 2 · 2018-10-16

6.7 years ago

2469296049 • 0

Have you solved this issue? If so, please reply to me directly. Thank you!

ADD COMMENT • link 6.7 years ago by 2469296049 • 0