Question

Cuffmerge GFF Error: duplicate/invalid 'transcript'

7.5 years ago

mgoldste • 0

Hi everyone, I am trying to run cuffmerge on the catfish genome from https://www.ncbi.nlm.nih.gov/genome/198, using the GFF file ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/660/625/GCF_001660625.1_IpCoco_1.2/GCF_001660625.1_IpCoco_1.2_genomic.gff.gz, but I keep getting the same error. I have also tried converting the GFF to GTF, but the error persists:

GFF Error: duplicate/invalid 'transcript' feature ID=id47527
[FAILED]
Error: could not execute gtf_to_sam

Can anyone point me to a solution?

RNA-Seq genome • 3.2k views

updated 7.5 years ago by Kevin Blighe 89k • written 7.5 years ago by mgoldste • 0

score 1 · Answer 1 · 2018-01-28

This error has been reported in multiple places. In one GitHub issue thread, one of the developers has offered help: Cuffmerge GFF Error: duplicate/invalid 'transcript' #77.

In addition, I suggest switching to HISAT2 and StringTie, which are the successors to TopHat2 and Cufflinks. This alone may solve the issue.

Finally, why not explore your GFF by using grep to extract the features with ID id47527 and then try to understand why the error may have occurred. Become your own investigator.
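If grep is not convenient, the same lookup can be sketched in Python. This is only a minimal sketch, assuming a tab-delimited GFF3 file with nine columns; the helper names (parse_attributes, features_for_id, open_gff) are made up for illustration and are not part of Cufflinks or any other tool. Unlike a plain grep, it matches id47527 exactly in the ID and Parent attributes, so id475270 would not be picked up by accident.

```python
import gzip


def parse_attributes(field):
    """Turn a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def features_for_id(lines, feature_id):
    """Return the records whose ID or Parent attribute equals feature_id."""
    hits = []
    for line in lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:  # assumption: well-formed GFF3 has 9 columns
            continue
        attrs = parse_attributes(cols[8])
        parents = attrs.get("Parent", "").split(",")
        if attrs.get("ID") == feature_id or feature_id in parents:
            hits.append(cols)
    return hits


def open_gff(path):
    """Open a plain or gzip-compressed GFF file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


# Example use (the path is whatever you downloaded the annotation to):
# with open_gff("GCF_001660625.1_IpCoco_1.2_genomic.gff.gz") as handle:
#     for cols in features_for_id(handle, "id47527"):
#         print("\t".join(cols))
```

Column 3 of each returned record is the feature type, which is the first thing worth looking at.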

Kevin

-------------------------

Curiosity got the better of me, so here are the GFF entries for id47527 and its child exons:

NC_030417.1 Gnomon  C_gene_segment  22233544    22237573    .   -   .   ID=id47527;Parent=gene2052;Dbxref=GeneID:108277590
NC_030417.1 Gnomon  exon    22237278    22237573    .   -   .   ID=id47528;Parent=id47527;Dbxref=GeneID:108277590
NC_030417.1 Gnomon  exon    22236835    22237158    .   -   .   ID=id47529;Parent=id47527;Dbxref=GeneID:108277590
NC_030417.1 Gnomon  exon    22235738    22236055    .   -   .   ID=id47530;Parent=id47527;Dbxref=GeneID:108277590
NC_030417.1 Gnomon  exon    22234411    22234698    .   -   .   ID=id47531;Parent=id47527;Dbxref=GeneID:108277590
NC_030417.1 Gnomon  exon    22233544    22234110    .   -   .   ID=id47532;Parent=id47527;Dbxref=GeneID:108277590

This does not look like a typical transcript entry: the exons' parent feature is typed C_gene_segment rather than something like mRNA. There's also a note written into the GFF for this transcript:

Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: added 477 bases not found in genome assembly;exception=annotated by transcript or proteomic data;

My guess is that it's an antibody gene segment, part of the constant (C) region.
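To see whether id47527 is a one-off or whether the annotation contains many such entries, one could tally the feature type of every record that an exon names as its Parent. The following is a rough sketch under the same assumption of tab-delimited, nine-column GFF3; exon_parent_types is a hypothetical helper name, and whether the unusual types it reports are really what trips up cuffmerge is something to confirm against the GitHub thread rather than take from this script.

```python
from collections import Counter


def exon_parent_types(lines):
    """Count the feature type (column 3) of each record that exons point to."""
    type_by_id = {}       # feature ID -> feature type
    exon_parents = set()  # every ID used as Parent by an exon
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:  # assumption: well-formed GFF3 has 9 columns
            continue
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        if "ID" in attrs:
            type_by_id[attrs["ID"]] = cols[2]
        if cols[2] == "exon":
            exon_parents.update(attrs.get("Parent", "").split(","))
    # Parents that never appear as an ID are silently ignored here.
    return Counter(type_by_id[p] for p in exon_parents if p in type_by_id)
```

Any type in the resulting counts that is not a usual transcript-level type (for example C_gene_segment, as above) points to records worth inspecting by hand.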