Question

duplicate/invalid 'transcript' feature in cuffmerge

6.1 years ago

bioinfo_ga ▴ 70

Hi! I am aligning reads with HISAT against the banana genome downloaded from ( http://banana-genome-hub.southgreen.fr/). I then used Cufflinks (2.2.1) for expression estimation, which runs fine, but the cuffmerge step fails with the following error: "duplicate/invalid 'transcript' feature ID=Ma03_t01040.3". I also converted the GFF to GTF, but the same error persists, and removing this ID from the GFF results in the same error with another ID. Any input would be appreciated.

RNA-Seq • 2.3k views

Could you please run grep "Ma03_t01040.3" on your GFF file? The error is "duplicate/invalid 'transcript'", so you need to investigate that first. There are also some threads on this subject: https://biostar.usegalaxy.org/p/17359/ https://github.com/cole-trapnell-lab/cufflinks/issues/77
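
If the grep output gets unwieldy, a short script can list every ID that occurs more than once on transcript-level features. This is only a sketch: it assumes a tab-delimited GFF3 file in which transcripts are `mRNA` or `transcript` features carrying an `ID=` attribute. The function name `duplicate_feature_ids` and the sample IDs are made up for illustration, and the check does not reproduce cuffmerge's own validation.

```python
from collections import Counter


def duplicate_feature_ids(gff_lines, feature_types=("mRNA", "transcript")):
    """Return {ID: count} for IDs seen more than once on the given feature types."""
    counts = Counter()
    for line in gff_lines:
        # Skip blank lines and GFF3 directives/comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in feature_types:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        if "ID" in attrs:
            counts[attrs["ID"]] += 1
    return {fid: n for fid, n in counts.items() if n > 1}


# Hypothetical lines, only to show the expected input shape.
sample = [
    "##gff-version 3",
    "chr03\tsrc\tmRNA\t100\t900\t.\t-\t.\tID=tx1;Parent=gene1",
    "chr03\tsrc\texon\t100\t300\t.\t-\t.\tParent=tx1",
    "chr03\tsrc\tmRNA\t100\t900\t.\t-\t.\tID=tx1;Parent=gene1",
    "chr03\tsrc\tmRNA\t1000\t1900\t.\t+\t.\tID=tx2;Parent=gene2",
]
print(duplicate_feature_ids(sample))
```

To run it on the real annotation, pass an open file handle instead of the list, for example `duplicate_feature_ids(open("musa_acuminata_v2.gff3"))`.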

6.1 years ago by Bastien Hervé 5.3k

Running grep "Ma03_t01040.3" gives the following result:

> chr03 manual_curation exon    836456  836913  .   -   .   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation exon    837103  837214  .   -   .   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation exon    837626  837723  .   -   .   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation exon    837832  837939  .   -   .   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation exon    838029  838067  .   -   .   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation exon    838163  838234  .   -   .   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation exon    838316  838579  .   -   .   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation exon    839379  839646  .   -   .   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation CDS 836665  836913  .   -   0   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation CDS 837103  837214  .   -   1   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation CDS 837626  837723  .   -   0   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation CDS 837832  837939  .   -   0   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation CDS 838029  838067  .   -   0   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation CDS 838163  838234  .   -   0   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation CDS 838316  838579  .   -   0   transcript_id
    > "Ma03_t01040.3";
    > chr03 manual_curation CDS 839379  839570  .   -   0   transcript_id
    > "Ma03_t01040.3";

Removing these lines gives the same error for some other ID.

6.1 years ago by bioinfo_ga ▴ 70

This indentation is hard to read. Why is "Ma03_t01040.3" on a new line? Do you have a link to your GFF? Is it perhaps this one ( http://banana-genome-hub.southgreen.fr/sites/banana-genome-hub.southgreen.fr/files/data/gff3/version2/musa_acuminata_v2.gff3 )?

6.1 years ago by Bastien Hervé 5.3k

Yes, the same GFF was used for the analysis.

6.1 years ago by bioinfo_ga ▴ 70

Yes, because all of your transcript names are duplicated, not only this one.

6.1 years ago by Bastien Hervé 5.3k

All the features of a given transcript (its exon and CDS lines) share the same name.
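
To make that concrete, here is a rough sketch that groups the lines of the converted GTF by `transcript_id` and counts the feature types under each one. It assumes the file is tab-delimited with the attribute written as `transcript_id "..."`, as in the grep output above; the function name `features_per_transcript` is just illustrative.

```python
import re
from collections import Counter, defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def features_per_transcript(gtf_lines):
    """Map each transcript_id to a {feature_type: count} dictionary."""
    summary = defaultdict(Counter)
    for line in gtf_lines:
        # Skip blank lines and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        match = TRANSCRIPT_ID.search(cols[8])
        if match:
            summary[match.group(1)][cols[2]] += 1
    return {tid: dict(types) for tid, types in summary.items()}


# Two of the lines from the grep output, rejoined and tab-separated.
sample = [
    'chr03\tmanual_curation\texon\t836456\t836913\t.\t-\t.\ttranscript_id "Ma03_t01040.3";',
    'chr03\tmanual_curation\tCDS\t836665\t836913\t.\t-\t0\ttranscript_id "Ma03_t01040.3";',
]
print(features_per_transcript(sample))
```

A transcript ID shared by several exon and CDS lines is the normal layout, so this summary only shows how the features are grouped; it does not by itself say what cuffmerge is rejecting.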

6.1 years ago by bioinfo_ga ▴ 70

From what I can read elsewhere, everyone uses a GFF or GTF from Ensembl, and it works well.

So, let's try with the Ensembl Plants database:

ftp://ftp.ensemblgenomes.org/pub/plants/release-38/gff3/musa_acuminata/Musa_acuminata.MA1.38.gff3.gz

This should work, because I think the lines in your GFF file whose 9th column starts with "Parent=" are what upset cuffmerge. In the GFF from Ensembl, these lines are removed.
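
To check that guess against either file, a small sketch like the following counts, per feature type, the lines whose 9th column begins with "Parent=". It assumes a tab-delimited GFF3; the function name `parent_first_feature_types` and the sample lines are hypothetical. Whether these lines are really what cuffmerge objects to is not verified here.

```python
from collections import Counter


def parent_first_feature_types(gff_lines):
    """Count feature types of lines whose 9th column starts with 'Parent='."""
    counts = Counter()
    for line in gff_lines:
        # Skip blank lines and GFF3 directives/comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        if cols[8].startswith("Parent="):
            counts[cols[2]] += 1
    return counts


# Hypothetical lines, only to show the expected input shape.
sample = [
    "##gff-version 3",
    "chr03\tsrc\tmRNA\t100\t900\t.\t-\t.\tID=tx1;Parent=gene1",
    "chr03\tsrc\texon\t100\t300\t.\t-\t.\tParent=tx1",
    "chr03\tsrc\texon\t500\t900\t.\t-\t.\tParent=tx1",
]
print(parent_first_feature_types(sample))
```

Running it on both the banana-genome-hub GFF and the Ensembl one would show whether the two files actually differ in this respect.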

6.1 years ago by Bastien Hervé 5.3k