Hello everyone,
I'm new to the Biostars community and to bioinformatics, but I already have a question. I run into a problem when I try to run pfam_scan.pl. After translating all CDS from my input GFF3 genome file with gffread, I want to identify all domains in my proteome with PfamScan. But the script stops immediately and prints an error:
'FATAL: Sequence identifiers must be unique. Your fasta file contains two sequences with the same id'
Sure, the error message is self-explanatory, but I don't know how to solve the issue. Should I change the gffread options, or is the GFF3 file that I obtained from ensembl.org not suited for this purpose? Or could sequences with the same ID arise from trans-splicing? I don't think I can just delete every problematic transcript entry from my fasta file, as this would surely introduce some bias into my data.
Any help is much appreciated!
No, you should not simply delete the redundant ones right away. First, have a look at why they are redundant. Can you track down which IDs they are and then post the relevant GFF lines for those entries?
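If it helps with tracking them down, here is a minimal Python sketch that lists the IDs occurring more than once in a fasta file. It assumes the ID is the first whitespace-delimited token after the '>' on each header line, which may or may not match exactly how pfam_scan.pl reads IDs. The function names and the file name in the usage line are made up for illustration.

```python
from collections import Counter
import sys


def fasta_ids(lines):
    """Yield the ID of each fasta record found in an iterable of lines."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the ID is the first whitespace-delimited token
            # of the header; everything after it is treated as description.
            yield header.split()[0] if header else ""


def duplicate_ids(lines):
    """Return a dict mapping each ID seen more than once to its count."""
    counts = Counter(fasta_ids(lines))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print each duplicated ID and how often it occurs, tab-separated.
    with open(sys.argv[1]) as handle:
        for seq_id, n in sorted(duplicate_ids(handle).items()):
            print(f"{seq_id}\t{n}")
```

Saved as, say, find_dup_ids.py, it could be run as `python find_dup_ids.py proteome.fa`. The IDs it prints can then be searched for in the GFF3 file to see which features they come from.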
Thank you for the fast reply. A total of 67 sequence IDs occur more than once in the fasta file, so I will just post two examples here: GFF3_example
I don't see any redundancy in the GFF file at first sight. Can you post the IDs from the fasta file that are redundant (or at least the ones that are relevant for the GFF file you provided)?