Error when running PfamScan: Fasta file contains two sequences with same ID
5.9 years ago

Hello everyone,

I'm new to the Biostars community and to the bioinformatics field, but I already have a question. I am running into a problem when I try to run pfam_scan.pl. After translating all CDS from my input GFF3 genome file with gffread, I want to identify all domains in my proteome with PfamScan. However, the script stops immediately and prints an error:

'FATAL: Sequence identifiers must be unique. Your fasta file contains two sequences with the same id'

Sure, this error message is self-explanatory, but I don't know how to solve the issue. Should I change the gffread options, or is the GFF3 file I obtained from ensembl.org not suited for this purpose? Or could sequences with the same ID result from trans-splicing? I don't think I can just delete every problematic transcript entry from my fasta file, as this would surely introduce some bias into my data.

Any help is much appreciated!

genome sequence proteome pfamscan fasta • 1.4k views

No, you should not simply delete the redundant ones (at least not immediately). First find out why they are redundant. Can you track down which IDs they are and then post the relevant GFF lines for those entries?
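If it helps with tracking them down, here is a minimal Python sketch that counts how often each ID appears in a fasta file. It assumes the sequence ID is the header text after `>` up to the first whitespace, which is the usual convention but may not be exactly what pfam_scan.pl compares. The function names and the file name `proteome.fa` are made up for illustration.

```python
from collections import Counter


def fasta_ids(lines):
    """Yield one ID per FASTA record.

    Assumption: the ID is the header text after '>' up to the first whitespace.
    """
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            yield parts[0] if parts else ""


def duplicated_ids(lines):
    """Return {id: count} for every ID that appears more than once."""
    counts = Counter(fasta_ids(lines))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


def report_duplicates(path):
    """Print each duplicated ID and its count, tab-separated."""
    with open(path) as handle:
        for seq_id, n in sorted(duplicated_ids(handle).items()):
            print(f"{seq_id}\t{n}")


# Example call (hypothetical file name):
# report_duplicates("proteome.fa")
```

The printed IDs can then be looked up in the GFF3 file to see which features they come from.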


Thank you for the fast reply. A total of 67 sequence IDs occur more than once in the fasta file, so I will post just two examples here: GFF3_example


I don't see any redundancy in the GFF file at first sight. Can you post the IDs from the fasta file that are redundant (or at least the ones that are relevant for the GFF file you provided)?
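To match a fasta ID back to the GFF file, a rough Python sketch like the one below could pull out every feature line that refers to that ID. It assumes standard nine-column, tab-separated GFF3 and that the ID appears as an attribute value, either bare or with a type prefix such as `transcript:` as in Ensembl GFF3 files. The function names are hypothetical.

```python
def attribute_values(attr_column):
    """Split a GFF3 attribute column (key=value;key=value) into its values."""
    values = []
    for field in attr_column.split(";"):
        if "=" in field:
            _, value = field.split("=", 1)
            # An attribute may hold several comma-separated values.
            values.extend(value.split(","))
    return values


def gff_lines_for_id(lines, seq_id):
    """Return feature lines with an attribute value equal to seq_id.

    Assumption: a value may carry a 'type:' prefix (e.g. 'transcript:ID'),
    so a value ending in ':' + seq_id also counts as a match.
    """
    hits = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        for value in attribute_values(cols[8]):
            if value == seq_id or value.endswith(":" + seq_id):
                hits.append(line.rstrip("\n"))
                break
    return hits
```

Matching whole attribute values rather than substrings avoids picking up longer IDs that merely start with the same characters.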
