Question: InterProScan standalone - error connected to non-unique fasta identifiers
23 months ago, al-ash (70), Japan/Okinawa/OIST, wrote:

Problem: InterProScan, run on a nucleotide multi-FASTA input that contains several sequences with non-unique names, consistently returns no output.

Description: I'm running InterProScan (InterProScan-5.21-60.0) on Linux in standalone mode. In test searches where I look only for GO terms and search only the Pfam database, I use the test multi-FASTA file provided in the InterProScan package (test_nt_redundant.fasta). This file also includes some entries that differ in sequence but have non-unique names (see below), and the analysis runs without any problems.

interproscan.sh -i test_nt_redundant.fasta -b output -goterms -appl Pfam  -t n

The fasta headers in the test file are:

>A2YIW7
>Bob
>ENA|AACH01000026|AACH01000026.1 Saccharomyces mikatae IFO 1815 YM4906-Contig2858, whole genome shotgun sequence.
>ENA|AACH01000027|AACH01000027.2 Saccharomyces mikatae IFO 1815 YM4906-Contig2858, whole genome shotgun sequence.
>Henry
>reverse translation of P22298
>reverse translation of P22298
>Wilf

However, when I run the same analysis on a set of 15 sequences that I'd like to annotate, which also contains some entries with non-unique identifiers, I consistently receive the following message and InterProScan ends without any output:

Found 3 non unique identifier(s). These identifiers do have different sequences, within the FASTA nucleotide sequence input file.
    Please find below a list of detected identifiers:
    100646091
    100646573
    100645787
    InterProScan will shutdown, because there is no way to map nucleic sequences and predicted proteins.

Remarkably, the returned list of non-unique identifiers is not even complete (see below for the list of FASTA headers in the 15-sequence set):

>100645110
>100645230
>100645431
>100645550
>100645666
>100645666
>100645666
>100645787
>100645787
>100645973
>100646091
>100646091
>100646214
>100646573
>100646573

Additionally, when I remove the entries with non-unique identifiers from the 15-sequence set, the analysis runs without any problem, so I guess the problem is somehow connected to the number of non-unique FASTA identifiers in the input.

What might be the source of this error, and how can I solve it? Thanks in advance for any hints.


Just add a unique identifier to non-unique fasta headers. It makes sense to stop on non-unique identifiers since if IDs are not unique, you wouldn't be able to unambiguously associate the results with a sequence.
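One possible way to do this, as a minimal sketch rather than anything shipped with InterProScan: the hypothetical helper `make_headers_unique` below appends `_1`, `_2`, ... to every identifier that occurs more than once. It assumes the identifier is the header text up to the first whitespace, and it does not check whether a suffixed name collides with an identifier that already exists in the file.

```python
from collections import Counter


def make_headers_unique(lines):
    """Append _1, _2, ... to each FASTA identifier that occurs more than once.

    Assumption: the identifier is the first whitespace-delimited word of the header.
    """
    lines = list(lines)

    # First pass: count how often each identifier occurs.
    totals = Counter()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            totals[line[1:].split(None, 1)[0]] += 1

    # Second pass: rewrite only the headers whose identifier is repeated.
    seen = Counter()
    out = []
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            parts = line[1:].rstrip("\n").split(None, 1)
            ident = parts[0]
            rest = parts[1] if len(parts) > 1 else ""
            if totals[ident] > 1:
                seen[ident] += 1
                ident = f"{ident}_{seen[ident]}"
            out.append(">" + ident + (" " + rest if rest else "") + "\n")
        else:
            # Sequence lines (and empty headers) pass through unchanged.
            out.append(line)
    return out
```

The function takes and returns a list of lines, so it can be fed the result of `readlines()` on the input file and its output written back with `writelines()` before running interproscan.sh on the renamed file.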

written 23 months ago by Jean-Karim Heriche (17k)

Thanks for the reply! I thought that InterProScan would take care of non-unique identifiers itself (e.g. by adding a numeric suffix), but I went through the manual once again and indeed it does not (https://github.com/ebi-pf-team/interproscan/wiki/ScanNucleicAcidSeqs). The reason it did not return an error with the sample set is that the two sequences with identical identifiers in that multi-FASTA file also have identical sequences, in which case InterProScan simply merges them into one sequence.
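To check an input file for this situation before running a search, a small sketch along the following lines could list the identifiers that appear with more than one distinct sequence. The function names are hypothetical, the identifier is assumed to be the first word of the header, and sequences are compared case-insensitively; InterProScan's own comparison may differ in detail.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    ident, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if ident is not None:
                yield ident, "".join(chunks)
            words = line[1:].split()
            # Assumption: the identifier is the first word of the header.
            ident, chunks = (words[0] if words else ""), []
        elif line and ident is not None:
            chunks.append(line)
    if ident is not None:
        yield ident, "".join(chunks)


def find_conflicting_ids(lines):
    """Return identifiers that occur with more than one distinct sequence."""
    seqs_by_id = defaultdict(set)
    for ident, seq in read_fasta(lines):
        seqs_by_id[ident].add(seq.upper())
    return sorted(i for i, seqs in seqs_by_id.items() if len(seqs) > 1)
```

Identifiers that are repeated with an identical sequence are not reported by this check, which would be consistent with a repeated identifier being absent from the error message.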

written 23 months ago by al-ash (70)