How can I find KO IDs for ORF sequences in a large FASTA file?
4 months ago
Nikesh • 0

Hi,

I have a protein sequence file (about 14.9 GB) in FASTA format. Each sequence has an ORF ID in the header line. I want to find the KEGG Orthology (KO) IDs that match these ORFs.

Can someone please suggest a tool or workflow that can handle large files and help me map ORF IDs to KO IDs?

Thanks in advance!

KEGG ORF • 898 views
4 months ago
Mensur Dlakic ★ 30k

There is a tool made exactly for that purpose:

https://github.com/takaram/kofam_scan
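With a file as large as 14.9 GB, it may help to split the FASTA into smaller chunks and annotate each chunk as a separate job. The following is only a rough sketch of that idea, not something KofamScan provides: `split_fasta`, the chunk size, and the `chunk_NNNN.faa` naming are all hypothetical choices made for illustration, and the toy input stands in for the real file.

```shell
#!/bin/bash
set -euo pipefail

# split_fasta INPUT SEQS_PER_CHUNK OUTDIR
# Hypothetical helper (not part of KofamScan): writes OUTDIR/chunk_0001.faa,
# chunk_0002.faa, ... with at most SEQS_PER_CHUNK sequences in each file.
split_fasta() {
  local input="$1" per_chunk="$2" outdir="$3"
  mkdir -p "$outdir"
  awk -v n="$per_chunk" -v dir="$outdir" '
    /^>/ {
      # Start a new chunk file every n header lines.
      if (count % n == 0) {
        if (out != "") close(out)
        out = sprintf("%s/chunk_%04d.faa", dir, count / n + 1)
      }
      count++
    }
    # Header and sequence lines both go to the current chunk.
    out != "" { print > out }
  ' "$input"
}

# Toy demonstration: 5 made-up sequences split into chunks of 2.
demo_dir=$(mktemp -d)
printf '>orf1\nMKT\n>orf2\nMAA\nGGH\n>orf3\nMLL\n>orf4\nMPP\n>orf5\nMQQ\n' > "$demo_dir/toy.faa"
split_fasta "$demo_dir/toy.faa" 2 "$demo_dir/chunks"
ls "$demo_dir/chunks"
```

For the real file the call would look something like `split_fasta proteins.faa 100000 chunks`, where both the file name and the chunk size are placeholders to adjust for your cluster. Each chunk can then be passed to `exec_annotation` on its own, and the per-chunk outputs concatenated afterwards.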


Hi, I tried to work with this, but some errors are occurring. Do you have code or any source material I could work from? Mensur Dlakic


I don't think anyone can help you when the only feedback you provide is "there are some errors occurring." If I told you that I tried to build a house but there were some problems, would you be able to offer any advice to me?

What I do know is when I installed all the dependencies outlined on that GitHub page and provided correct input files, everything worked. An educated guess is that you didn't do one or the other.


Mensur Dlakic

Hi, I set up the environment on HCC, and my FASTA file contains 98 sequences. This is my SLURM script. I've tried running it with different time limits, without success.

#!/bin/bash
#SBATCH --job-name=kofamscan
#SBATCH --output=kofamscan.out
#SBATCH --error=kofamscan.err
#SBATCH --time=5:59:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=8


source ~/miniconda3/etc/profile.d/conda.sh
conda activate kofamscan_env

./exec_annotation \
  -o kofam_output.txt \
  -f detail-tsv \
  -p profiles/ \
  -k ko_list \
  --cpu 8 \
  test.faa

It keeps giving the following error. I also tried changing the CPU allocation.

“slurmstepd: error: * JOB 10654468 ON c2023 CANCELLED AT 2025-06-10T21:46:45 DUE TO TIME LIMIT *”

What should I do? What could be the issue?


"JOB 10654468 ON c2023 CANCELLED AT 2025-06-10T21:46:45 DUE TO TIME LIMIT"

You are asking for one minute less than 6 hours in your SLURM request, so the job is killed once that limit is reached. Ask for more time, e.g. --time=1-0 (this would be one day).


What GenoMax said. I suggest you find out what the SLURM time limit is on your cluster and set it to the maximum value allowed. This would request 6 days:

#SBATCH --time=6-00:00:00

Also, why not ask for more than 8 CPUs?
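Once a run does finish, the detail-tsv output requested in the script above still has to be reduced to an ORF-to-KO table. Below is a minimal sketch of that step. It assumes the detail-tsv layout is tab-separated with a `*` in the first column for hits that score above the KO's threshold, followed by the gene name and the KO ID; please check this against your own output file. `ko_map` is a hypothetical helper name, and the rows in the toy input are made up.

```shell
#!/bin/bash
set -euo pipefail

# ko_map DETAIL_TSV
# Hypothetical helper: prints "ORF<TAB>KO" for every row whose first column
# is "*" (assumed to mark hits above the KO-specific score threshold).
ko_map() {
  awk -F'\t' '$1 == "*" { print $2 "\t" $3 }' "$1"
}

# Toy input in the assumed detail-tsv layout. All values are made up.
demo=$(mktemp)
printf '#\tgene name\tKO\tthrshld\tscore\tE-value\t"KO definition"\n' > "$demo"
printf '*\torf1\tK00001\t100.00\t250.0\t1e-70\t"example definition"\n' >> "$demo"
printf '\torf1\tK00002\t300.00\t50.0\t1e-05\t"example definition"\n' >> "$demo"
printf '*\torf2\tK00003\t80.00\t120.5\t1e-30\t"example definition"\n' >> "$demo"

# Only the two starred rows are printed.
ko_map "$demo"
```

On the real output this would be `ko_map kofam_output.txt > orf_to_ko.tsv`, where `orf_to_ko.tsv` is just a placeholder name. If you only need the mapping and not the scores, the `-f mapper` output format listed in the KofamScan README may already give you a gene-to-KO table directly.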
