Splitting a Genbank file into smaller files
6 months ago

Hello. I have a genbank file that I have been trying to run in HMMER using the following command:

hmmscan --domtblout NC_000962.3.final.gbk.out -E 0.000001 prot_profile.hmm NC_000962.3.final.gbk

However, since the file is too big for HMMER, I get the following error:

Fatal exception (source file p7_pipeline.c, line 697): 
Target sequence length > 100K, over comparison pipeline limit. 
(Did you mean to use nhmmer/nhmmscan?) 
Aborted (core dumped)

I need to split this file into smaller files for the HMMER run. The problem is that each subfile would also need the corresponding chunk of the ORIGIN field (the nucleotide sequence at the bottom of the file). This is where I am getting stuck.

I have already looked at GenBank Parser and seqretsplit to no avail.

I would greatly appreciate some help with this. Thank you.

split genbank HMMER • 342 views
6 months ago
Mensur Dlakic ★ 12k

I don't think hmmscan is meant for searching protein HMMs against a nucleotide database: HMMER programs do not translate sequences on the fly. Since there are no proteins with more than 100K residues, the program concludes that you probably have the wrong database or the wrong program. Other than that, HMMER programs have no database size limitation that I am aware of.

If your HMM is protein-based, as its name suggests, you need a protein version of the same database. Otherwise, follow the suggestion in the error message and use nhmmscan if this is a nucleotide-based HMM. Either way, it is probably best to use a FASTA database.
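One way to get FASTA out of the GenBank file is sketched below. This is only an illustration using the Python standard library, not a full GenBank parser: it assumes a single-record file, takes the nucleotides from the ORIGIN block, and takes the proteins from the /translation qualifiers of the annotated features. The function name genbank_to_fasta and the numbered protein IDs are made up for this example.

```python
import re


def wrap(seq, width=60):
    """Break a sequence into fixed-width lines for FASTA output."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


def genbank_to_fasta(gbk_text):
    """Return (nucleotide_fasta, protein_fasta) for one GenBank record.

    Assumption: gbk_text holds a single record. Protein IDs are just
    numbered in file order (hypothetical naming for this sketch).
    """
    # Record name from the LOCUS line, with a fallback if it is missing.
    locus = re.search(r"^LOCUS\s+(\S+)", gbk_text, re.M)
    name = locus.group(1) if locus else "record"

    # Nucleotides: text between the ORIGIN line and the closing "//",
    # with position numbers and whitespace removed.
    origin = re.search(r"^ORIGIN.*?\n(.*?)^//", gbk_text, re.M | re.S)
    seq = re.sub(r"[^A-Za-z]", "", origin.group(1)).upper() if origin else ""
    nuc_fasta = f">{name}\n{wrap(seq)}\n" if seq else ""

    # Proteins: every /translation="..." qualifier; these often span
    # several lines, so internal whitespace is stripped.
    proteins = [
        re.sub(r"\s+", "", p)
        for p in re.findall(r'/translation="([^"]+)"', gbk_text)
    ]
    prot_fasta = "".join(
        f">{name}_prot{i}\n{wrap(p)}\n" for i, p in enumerate(proteins, 1)
    )
    return nuc_fasta, prot_fasta
```

To use it, read the GenBank file into a string, call genbank_to_fasta on it, and write the two returned strings to separate files. The protein FASTA would then be the sequence file to give hmmscan with a protein HMM, and the nucleotide FASTA the one to use with the nucleotide tools. If the GenBank file has no /translation qualifiers, the protein output will be empty and the proteins would have to be obtained some other way.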

Full genome (nucleotides):




