Splitting a Genbank file into smaller files
3.2 years ago

Hello. I have a GenBank file that I have been trying to run through HMMER with the following command:

hmmscan --domtblout NC_000962.3.final.gbk.out -E 0.000001 prot_profile.hmm NC_000962.3.final.gbk

However, since the file is too big for HMMER, I get the following error:

Fatal exception (source file p7_pipeline.c, line 697): 
Target sequence length > 100K, over comparison pipeline limit. 
(Did you mean to use nhmmer/nhmmscan?) 
Aborted (core dumped)

I need to split this file into smaller files for the HMMER run. The problem is that the ORIGIN field at the bottom of the file, which contains the nucleotide sequence, also has to be split so that each subfile gets its matching chunk. This is where I am getting stuck.

I have already looked at GenBank Parser and seqretsplit to no avail.

I would greatly appreciate some help with this. Thank you.

split genbank HMMER • 1.7k views
3.2 years ago
Mensur Dlakic ★ 27k

I don't think hmmscan is meant for searching protein HMMs against a nucleotide database: HMMER programs do not translate sequences on the fly. Since no protein has more than 100K residues, the program concludes that you probably have the wrong database or the wrong program. Other than that, HMMER programs have no database size limitation that I am aware of.
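To see which sequences would trip that check before running anything, a small script along these lines may help. It is only a sketch: the function names are made up for illustration, and the 100,000 threshold is my reading of the "> 100K" wording in the error message rather than something taken from the HMMER documentation.

```python
import gzip

# Assumption: "100K" in the error message means 100,000 residues.
PIPELINE_LIMIT = 100_000


def open_fasta(path):
    """Open a FASTA file as text, transparently handling .gz files."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def fasta_lengths(lines):
    """Return {sequence_id: length} for an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the identifier.
            name = (line[1:].split() or ["unnamed"])[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def too_long(lengths, limit=PIPELINE_LIMIT):
    """Return the identifiers whose sequence length exceeds the limit."""
    return sorted(name for name, n in lengths.items() if n > limit)
```

With a file on disk, something like `too_long(fasta_lengths(open_fasta("proteins.faa.gz")))` (the file name is a placeholder) should return an empty list for a real proteome, while a whole-genome nucleotide FASTA would be flagged.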

If your HMM is protein-based, as its name suggests, you need a protein version of the same database. Otherwise, if this is a nucleotide-based HMM, follow the suggestion in the error message and use nhmmscan. Either way, it is probably best to use a FASTA database.

Full genome (nucleotides):

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_genomic.fna.gz

Proteome:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_protein.faa.gz
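If you would rather build the protein FASTA from your own GenBank file than download it, the CDS features usually carry a /translation qualifier that can be pulled out directly. Below is a rough, standard-library-only sketch. It assumes the usual flat-file layout (feature keys indented by five spaces, qualifiers further in) and that your annotation actually includes /translation qualifiers. The function names are mine, and a proper parser such as Biopython's would be more robust.

```python
import re


def cds_translations(genbank_text):
    """Return [(name, protein)] for each CDS feature with a /translation."""
    blocks = []          # one list of lines per CDS feature
    block = None         # CDS feature currently being collected, if any
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features, block = True, None
            continue
        if not in_features:
            continue
        if line[:1] not in (" ", ""):
            # A new top-level section (e.g. ORIGIN) ends the feature table.
            in_features, block = False, None
            continue
        if re.match(r" {5}\S", line):
            # A new feature starts; keep it only if it is a CDS.
            block = [line] if line.split()[0] == "CDS" else None
            if block is not None:
                blocks.append(block)
        elif block is not None:
            block.append(line)

    proteins = []
    for block in blocks:
        text = "\n".join(block)
        match = re.search(r'/translation="([^"]*)"', text)
        if not match:
            continue  # e.g. pseudogenes without a translation
        sequence = re.sub(r"\s+", "", match.group(1))
        name = "cds_%d" % (len(proteins) + 1)
        for key in ("protein_id", "locus_tag"):
            found = re.search(r'/%s="([^"]*)"' % key, text)
            if found:
                name = found.group(1)
                break
        proteins.append((name, sequence))
    return proteins


def to_fasta(proteins, width=60):
    """Format (name, sequence) pairs as FASTA text with wrapped lines."""
    out = []
    for name, seq in proteins:
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

Writing `to_fasta(cds_translations(open("NC_000962.3.final.gbk").read()))` to a file should give a protein FASTA that can be passed to hmmscan in place of the GenBank file.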
