Question

Splitting a Genbank file into smaller files

4.4 years ago

Bhushan Dhamale • 0

Hello. I have a GenBank file that I have been trying to run through HMMER using the following command:

hmmscan --domtblout NC_000962.3.final.gbk.out -E 0.000001 prot_profile.hmm NC_000962.3.final.gbk

However, since the file is too big for HMMER, I get the following error:

Fatal exception (source file p7_pipeline.c, line 697): 
Target sequence length > 100K, over comparison pipeline limit. 
(Did you mean to use nhmmer/nhmmscan?) 
Aborted (core dumped)

I need to split this file into smaller files for the HMMER run. The problem is that the ORIGIN field at the bottom of the file (which contains the nucleotide sequence) would also need to be split, so that each subfile gets the chunk of sequence that matches its features. This is where I am getting stuck.

I have already looked at GenBank Parser and seqretsplit to no avail.

I would greatly appreciate some help with this. Thank you.

split genbank HMMER • 2.2k views

updated 4.4 years ago by Mensur Dlakic • written 4.4 years ago by Bhushan Dhamale

score 1 · Answer 1 · 2021-01-28

I don't think hmmscan is meant for searching protein HMMs against a nucleotide database: HMMER programs do not translate sequences on the fly. Since there are no proteins with more than 100K residues, the program concludes that you probably have the wrong database or the wrong program. Other than that, HMMER programs have no database size limitation that I am aware of.

If your HMM is protein-based, as its name suggests, you need a protein version of the same database. Otherwise, if this is a nucleotide-based HMM, follow the suggestion in the error message and use nhmmscan. Either way, it is probably best to use a FASTA database.

Full genome (nucleotides):

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_genomic.fna.gz

Proteome:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_protein.faa.gz
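
If you would rather build the protein FASTA from your own GenBank file than download it, the following is a minimal sketch using only the Python standard library. It assumes that the CDS features in the file carry /translation qualifiers (typical for NCBI annotations, but not guaranteed for every file) and that the feature table follows the standard column layout. The function names and the output file name are hypothetical, chosen for illustration. A full parser such as Biopython's Bio.SeqIO is likely the more robust choice for real work.

```python
def genbank_proteins(lines):
    """Yield (identifier, protein) for every CDS that has a /translation.

    Assumes the standard GenBank layout: feature keys sit within the
    first 21 columns, qualifier lines are indented past them.
    """
    features = []        # qualifier dicts of completed CDS features
    in_features = False  # True while inside a FEATURES table
    current = None       # qualifiers of the CDS being read, else None
    key = None           # last qualifier seen, for continuation lines
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if in_features and line and not line.startswith(" "):
            # A top-level keyword (ORIGIN, CONTIG, ...) ends the table.
            if current is not None:
                features.append(current)
            current, key, in_features = None, None, False
            continue
        if not in_features:
            continue
        if line[:21].strip():
            # A new feature starts; keep only CDS features.
            if current is not None:
                features.append(current)
            current = {} if line[5:21].strip() == "CDS" else None
            key = None
        elif current is not None:
            body = line.strip()
            if body.startswith("/"):
                key, _, value = body[1:].partition("=")
                current[key] = value
            elif key is not None:
                # Continuation of a multi-line qualifier value.
                current[key] += body
    if current is not None:
        features.append(current)
    for quals in features:
        if "translation" in quals:
            ident = quals.get("protein_id") or quals.get("locus_tag") or "unknown"
            yield ident.strip('"'), quals["translation"].strip('"')


def format_fasta(records, width=60):
    """Return FASTA text for (identifier, sequence) pairs."""
    out = []
    for ident, seq in records:
        out.append(">" + ident)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n" if out else ""


def write_protein_fasta(gbk_path, fasta_path):
    """Read a GenBank file and write its CDS translations as FASTA."""
    with open(gbk_path) as handle:
        text = format_fasta(genbank_proteins(handle))
    with open(fasta_path, "w") as out:
        out.write(text)
```

Calling write_protein_fasta("NC_000962.3.final.gbk", "NC_000962.3.proteins.faa") would give a protein FASTA file that can take the place of the GenBank file as the last argument of the hmmscan command in the question. CDS features without a /translation (pseudogenes, for example) are skipped.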