Counting Contigs in Fasta/GBK File
21 months ago

I want to check the number of contigs in either a FASTA or a GBK file. I am aware that tools such as CheckM can do this, but is there a direct way to count the contigs in a file with Python or Biopython?

checkm contigs genome
You can try the basic utilities in *nix.

Like with grep commands, etc.?

An easy grep solution for counting the entries in a GenBank file is to count the LOCUS lines:

grep -c "^LOCUS" multigenbank.gb


For a multifasta, you can use ^> instead of LOCUS as you have noted.
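Since the question asked for Python, here is a minimal sketch of the same line-counting idea without any external tools. The function name `count_header_lines` and the demo file name are made up for illustration; like grep, it assumes each record starts with exactly one line beginning with the marker.

```python
import os
import tempfile


def count_header_lines(path, marker):
    """Count the lines in `path` that start with `marker`, one line at a time."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(marker))


# Demo on a small throwaway FASTA file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "test.fa")
    with open(fasta, "w") as out:
        out.write(">a\natgc\n>b\natgc\n>c\natgc\n")
    print(count_header_lines(fasta, ">"))  # prints 3
```

For a GenBank file, pass "LOCUS" as the marker instead of ">".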

21 months ago

grep, sed, awk etc. Something like this:

$ cat test.fa
>a
atgc
>b
atgc
>c
atgc
$ awk '/>/ {a++} END {print "number of sequences in this file: " a}' test.fa
number of sequences in this file: 3

Yes, I just tried this command and it worked for determining the number of contigs per file (just change the file extension in either case):

Individual File:

$ grep -c "^>" Streptomyces_sp_12.fna

Multiple Files:

$ grep -c "^>" *.fna

21 months ago
Joe 19k

Easy in Biopython.

from Bio import SeqIO

# Read every record (one per contig) into a list, then count them.
recs = list(SeqIO.parse('genbank.gbk', 'genbank'))
print(len(recs))


Iterating over the records instead of building a list would use less memory, but this is a quick and easy way.
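For what it's worth, the iterator version might look something like the sketch below. It counts records as `SeqIO.parse` yields them, so only one record is held in memory at a time. The function name `count_records` and the demo file are hypothetical; it assumes Biopython is installed.

```python
import os
import tempfile

from Bio import SeqIO


def count_records(path, fmt):
    """Count records by consuming the SeqIO.parse iterator without storing them."""
    return sum(1 for _ in SeqIO.parse(path, fmt))


# Demo on a small throwaway FASTA file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "test.fa")
    with open(fasta, "w") as out:
        out.write(">a\natgc\n>b\natgc\n>c\natgc\n")
    print(count_records(fasta, "fasta"))  # prints 3
```

For a GenBank file, the call would be `count_records('genbank.gbk', 'genbank')`, using the same format name as above.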

This is likely a more robust solution too, since the *nix solutions require that you know your files well enough to be sure they don't have any nasty surprises in them.