Question: Counting Contigs in Fasta/GBK File
11 weeks ago, biohacker_tobe (40) wrote:

I want to check the number of contigs present in either a FASTA or GBK file. I am aware of tools such as CheckM that can do this; however, is there a direct way to count the contigs in a file with Python or Biopython?

checkm contigs genome • 146 views
modified 11 weeks ago by Joe (16k) • written 11 weeks ago by biohacker_tobe (40)

You can try the basic *nix utilities.

written 11 weeks ago by cpad0112 (12k)

Like with grep commands, etc.?

written 11 weeks ago by biohacker_tobe (40)

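An easy grep solution for counting the entries in a GenBank file is to count the LOCUS lines (anchored to the start of the line, so that the word appearing inside an annotation is not counted):

grep -c "^LOCUS" multigenbank.gb

For a multi-FASTA, you can use ^> instead of ^LOCUS, as you have noted.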

modified 11 weeks ago • written 11 weeks ago by Joe (16k)

11 weeks ago, cpad0112 (12k, India) wrote:

grep, sed, awk etc. Something like this:

$ cat test.fa 
>a
atgc
>b
atgc
>c
atgc

$ awk '/^>/ {a++} END {print "number of sequences in this file: " a+0}' test.fa
number of sequences in this file: 3
modified 11 weeks ago • written 11 weeks ago by cpad0112 (12k)

Yes, I just tried this command and it worked for determining the number of contigs per file (just change the file extension to match your files in both cases):

Individual file:

$ grep -c "^>" Streptomyces_sp_12.fna

Multiple files:

$ grep -c "^>" *.fna

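Since the original question asked for Python, the grep approach above can also be sketched with the standard library alone. This is only a rough equivalent: the function names `count_headers` and `count_in_file` are made up for illustration, and it shares grep's assumption that every record starts with a line beginning with `>` (or `LOCUS` for GenBank).

```python
from pathlib import Path


def count_headers(lines, prefix=">"):
    """Count the lines that start with the record prefix (like grep -c "^>")."""
    return sum(1 for line in lines if line.startswith(prefix))


def count_in_file(path, prefix=">"):
    """Count record header lines in one file, reading it line by line."""
    with open(path) as handle:
        return count_headers(handle, prefix)


if __name__ == "__main__":
    # Mirror `grep -c "^>" *.fna`: one "file:count" line per matching file
    for path in sorted(Path(".").glob("*.fna")):
        print(f"{path}:{count_in_file(path)}")
```

For a GenBank file, passing `prefix="LOCUS"` gives the same count as the anchored grep on LOCUS lines.
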
modified 11 weeks ago • written 11 weeks ago by biohacker_tobe (40)

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

written 11 weeks ago by genomax (80k)

11 weeks ago, Joe (16k, United Kingdom) wrote:

Easy in Biopython:

from Bio import SeqIO
# Parse every record (one per contig) into a list, then count them
recs = list(SeqIO.parse('genbank.gbk', 'genbank'))
print(len(recs))

This could be more memory efficient with an iterator, but this is a quick and easy way.

This is likely a more robust solution too, since the *nix solutions assume you know your files well enough to be sure they contain no nasty surprises.
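
The iterator variant mentioned above might look something like the sketch below. `SeqIO.parse` already returns an iterator, so counting over it directly avoids holding every record in memory at once. The helper name `count_records` is just for illustration, and the in-memory demo reuses the three-record FASTA from the awk example in this thread.

```python
from io import StringIO

from Bio import SeqIO


def count_records(source, fmt):
    """Count records in a file name or open handle without building a list."""
    # Each record is discarded as soon as it has been counted
    return sum(1 for _ in SeqIO.parse(source, fmt))


# Small in-memory FASTA so the example runs without any files on disk
demo = StringIO(">a\natgc\n>b\natgc\n>c\natgc\n")
print(count_records(demo, "fasta"))
```

With a real file, `count_records('genbank.gbk', 'genbank')` should return the same number as the `list` version.
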

modified 11 weeks ago • written 11 weeks ago by Joe (16k)