I'm trying to figure out the disparity between the 2 record counts when I'm using grep and this simple Biopython record counter script. I'm using the following gbk file: ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/H_sapiens/RNA/rna.gbk.gz
With grep -c "//" rna.gbk, I get 176432 records.
When using the following script, I get 176426 records.
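    from Bio import SeqIO
    import sys

    filename = sys.argv[1]  # first command-line argument is the input file
    count = 0

    # Choose the parser format from the file extension
    if filename.endswith('.gbk'):
        filetype = "genbank"
    elif filename.endswith('.fasta'):
        filetype = "fasta"
    else:
        sys.exit("Unrecognised file extension: " + filename)

    # Count every record the parser yields
    for record in SeqIO.parse(filename, filetype):
        count = count + 1

    print("There were " + str(count) + " records in file " + filename)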
So there's a 6 record difference between the 2 methods. Why is this?
grep -c ACCESSION rna.gbk gives 176426 records and so does
grep LOCUS rna.gbk | grep -c RNA
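One way to narrow this down, as a minimal sketch: grep -c "//" counts every line that contains // anywhere (for example inside a URL), whereas a GenBank record ends with a line that is exactly //. The helper below counts both kinds of line so the two numbers can be compared. The name count_slash_lines and the sample lines are made up for illustration, and whether this accounts for all 6 extra matches in rna.gbk is an assumption that would need checking against the file itself.

```python
def count_slash_lines(lines):
    """Return (lines containing '//', lines that are exactly '//').

    Hypothetical helper for illustration; not part of Biopython.
    """
    containing = 0
    terminators = 0
    for line in lines:
        if "//" in line:
            containing += 1
            # A GenBank record terminator is a line holding only '//'
            if line.rstrip("\r\n") == "//":
                terminators += 1
    return containing, terminators


# Made-up sample: one record whose comment happens to include a URL
sample = [
    "LOCUS       EXAMPLE_1\n",
    "COMMENT     See http://example.org for details\n",
    "//\n",
]
containing, terminators = count_slash_lines(sample)
print(containing, "lines contain //,", terminators, "are exactly //")
```

To run it on the real file, pass an open file handle instead of the sample list, e.g. count_slash_lines(open("rna.gbk")). On the command line, grep -c "^//" rna.gbk counts only lines that start with //.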