Why are there different record counts when using grep vs biopython's SeqIO.parse()?
0
0
Entering edit mode
5.0 years ago
ahtmatrix • 0

I'm trying to figure out the disparity between the 2 record counts when I'm using grep and this simple Biopython record counter script. I'm using the following gbk file: ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/H_sapiens/RNA/rna.gbk.gz

When using grep -c "//" rna.gbk, I get 176432 records. When using the following script, I get 176426 records.

from Bio import SeqIO
import sys
import os

filename = sys.argv[1]

count = 0

if filename.endswith('.gbk'):
    filetype = "genbank"
elif filename.endswith('.fasta'):
    filetype = "fasta"

for record in SeqIO.parse(filename, filetype):
    count = count + 1


print("There were " + str(count) + " records in file " + filename)

So there's a 6 record difference between the 2 methods. Why is this?

EDIT: using grep -c ACCESSION rna.gbk gives 176426 records and so does grep LOCUS rna.gbk | grep -c RNA

biopython grep SeqIO • 1.6k views
ADD COMMENT
1
Entering edit mode

Maybe the file has some duplicates? grep LOCUS rna.gbk | grep -c RNA will be consistent with BioPython.

ADD REPLY
0
Entering edit mode

how does that command work? its grabbing all LOCUS and then piping back to grep to grab RNA within LOCUS?

ADD REPLY
0
Entering edit mode

Yes, exactly like you said.

ADD REPLY

Login before adding your answer.

Traffic: 2524 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6