Guys, I've wrote a script to extract sequences between the 23S rRNA and 16S rRNA gene in Python using Biopython. I can't figure out why, even if I've set explicit conditions, it keeps extracting randome sequences into the genome. Here's the code:
for f in glob.glob(cwd+'/genomes/*.gbk'):
gbank = SeqIO.parse(open(f,'r'), "genbank")
with open(output_file, "w") as nfh:
for genome in gbank:
for gene in genome.features:
if gene.type=='rRNA':
if gene.location.strand == -1: # wanted just the reverse strand
if 'product' in gene.qualifiers:
if '16S' in gene.qualifiers['product'][0]:
end = gene.location.nofuzzy_start
elif '23S' in gene.qualifiers['product'][0]:
start = gene.location.nofuzzy_end
if abs(start - end) < 50000:
intergenic_sequence.append(">%s|ITS_full_sequence|%s\n%s" % (genome.description, gene.location.strand, genome.seq[start:end]))
I've set gene.type to rRNA and it extract a sequence into a random CDS, nothing to do with ribosomal RNA. This problem happens with several genomes tested (RefSeq records, well annotated).

You could give my code a try: https://github.com/jrjhealey/bioinfo-tools/blob/master/get_spans.py
I've not tested it with rRNA features (or much at all frankly).
It might be worth reading your files in an interactive python session. It's possible the 'key'
rRNAisn't what's exactly in the genbank file (though I'm clutching at straws there a bit).At a glance, I can't see any obvious issue with your code. Perhaps take a step back and remove the
if '16S'/if '23S'conditionals, and just ensure you're getting results you'd expect to get at each stage, before adding in another layer of complexity/filtering.Side note, you dont need to use
open()withSeqIO.parse- it handles all that internally.The key
rRNAseems to work if I want to extract a 16S rRNA gene. Every genome tested was open previously withartemiswhere the search by key gives the right output every time. By far, I don't get why other condition values true even if they're explicitly wrong. I'm looking for gene with16Sinside product qualifiers, how could be even possible that apilusassociated CDS was extracted? May this be a bug or something else?Can you provide a link to one of your input genbanks and I'll have a little play with the data myself.
Sure, here's one. Bifidobacterium bifidum PRL2010 (https://www.ncbi.nlm.nih.gov/assembly/GCA_000165905.1)
Another small comment, I assume its elsewhere in your script, but you haven't declared the vector
intergenic_sequencein that code snippet, so as posted, it would fail based on that alone.intergenic sequenceis a list declared before, not copied just to be short.