I guess my problem come from my poor programming skills, because logically it is quite straightforward.
I have a genbank file, containing few CDS and few other information on the Pfam domain identified in the encoded proteins. I would like to extract the intergenic region between specific genes.
considering the for genes: cg_23, cg_24, cg_25, cg_26 (indicate with the respective locus tag) I want to have the intergenic region between cdg_23 and cg_24, cg_24 and cg_25, cg_25 ad cg_26.
Using biopython I can extract the position of the CDS in my file.
for f in gb_record.features: if f.type == "CDS": print f.qualifiers["locus_tag"], f.location ['ctg47_159'] [883:1729](+) ['ctg47_160'] [1771:2932](-) ['ctg47_161'] [3001:3718](+) ['ctg47_162'] [3714:4194](+) ['ctg47_163'] [4256:4664](+) ['ctg47_164'] [4787:5657](+) ['ctg47_165'] [5740:7549](+) ['ctg47_166'] [7570:8908](-) .....
Now from here I should get the region between the genes; e.g. 1729-1771. Any idea how can move forward?
Someone else thought about this long ago. Here a really nice script:
It works quite good. I only want to modify the output in the way that I get the locus_tag in the output fasta. Any suggestion?
I added the following lines before the two 'for' loop that look in the two strands (modifying -1 to 1 for the other strand):
tag =  for f in seq_record.features: if f.type == "CDS" and f.strand == -1: tag.append(f.qualifiers["locus_tag"])
I then modified the id section as follow:
SeqRecord(intergene_seq,id="%s-%s" % (tag[i],tag[i+1]), description="%s %d-%d %s" % (seq_record.name, last_end+1, this_start,strand_string)))
And finally I changed the 'intergene_length' to -10 in hte function definition. In this way I can keep the consequentiality of the tag.
It is kind of dodgy but it does the job.