Question

[solved]Extract region between specific genes from genbank file

0

Entering edit mode

9.4 years ago

dago ★ 2.8k

I guess my problem come from my poor programming skills, because logically it is quite straightforward.

I have a genbank file, containing few CDS and few other information on the Pfam domain identified in the encoded proteins. I would like to extract the intergenic region between specific genes.

For example:

Considering the for genes: cg_23, cg_24, cg_25, cg_26 (indicate with the respective locus tag) I want to have the intergenic region between cdg_23 and cg_24, cg_24 and cg_25, cg_25 ad cg_26.

I looked many other similar posts (e.g. upstream and downstream regions extraction , Extract User Defined Region From An Fasta File), but I cannot really find a way to apply these tools/approaches to my case.

Any advice?

Using biopython I can extract the position of the CDS in my file.

for f in gb_record.features:
     if f.type == "CDS":
             print f.qualifiers["locus_tag"], f.location

['ctg47_159'] [883:1729](+)
['ctg47_160'] [1771:2932](-)
['ctg47_161'] [3001:3718](+)
['ctg47_162'] [3714:4194](+)
['ctg47_163'] [4256:4664](+)
['ctg47_164'] [4787:5657](+)
['ctg47_165'] [5740:7549](+)
['ctg47_166'] [7570:8908](-)

.....

Now from here I should get the region between the genes; e.g. 1729-1771. Any idea how can move forward?

EDIT

Someone else thought about this long ago. Here a really nice script: http://biopython.org/wiki/Intergenic_regions

It works quite good. I only want to modify the output in the way that I get the locus_tag in the output fasta. Any suggestion?

EDIT

I added the following lines before the two 'for' loop that look in the two strands (modifying -1 to 1 for the other strand):

tag = []
    for f in seq_record.features:
        if f.type == "CDS" and f.strand == -1:
          tag.append(f.qualifiers["locus_tag"])

I then modified the id section as follow:

SeqRecord(intergene_seq,id="%s-%s" % (tag[i],tag[i+1]),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))

And finally I changed the 'intergene_length' to -10 in hte function definition. In this way I can keep the consequentiality of the tag.

It is kind of dodgy but it does the job.

python genome biopython R • 3.0k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.4 years ago by dago ★ 2.8k

1

Entering edit mode

What is all this spam below?

ADD REPLY • link 9.4 years ago by dago ★ 2.8k

0

Entering edit mode

which organism ?

ADD REPLY • link 9.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Newly annotated bacterial genomes produced in my group.

ADD REPLY • link 9.4 years ago by dago ★ 2.8k

0

Entering edit mode

it's not clear to me what's your input (genbank with a "bacterial genomes produced in your group" ?) and what's your problem as your able to extract the locations of the genes with python.

ADD REPLY • link 9.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thanks for the answer. I have a genbank of bacterial genomes. I can extract the the location of the genes, but I do not manage to find a way to use them to extract the intergeneic region between specific genes.

ADD REPLY • link 9.4 years ago by dago ★ 2.8k