Question: [solved]Extract region between specific genes from genbank file
0
gravatar for dago
3.0 years ago by
dago2.5k
Germany
dago2.5k wrote:

I guess my problem come from my poor programming skills, because logically it is quite straightforward.

I have a genbank file, containing few CDS and few other information on the Pfam domain identified in the encoded proteins. I would like to extract the intergenic region between specific genes.

For example:

considering the for genes: cg_23, cg_24, cg_25, cg_26 (indicate with the respective locus tag) I want to have the intergenic region between cdg_23 and cg_24, cg_24 and cg_25, cg_25 ad cg_26.

I looked many other similar posts (e.g. upstream and downstream regions extraction , https://www.biostars.org/p/10469/), but I cannot really find a way to apply these tools/approaches to my case.

Any advise?

Using biopython I can extract the position of the CDS in my file.

for f in gb_record.features:
     if f.type == "CDS":
             print f.qualifiers["locus_tag"], f.location

['ctg47_159'] [883:1729](+)
['ctg47_160'] [1771:2932](-)
['ctg47_161'] [3001:3718](+)
['ctg47_162'] [3714:4194](+)
['ctg47_163'] [4256:4664](+)
['ctg47_164'] [4787:5657](+)
['ctg47_165'] [5740:7549](+)
['ctg47_166'] [7570:8908](-)

.....

Now from here I should get the region between the genes; e.g. 1729-1771. Any idea how can move forward?

 

EDIT

 

Someone else thought about this long ago. Here a really nice script:

http://biopython.org/wiki/Intergenic_regions.

It works quite good. I only want to modify the output in the way that I get the locus_tag in the output fasta. Any suggestion?

EDIT

I added the following lines before the two 'for' loop that look in the two strands (modifying -1 to 1 for the other strand):

tag = []
    for f in seq_record.features:
        if f.type == "CDS" and f.strand == -1:
          tag.append(f.qualifiers["locus_tag"])

I then modified the id section as follow:

SeqRecord(intergene_seq,id="%s-%s" % (tag[i],tag[i+1]),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))

And finally I changed the 'intergene_length' to -10 in hte function definition. In this way I can keep the consequentiality of the tag.

It is kind of dodgy but it does the job.

R biopython python genome • 1.1k views
ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by dago2.5k
1

What is all this spam below?

ADD REPLYlink written 3.0 years ago by dago2.5k

which organism ?

ADD REPLYlink written 3.0 years ago by Pierre Lindenbaum116k

Newly annotated bacterial genomes produced in my group.

ADD REPLYlink written 3.0 years ago by dago2.5k

it's not clear to me what's your input (genbank with a "bacterial genomes produced in your group" ?) and what's your problem as your able to extract the locations of the genes with python.

ADD REPLYlink written 3.0 years ago by Pierre Lindenbaum116k

Thanks for the answer. I have a genbank of bacterial genomes. I can extract the the location of the genes, but I do not manage to find a way to use them to extract the intergeneic region between specific genes.

ADD REPLYlink written 3.0 years ago by dago2.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 640 users visited in the last hour