Question: Synteny Diagram In Biopython
gravatar for Pappu
5.1 years ago by
Pappu1.9k wrote:

I am trying to compare the synteny of a homologous gene of a set of embl files. I wrote the following script in Biopython which does not work when the protein is at the beginning/end of the plasmid: I only to display the region of interest and not the whole gene. For example I would like to get something like this:

-> -> -> => -> -> ->

which works fine when the protein/gen => is in the middle of the genome but if it in the end of the genome I get

 -> -> ->                                                                                                                                                   -> -> ->  =>

And the length of arrows become very short since I am plotting the whole genome.

If the gene is at the beginning of the genome I obtain

=> -> -> ->                                                                                                                                            -> -> ->

So it does not take into account the circular nature of the plasmid. program output

from Bio.SeqIO import parse 
from Bio.Graphics.GenomeDiagram import Diagram 

n=1    # name of each track 
cds=[] # index of all cds 
pro=[] # index of target genes

# protein ids to look for
id0=set( 'ACK98665.1 ACK98642.1 ACK98513.1'.split() )


for rec in parse('CP001187.embl', 'embl'):
    for j,i in enumerate(rec.features):      
        if i.type=='CDS':
    # index of all cds
    # index of protein ids
            if p in i.qualifiers and i.qualifiers[p][0] in id0:          
    print pro, cds
    for j in pro: # for each protein
        gdfs=gdd.new_track(2, start=0,greytrack=True).new_set() 
        for i in range(-3,4):
        if idx>=len(cds): # locus of cds 
                locus = cds[idx - len(cds) ] 
            locus = cds[idx]


gdd.draw( circular=0,     format='linear',fragments=1,)
gdd.write('test'+str(n)+'.pdf', 'PDF')
python biopython • 2.0k views
ADD COMMENTlink modified 4.0 years ago by Biostar ♦♦ 20 • written 5.1 years ago by Pappu1.9k

Showing a (hand drawn sketch) of the desired output would help. What do you mean by does not work? I suspect this is a bug in your code in the complicated list indexing (what is all this i/j stuff meant to achieve?). Why not add all the CDS features to the track immediately within the for loop over rec.features? Also there is an indentation problem with the loop for i in pro: where you might be trying to make a new track for each protein?

ADD REPLYlink written 5.1 years ago by Peter5.8k

I updated the question, the code and the output. I only want to show the CDS +/-3 of the genes of interest. So I first make the index of CDS and genes of interest. Copy paste from gedit seem to mess up the indentions.

ADD REPLYlink written 5.1 years ago by Pappu1.9k

No typos in the title also helps :) Fixed systeny -> synteny

ADD REPLYlink written 5.1 years ago by Neilfws48k
gravatar for xbello
4.9 years ago by
xbello20 wrote:

I get to this (ugly) hack, shifting the Feature locations before adding them to the graph. The code is rather long, and for sure it has much to polish. Starting from line 15:

features_to_add = []
for rec in parse('CP001187.embl', 'embl'):

    for j, i in enumerate(rec.features):
        if i.type == 'CDS':
            # index of all cds
        # index of protein ids
        if p in i.qualifiers and i.qualifiers[p][0] in id0:

    for j in pro: # for each protein
        for i in range(-3, 4):
            idx = cds.index(j) + i
        if idx >= len(cds): # locus of cds
            locus = cds[idx - len(cds)]
            locus = cds[idx]

        n += 1

def shift_segments(ranges_list, genome_size):
    starts = [x[0] for x in ranges_list]
    ends = [x[1] for x in ranges_list]

    segments = dict.fromkeys(starts, 0)

    for start in starts:
        for end in ends:
            if end > start:
                segments[start] += end - start
                # We have passed the ORI
                # Add the tail to the segment and restart
                segments[start] = (genome_size - start) + end

    # Detect the shortest start
    for segment_start, size in segments.items():
        if size == min(segments.values()):
            shortest_start = segment_start
            shift_size = genome_size - shortest_start

    for segment in ranges_list:
        if segment[0] == shortest_start:
            # The shortest segment get shifted to zero
            segments[segment[0]] = (-segment[0] + 1)
            # Others are shifted to start segment + (genome_size - start)
            segments[segment[0]] = (shift_size + segment[0])

    return segments

genome_size = len(rec.seq)

ranges = [
    (int(x.location.start), int(x.location.end)) for x in features_to_add]

shifted_coords = shift_segments(ranges, genome_size)

shifted_segments = []
for old_start, shift in shifted_coords.items():
    for feature in features_to_add:
        if feature.location.start == old_start:

for segment in shifted_segments:
    gdfs = gdd.new_track(n, start=0, greytrack=True).new_set()
    gdfs.add_feature(segment, sigil="ARROW")

gdd.draw(circular=0, format="linear", fragments=1)
gdd.write('test' + str(n) + '.pdf', 'PDF')

Basically, I'm extracting the Features from the genome file, but instead of immediately add them to the graphics:

  1. Save the interesting Features to a list.
  2. Calculate the shortest segment that includes them all (function shift_segments) and return a shiftting value.
  3. Shift the FeatureLocation of each CDS using a private method (not intended to do this!).
  4. Finally add these shifted features to the graphic.
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by xbello20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1198 users visited in the last hour