Synteny Diagram In Biopython
1
1
Entering edit mode
10.1 years ago
Pappu ★ 2.1k

I am trying to compare the synteny of a homologous gene of a set of embl files. I wrote the following script in Biopython which does not work when the protein is at the beginning/end of the plasmid: http://www.ebi.ac.uk/ena/data/view/CP001187&display=text I only to display the region of interest and not the whole gene. For example I would like to get something like this:

-> -> -> => -> -> ->

which works fine when the protein/gen => is in the middle of the genome but if it in the end of the genome I get

 -> -> ->                                                                                                                                                   -> -> ->  =>

And the length of arrows become very short since I am plotting the whole genome.

If the gene is at the beginning of the genome I obtain

=> -> -> ->                                                                                                                                            -> -> ->

So it does not take into account the circular nature of the plasmid. program output

from Bio.SeqIO import parse 
from Bio.Graphics.GenomeDiagram import Diagram 

n=1    # name of each track 
cds=[] # index of all cds 
pro=[] # index of target genes

# protein ids to look for
id0=set( 'ACK98665.1 ACK98642.1 ACK98513.1'.split() )

gdd=Diagram()
p='protein_id'

for rec in parse('CP001187.embl', 'embl'):
    for j,i in enumerate(rec.features):      
        if i.type=='CDS':
    # index of all cds
        cds.append(j)
    # index of protein ids
            if p in i.qualifiers and i.qualifiers[p][0] in id0:          
                pro.append(j) 
    print pro, cds
    for j in pro: # for each protein
        gdfs=gdd.new_track(2, start=0,greytrack=True).new_set() 
        for i in range(-3,4):
        idx=cds.index(j)+i
        if idx>=len(cds): # locus of cds 
                locus = cds[idx - len(cds) ] 
        else:
            locus = cds[idx]

            gdfs.add_feature(rec.features[locus],sigil="ARROW")
        n+=1 

gdd.draw( circular=0,     format='linear',fragments=1,)
gdd.write('test'+str(n)+'.pdf', 'PDF')
biopython python • 3.7k views
ADD COMMENT
0
Entering edit mode

Showing a (hand drawn sketch) of the desired output would help. What do you mean by does not work? I suspect this is a bug in your code in the complicated list indexing (what is all this i/j stuff meant to achieve?). Why not add all the CDS features to the track immediately within the for loop over rec.features? Also there is an indentation problem with the loop for i in pro: where you might be trying to make a new track for each protein?

ADD REPLY
0
Entering edit mode

I updated the question, the code and the output. I only want to show the CDS +/-3 of the genes of interest. So I first make the index of CDS and genes of interest. Copy paste from gedit seem to mess up the indentions.

ADD REPLY
0
Entering edit mode

No typos in the title also helps :) Fixed systeny -> synteny

ADD REPLY
0
Entering edit mode
9.9 years ago
xbello ▴ 20

I get to this (ugly) hack, shifting the Feature locations before adding them to the graph. The code is rather long, and for sure it has much to polish. Starting from line 15:

features_to_add = []
for rec in parse('CP001187.embl', 'embl'):

    for j, i in enumerate(rec.features):
        if i.type == 'CDS':
            # index of all cds
            cds.append(j)
        # index of protein ids
        if p in i.qualifiers and i.qualifiers[p][0] in id0:
            pro.append(j)

    for j in pro: # for each protein
        for i in range(-3, 4):
            idx = cds.index(j) + i
        if idx >= len(cds): # locus of cds
            locus = cds[idx - len(cds)]
        else:
            locus = cds[idx]

        features_to_add.append(rec.features[locus])
        n += 1

def shift_segments(ranges_list, genome_size):
    starts = [x[0] for x in ranges_list]
    ends = [x[1] for x in ranges_list]

    segments = dict.fromkeys(starts, 0)

    for start in starts:
        for end in ends:
            if end > start:
                segments[start] += end - start
            else:
                # We have passed the ORI
                # Add the tail to the segment and restart
                segments[start] = (genome_size - start) + end

    # Detect the shortest start
    for segment_start, size in segments.items():
        if size == min(segments.values()):
            shortest_start = segment_start
            shift_size = genome_size - shortest_start

    for segment in ranges_list:
        if segment[0] == shortest_start:
            # The shortest segment get shifted to zero
            segments[segment[0]] = (-segment[0] + 1)
            # Others are shifted to start segment + (genome_size - start)
        else:
            segments[segment[0]] = (shift_size + segment[0])

    return segments

genome_size = len(rec.seq)

ranges = [
    (int(x.location.start), int(x.location.end)) for x in features_to_add]

shifted_coords = shift_segments(ranges, genome_size)

shifted_segments = []
for old_start, shift in shifted_coords.items():
    for feature in features_to_add:
        if feature.location.start == old_start:
            shifted_segments.append(
                feature._shift(shifted_coords[feature.location.start]))

for segment in shifted_segments:
    gdfs = gdd.new_track(n, start=0, greytrack=True).new_set()
    gdfs.add_feature(segment, sigil="ARROW")

gdd.draw(circular=0, format="linear", fragments=1)
gdd.write('test' + str(n) + '.pdf', 'PDF')

Basically, I'm extracting the Features from the genome file, but instead of immediately add them to the graphics:

  1. Save the interesting Features to a list.
  2. Calculate the shortest segment that includes them all (function shift_segments) and return a shiftting value.
  3. Shift the FeatureLocation of each CDS using a private method (not intended to do this!).
  4. Finally add these shifted features to the graphic.
ADD COMMENT

Login before adding your answer.

Traffic: 1564 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6