Question

Python Script To Format Blast Output To Get Number Of Duplicated Genes

0

Entering edit mode

10.5 years ago

armahmud3 • 0

How to format blastn output file to get number of duplicated genes within a genome. Output file will be number of duplicated genes at various % of sequence identity ( 0 to 100%, increment of 10). I need a python script to write this and output file should consist 10 lines and save as csv file.

Example of output file % sequence identity and gene alignment (before comma) and number of duplicated genes (after comma)

[0,13456] [10,234] [20,234] [30,200] [40,190] [50,187] [60,187] [70,100] [80,95 ] [90,55] [100,45]

please kindly help me. Thanks in advance.

python genes • 5.5k views

ADD COMMENT • link updated 10.5 years ago by David W 4.9k • written 10.5 years ago by armahmud3 • 0

0

Entering edit mode

I remember that you asked similar question earlier. Could you please provide some examples input file ?

ADD REPLY • link 10.5 years ago by always_learning ★ 1.1k

0

Entering edit mode

I am in a serious problem regarding this python script. Don't get solution yet and this is why I am asking repeatedly for help in this forum. Input file should be a single genome file in genbank format like NC_011000.gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Burkholderia_cenocepacia_J2315_uid57953/ I am a beginner in programing language. I try to solve this problem by reading biopython tutorial , cookbook. unfortunately I did not write the script yet.

ADD REPLY • link 10.5 years ago by armahmud3 • 0

0

Entering edit mode

Why you want to start task from .gbk files ? Why not .fasta files ? Just wanna reconfirm that you were saying about BLASTN ?

ADD REPLY • link 10.5 years ago by always_learning ★ 1.1k

score 0 · Answer 1 · 2013-11-02

I presume you are trying to BLAST a set of genome sequences against themselves, and find apparent paralogs?

In that case you first need to run blastn locally, NCBI gives you loads of documentation on this. If you want to use Biopython to do the processing then have the blast executable spit out an xml file (-outfmt= 5).

You should read the Biopython tuotrial section on BLAST parsing to get an idea of what each blast record is representing. But to get results on the top hit for each record you could do something like

from Bio.Blast import NCBIXML

recs = NCBIXML.parse(open("path_to_your_blast.xml"))

def score_top_hit(rec):  
    """ """                               
    if rec.alignments:
        top_hit = rec.alignments[0].hsps[0]
        score =  float(top_hit.identities)/ top_hit.align_length
    else:
        score = 0.0
    return score

  scores = [score_top_hit(r) for r in recs]

There is almost certainly a more memory/time efficient way of getting the number of sequences above a given cut than this, but it'll work

cuts = [i/100.0 for i in range(0,110, 10)] #hacky way of making a range of floats - must be nicer way!
zip(cuts,  [ sum([c > s for s in scores]) for c in cuts])

EDIT to cover some of armahmud3's comments below

Hi armahmud3. I gather this is a homework question. It's great that your teacher encourages you to seek help form others, and Biostar can certainly help you learn bioinformatics. But I don't think anyone is served by providing with a "copy paste" a solution to your problem. Instead, I'll give you some pointers about how to attack the problem.

From your problem statement: you want to BLAST a set of gene sequences against each other, collect some information form the resulting output, and summarise it in a .csv file. Although Biopython does have "wrapper" scripts to help you run BLAST locally, you don't need to use them. Instead you can simply use the BLAST executables form the commandline. The NCBI's documentation of the BLAST programmes is very good. You'll want to download your sequences, use makeblastdb to create a database and blastn to run your search.

The code I gave you above will count up the number of sequences where the "top hit" is above or below each threshold. You should also pay around a bit with the BLAST parser, to get a feel of what each record represents. In an interactive session, check out:

>>> recs = NCBIXML.parse(open("path_to_your_blast.xml"))
>>> r = recs.nex()
>>> dir(r)
>>> r.alignments

Finally you want to spit the results out as .csv file. Read up on python input and output handling to see how to do this.