Question: parsing too large blast result with bioperl OR other methods?
gravatar for J.F.Jiang
5.7 years ago by
J.F.Jiang820 wrote:

Hi all,

Recently I am dealing with bunch of genes to design the appropriate primers.

However, it is still hard for me to obtain the homology information of the primers.

For example, I need to design a pair primers for one exon of the gene.

I firstly get all possible primers with predefined length, e.g. 18-30 bps, and then use blastall -p blastn (or megablast) with -e 1 -W 8 to determine whether the primers have homogenous seqs. However, for those >10000 primers, the blast out file was larger than 200M, which requires longer time to parse using Bio::SearchIO module. And sometimes even crash the memory. Moreover, blasting those primer seqs within 18-30 bps are danger because shorter seqs will sometimes fail due to unkown reasons.

Another method is to blast the whole exon regions with parameter -e 0.1 -W 11, however, it will generate huge output and it will take long time to parse the blast file, and to determine whether the primer region falls into homologous part.


Till now, I have not obtained any good method to fix such problem.

If anyone experienced such issue, can you plz tell me how?





Although we could firstly define those nts belong to repeat regions using repeatMasker,

and then use -F parameter in blast to neglect these reigons, those repeat regions, however, will sometimes do not share too much homologous sequences.

This is the method that I can find now, but is not perfect.

Hope someone could provide some suggestions to better improve the results.

blast bioperl • 1.5k views
ADD COMMENTlink modified 5.7 years ago by Bioinformatics_NewComer320 • written 5.7 years ago by J.F.Jiang820
gravatar for Bioinformatics_NewComer
5.7 years ago by
Genomic Island
Bioinformatics_NewComer320 wrote:

Like Istvan Albert said, have your output in a tabular format. Your blast calling command should somewhat look like:


blastp -query $file_1 -db $file_2 -out outFile -outfmt 6 qseqid sseqid  bitscore evalue pident

blastp or blastn depending on your need.


ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by Bioinformatics_NewComer320


the -outfmt parameter can bring the outfile in tab format, however, I can not use Bioperl to parse the file to get to know which nts are identical to the input sequence.

The tabular format will only give how many nts are identical, how many gaps have been inserted. Although generally we could know the query start~end and the hit start~end, the can not be exactly be used to show the identity of part of the input sequence, e.g. from position 20-38.


ADD REPLYlink written 5.7 years ago by J.F.Jiang820

the tabular output has many options and you can customize it to output other fields. For example qlen produces the query length, length produces the alignment length, qseq shows the aligned part of the query sequence etc. that may or may not be sufficient for what you are looking for 

But if you need all details of the alignment then it won't work.

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Istvan Albert ♦♦ 84k

Yes, all I want to know is which nts within the query sequence are exactly same to the hit seqs, which is same as the output of using Bio::SearchIO to get the identical positions,


my @identical_pos=$hsp->seq_inds('query','identical');

ADD REPLYlink written 5.7 years ago by J.F.Jiang820
gravatar for Istvan Albert
5.7 years ago by
Istvan Albert ♦♦ 84k
University Park, USA
Istvan Albert ♦♦ 84k wrote:

The default blast output is overly complicated. I would recommend to run blast directly and have it produce a simpler tabular format. 

The file size will be a small fraction of what you get now.


ADD COMMENTlink written 5.7 years ago by Istvan Albert ♦♦ 84k
gravatar for Pavel Senin
5.7 years ago by
Pavel Senin1.9k
Los Alamos, NM
Pavel Senin1.9k wrote:

Maybe you can split your primers file in parts, then blast and parse those having all outputs combined for the final decision? 

ADD COMMENTlink written 5.7 years ago by Pavel Senin1.9k


Good idea,


1)I have 200 gene to be dealed with the primers design, that's a tough work.

2)the critical thing that make the parsing process too slowly is due to the huge blast out file. And this huge file may be only due to some specific regions. So split the primers in parts to run in different pids may be not the best solution.


ADD REPLYlink written 5.7 years ago by J.F.Jiang820
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2255 users visited in the last hour