Question

parsing too large blast result with bioperl OR other methods?

0

Entering edit mode

9.6 years ago

J.F.Jiang ▴ 910

Hi all,

Recently I am dealing with bunch of genes to design the appropriate primers.

However, it is still hard for me to obtain the homology information of the primers.

For example, I need to design a pair primers for one exon of the gene.

I firstly get all possible primers with predefined length, e.g. 18-30 bps, and then use blastall -p blastn (or megablast) with -e 1 -W 8 to determine whether the primers have homogenous seqs. However, for those >10000 primers, the blast out file was larger than 200M, which requires longer time to parse using Bio::SearchIO module. And sometimes even crash the memory. Moreover, blasting those primer seqs within 18-30 bps are danger because shorter seqs will sometimes fail due to unkown reasons.

Another method is to blast the whole exon regions with parameter -e 0.1 -W 11, however, it will generate huge output and it will take long time to parse the blast file, and to determine whether the primer region falls into homologous part.

Till now, I have not obtained any good method to fix such problem.

If anyone experienced such issue, can you plz tell me how?

Thanks.

2014.9.2

Although we could firstly define those nts belong to repeat regions using repeatMasker,

and then use -F parameter in blast to neglect these regions, those repeat regions, however, will sometimes do not share too much homologous sequences.

This is the method that I can find now, but is not perfect.

Hope someone could provide some suggestions to better improve the results.

blast bioperl • 2.3k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by J.F.Jiang ▴ 910

Ram · Answer 1 · 2014-09-13

1

Entering edit mode

9.6 years ago

Bioinformatics_NewComer ▴ 330

Like Istvan Albert said, have your output in a tabular format. Your blast calling command should somewhat look like:

blastp -query $file_1 -db $file_2 -out outFile -outfmt 6 qseqid sseqid  bitscore evalue pident

blastp or blastn depending on your need.

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Bioinformatics_NewComer ▴ 330

0

Entering edit mode

Thanks,

The -outfmt parameter can bring the outfile in tab format, however, I can not use Bioperl to parse the file to get to know which nts are identical to the input sequence.

The tabular format will only give how many nts are identical, how many gaps have been inserted. Although generally we could know the query start~end and the hit start~end, the can not be exactly be used to show the identity of part of the input sequence, e.g. from position 20-38.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by J.F.Jiang ▴ 910

0

Entering edit mode

The tabular output has many options and you can customize it to output other fields. For example qlen produces the query length, length produces the alignment length, qseq shows the aligned part of the query sequence etc. that may or may not be sufficient for what you are looking for

But if you need all details of the alignment then it won't work.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Istvan Albert 100k

0

Entering edit mode

Yes, all I want to know is which nts within the query sequence are exactly same to the hit seqs, which is same as the output of using Bio::SearchIO to get the identical positions,

my @identical_pos=$hsp->seq_inds('query','identical');

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by J.F.Jiang ▴ 910

Ram · Answer 2 · 2014-09-12

0

Entering edit mode

9.6 years ago

Istvan Albert 100k

The default blast output is overly complicated. I would recommend to run blast directly and have it produce a simpler tabular format.

The file size will be a small fraction of what you get now.

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Istvan Albert 100k

Ram · Answer 3 · 2014-09-12

0

Entering edit mode

9.6 years ago

Pavel Senin ★ 1.9k

Maybe you can split your primers file in parts, then blast and parse those having all outputs combined for the final decision?

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Pavel Senin ★ 1.9k

0

Entering edit mode

Thanks,

Good idea,

However,

I have 200 gene to be dealed with the primers design, that's a tough work.
The critical thing that make the parsing process too slowly is due to the huge blast out file. And this huge file may be only due to some specific regions. So split the primers in parts to run in different pids may be not the best solution.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by J.F.Jiang ▴ 910