Question: How To Convert Blast Results To Gff
16
gravatar for Michael Barton
8.4 years ago by
Michael Barton1.8k
Akron, Ohio, United States
Michael Barton1.8k wrote:

I'd like to visualise the results of a BLAST search in a genome browser. Is there a simple way to get the results in GFF format without having to write a parser myself?

gff format blast • 16k views
ADD COMMENTlink modified 2.5 years ago by cmdcolin780 • written 8.4 years ago by Michael Barton1.8k
7
gravatar for Darked89
8.4 years ago by
Darked894.2k
Barcelona, Spain
Darked894.2k wrote:

Start with tabulated blast output myfile.blast.out. Then check two-liners from:

http://bergman-lab.blogspot.com/2009/12/ncbi-blast-tabular-output-format-fields.html

Few lines tooutput proper gff are missing, but you may either go for minimalistic gff or try to encode everything in column 9. Also you may try validating your gff3 here:

http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online

NOTE: The blog linked above does not seem to exist anymore, here is the content of it from the wayback machine:

NCBI Blast Tabular output format fields

Certainly, with the new NCBI Blast+ tools, you won't need this anymore, but as long as we are sticking with the old blastall programm with its horrible documentation, I keep forgetting the format of the BLAST tabular reports. Tabular format is created when you specify "-m 8". This is the most useful format for parsing blast yourself without having to learn strange libraries like BioPerl, BioJava, BioPython or BioErlang (doesn't this exist yet, Mike?)

So here is the meaning of the fields:

queryId, subjectId, percIdentity, alnLength, mismatchCount, 
gapOpenCount, queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore

Parsing is then simple

Python:

for line in open("myfile.blast"):
    (queryId, subjectId, percIdentity, alnLength, mismatchCount, 
    gapOpenCount, queryStart, queryEnd, subjectStart, 
    subjectEnd, eVal, bitScore) = line.split("\t")

Perl:

while (<>) {
   ($queryId, $subjectId, $percIdentity, $alnLength, $mismatchCount, 
   $gapOpenCount, $queryStart, $queryEnd, $subjectStart, 
   $subjectEnd, $eVal, $bitScore) = split(/\t/)
}
ADD COMMENTlink modified 5.9 years ago by Istvan Albert ♦♦ 77k • written 8.4 years ago by Darked894.2k
2

The blog post referred to above has moved to:

http://bergmanlab.smith.man.ac.uk/?p=41

ADD REPLYlink written 7.8 years ago by Casey Bergman17k

Updated again :) http://bergmanlab.genetics.uga.edu/?p=41

ADD REPLYlink written 14 months ago by cmdcolin780

Using the examples this was relatively simple to do. The GFF3 validator helped too.

ADD REPLYlink written 8.4 years ago by Michael Barton1.8k
5
gravatar for Michael Dondrup
8.4 years ago by
Bergen, Norway
Michael Dondrup44k wrote:

http://jperl.googlecode.com/svn-history/r16/trunk/Blast2Gff.pl

You can use the script Pierre found swith a slight modification, actually it is a bit crude and does no real error checking but it works. The error is it does not work if the blast file has a header like this:

# BLASTN 2.2.21 [Jun-14-2009]
# Query: 16383 sequences
# Database: genomedata/GenomeOfDeath.fas
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
GeneOfDeath  GenomeOfDeath  100.00  295     0       0       1       295     152626  152920  4e-168   585

So, one should filter out lines beginning with "#" and it does no harm to skip lines which are empty or contain only white spaces.

So edit the file Blast2Gff.pl: in line 149 add:

 next if (/^\#/ || /^\s*$/); # filter comments and empty lines

Such that this part looks like below, then try again.

while (<BLASTIN>)
{

  next if (/^\#/ || /^\s*$/); # filter comments and empty lines

$HitNum++;

my ($QryId, $SubId, $PID, $Len, 
    $MisMatch, $GapOpen, 
    $QStart,$QEnd, $SStart, $SEnd,
    $EVal, $BitScore) = split(/\t/);
ADD COMMENTlink modified 8.4 years ago • written 8.4 years ago by Michael Dondrup44k
1
gravatar for Giovanni M Dall'Olio
8.4 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

Have you tried these scripts: http://gmod.org/wiki/Load_BLAST_Into_Chado, http://www.bioperl.org/pipermail/bioperl-l/2002-November/010223.html ??

maybe the PSL format is better to represent an alignment. You can also look at the BED format so later you can play with BedTools

ADD COMMENTlink written 8.4 years ago by Giovanni M Dall'Olio26k

I don't have bioperl on my computer and the documentation states that installing bioperl is "non-trivial" which is not encouraging. Perhaps it would be simpler just to write a parser myself.

ADD REPLYlink written 8.4 years ago by Michael Barton1.8k
1
gravatar for Malcolm.Cook
2.6 years ago by
Malcolm.Cook800
kansas, usa
Malcolm.Cook800 wrote:

If your genome browser of choice displays psl tracks (as does IGV, wink wink), the save your blast results as XML and convert them in two steps to psl using the Jim Kent utility blastXmlToPsl

Here's an example call:

blastXmlToPsl -pslx -tName=Hit_id -qName=query-def0  blast_hits.xml   blast_hits.psl  

Else, take it another step, and generate a .bed, with 

pstToBed blast_hits.psl blast_hits.bed

 

ADD COMMENTlink written 2.6 years ago by Malcolm.Cook800
0
gravatar for nuria
4.6 years ago by
nuria0
nuria0 wrote:

Hi! I tried something similar to what Darked89 explained, but I have a problem:

The subject end is not always larger than the subject start (plus/minus), and bedtools complains:

"Error: malformed BED entry at line 3. Start was greater than end. Exiting."

Does anyone know a way to change start and end values when end is smaller?

Thank you

ADD COMMENTlink written 4.6 years ago by nuria0

This should be rather a comment, not an answer to the question, to solve the issue, swap start and stop if start > stop: by some code like: ($start, $stop) = sort ($start, stop)

ADD REPLYlink written 4.6 years ago by Michael Dondrup44k
0
gravatar for nuria
4.6 years ago by
nuria0
nuria0 wrote:

I solved my problem (wasn't too complicate) so I'm posting my solution here. In case someonle else want to turn a blast tabular output into a simple bed file, but has problems with subject start values larger than end values.

I mean turn this:

scaffold01109 994 1071 HWUSI-EAS1662:3:114:11269:10361#5/2

scaffold01109 1018 1071 HWUSI-EAS1662:3:1:4638:10404#5/1

scaffold01109 1071 1009 HWUSI-EAS1662:3:23:15834:10646#5/1

scaffold01109 2289 2245 HWUSI-EAS1662:3:24:9622:2914#5/1

(I got this file doing: cat blast+tabularfile.input | awk '{print $2,$9,$10,$1}'> almostbedfile.output)

into this:

scaffold01109 994 1071 HWUSI-EAS1662:3:114:11269:10361#5/2

scaffold01109 1018 1071 HWUSI-EAS1662:3:1:4638:10404#5/1

scaffold01109 1009 1071 HWUSI-EAS1662:3:23:15834:10646#5/1

scaffold01109 2245 2289 HWUSI-EAS1662:3:24:9622:2914#5/1

cat input.tsv | awk '{if ($3<$2) print $1,$3,$2,$4; else if ($3>$2)print $0}' > output.bed

But remember to sort it before use it!

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by nuria0
0
gravatar for cmdcolin
2.5 years ago by
cmdcolin780
United States
cmdcolin780 wrote:

The tool bp_search2gff.pl can do this fairly well.

Install bioperl (`cpanm --notest BioPerl`)

Then run blast and the bp_search2gff.pl

    blastn  -query query.fa -db blast.fa > blast.txt
    bp_search2gff.pl --input blast.txt --addid --version 3 --type hit

 

If you use the --match option it can group HSP too

ADD COMMENTlink written 2.5 years ago by cmdcolin780
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 558 users visited in the last hour