How To Convert Blast Results To Gff
7
21
Entering edit mode
14.1 years ago
Michael Barton ★ 1.9k

I'd like to visualise the results of a BLAST search in a genome browser. Is there a simple way to get the results in GFF format without having to write a parser myself?

blast gff format • 25k views
ADD COMMENT
8
Entering edit mode
14.1 years ago
Darked89 4.6k

Start with tabulated blast output myfile.blast.out. Then check two-liners from:

http://bergman-lab.blogspot.com/2009/12/ncbi-blast-tabular-output-format-fields.html

Few lines tooutput proper gff are missing, but you may either go for minimalistic gff or try to encode everything in column 9. Also you may try validating your gff3 here:

http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online

NOTE: The blog linked above does not seem to exist anymore, here is the content of it from the wayback machine:

NCBI Blast Tabular output format fields

Certainly, with the new NCBI Blast+ tools, you won't need this anymore, but as long as we are sticking with the old blastall programm with its horrible documentation, I keep forgetting the format of the BLAST tabular reports. Tabular format is created when you specify -m 8. This is the most useful format for parsing blast yourself without having to learn strange libraries like BioPerl, BioJava, BioPython or BioErlang (doesn't this exist yet, Mike?)

So here is the meaning of the fields:

queryId, subjectId, percIdentity, alnLength, mismatchCount, 
gapOpenCount, queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore

Parsing is then simple

Python:

for line in open("myfile.blast"):
    (queryId, subjectId, percIdentity, alnLength, mismatchCount, 
    gapOpenCount, queryStart, queryEnd, subjectStart, 
    subjectEnd, eVal, bitScore) = line.split("\t")

Perl:

while (<>) {
   ($queryId, $subjectId, $percIdentity, $alnLength, $mismatchCount, 
   $gapOpenCount, $queryStart, $queryEnd, $subjectStart, 
   $subjectEnd, $eVal, $bitScore) = split(/\t/)
}
ADD COMMENT
2
Entering edit mode

The blog post referred to above has moved to:

http://bergmanlab.smith.man.ac.uk/?p=41

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

Using the examples this was relatively simple to do. The GFF3 validator helped too.

ADD REPLY
5
Entering edit mode
14.1 years ago
Michael 54k

http://jperl.googlecode.com/svn-history/r16/trunk/Blast2Gff.pl

You can use the script Pierre found swith a slight modification, actually it is a bit crude and does no real error checking but it works. The error is it does not work if the blast file has a header like this:

# BLASTN 2.2.21 [Jun-14-2009]
# Query: 16383 sequences
# Database: genomedata/GenomeOfDeath.fas
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
GeneOfDeath  GenomeOfDeath  100.00  295     0       0       1       295     152626  152920  4e-168   585

So, one should filter out lines beginning with "#" and it does no harm to skip lines which are empty or contain only white spaces. So edit the file Blast2Gff.pl: in line 149 add:

next if (/^\#/ || /^\s*$/); # filter comments and empty lines

Such that this part looks like below, then try again.

while (<BLASTIN>)
{
  next if (/^\#/ || /^\s*$/); # filter comments and empty lines

$HitNum++;

my ($QryId, $SubId, $PID, $Len, 
    $MisMatch, $GapOpen, 
    $QStart,$QEnd, $SStart, $SEnd,
    $EVal, $BitScore) = split(/\t/);
ADD COMMENT
3
Entering edit mode
8.3 years ago
cmdcolin ★ 3.8k

The tool bp_search2gff.pl can do this fairly well.

Install bioperl (cpanm --notest BioPerl)

Then run blast and the bp_search2gff.pl

blastn -query query.fa -db blast.fa > blast.txt
bp_search2gff.pl --input blast.txt --addid --version 3 --type hit

If you use the --match option it can group HSP too

ADD COMMENT
1
Entering edit mode
14.1 years ago

Have you tried these scripts??

Maybe the PSL format is better to represent an alignment. You can also look at the BED format so later you can play with BedTools

ADD COMMENT
0
Entering edit mode

I don't have bioperl on my computer and the documentation states that installing bioperl is "non-trivial" which is not encouraging. Perhaps it would be simpler just to write a parser myself.

ADD REPLY
1
Entering edit mode
8.4 years ago
Malcolm.Cook ★ 1.5k

If your genome browser of choice displays psl tracks (as does IGV, wink wink), the save your blast results as XML and convert them in two steps to psl using the Jim Kent utility blastXmlToPsl

Here's an example call:

blastXmlToPsl -pslx -tName=Hit_id -qName=query-def0  blast_hits.xml   blast_hits.psl

Else, take it another step, and generate a .bed, with

pstToBed blast_hits.psl blast_hits.bed
ADD COMMENT
0
Entering edit mode
10.4 years ago
nuria • 0

Hi! I tried something similar to what Darked89 explained, but I have a problem:

The subject end is not always larger than the subject start (plus/minus), and bedtools complains:

"Error: malformed BED entry at line 3. Start was greater than end. Exiting."

Does anyone know a way to change start and end values when end is smaller?

Thank you

ADD COMMENT
0
Entering edit mode

This should be rather a comment, not an answer to the question, to solve the issue, swap start and stop if start > stop: by some code like: ($start, $stop) = sort ($start, stop)

ADD REPLY
0
Entering edit mode
10.4 years ago
nuria • 0

I solved my problem (wasn't too complicate) so I'm posting my solution here.

In case someonle else want to turn a blast tabular output into a simple bed file, but has problems with subject start values larger than end values.

I mean turn this:

scaffold01109    994    1071    HWUSI-EAS1662:3:114:11269:10361#5/2
scaffold01109    1018    1071    HWUSI-EAS1662:3:1:4638:10404#5/1
scaffold01109    1071    1009    HWUSI-EAS1662:3:23:15834:10646#5/1
scaffold01109    2289    2245    HWUSI-EAS1662:3:24:9622:2914#5/1

I got this file doing:

cat blast+tabularfile.input | awk '{print $2,$9,$10,$1}'> almostbedfile.output

into this:

scaffold01109   994     1071    HWUSI-EAS1662:3:114:11269:10361#5/2
scaffold01109   1018    1071    HWUSI-EAS1662:3:1:4638:10404#5/1
scaffold01109 1009 1071 HWUSI-EAS1662:3:23:15834:10646#5/1
scaffold01109 2245 2289 HWUSI-EAS1662:3:24:9622:2914#5/1
cat input.tsv | awk '{if ($3<$2) print  $1,$3,$2,$4; else if ($3>$2)print $0}' > output.bed

But remember to sort it before use it!

ADD COMMENT

Login before adding your answer.

Traffic: 3829 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6