Question

Best hit from BLAST for a particular genomic region

0

Entering edit mode

2.9 years ago

johnny.sf ▴ 20

Hello everybody, is everything okay with all of you?

After months of searching and frustrating scripts in Python that I wrote, I'm here asking for help!

From a TBLASTN output result, I've been trying to select the best hit (chosen as the one with the lowest evalue) from a particular genomic region (within a scaffold, contig etc), but doing manually takes a lot of time!

Is anyone knows a way can speed up this manual process in a automate way?

P.S. the output has over 300.000 results that I pass through an awk command that returns only the results over 750 nucleotide length and puts the evalues in a increasing order that makes selecting Best hits easier.

P.S. for a given subject acc.ver (known as the scaffold, contig etc id) there are a lot of hits per genomic region, so I'm not trying to select the best hit per scaffold, contig etc, but I'm trying to select all the best hits per non overlapped genomic region within each one of the subject acc.ver.

I read an old post from goubert.clement (BLAT: how to select best hit at one genomic position? (queries are repeats)) and Amitm suggested using Bowtie for he's particular problem with repetitive sequences, but I don't know if it can resolve mine...

Thank you in advance :)

genomic blast region best hit evalue • 727 views

ADD COMMENT • link updated 2.9 years ago by Alex Reynolds 35k • written 2.9 years ago by johnny.sf ▴ 20

score 1 · Answer 1 · 2021-06-14

You could use the -outfmt 6 with desired fields to write tab-delimited output (see tblastn -help):

$ tblastn [other options...] -outfmt 6 "sseqid sstart send sseqid evalue sstrand" | sort-bed - > hits.bed

The result hits.bed is a sorted six-column BED file.

The above fields are generally the minimum set needed for defining a strand-oriented genomic interval. You can add additional fields, if you need more detail; see https://www.metagenomics.wiki/tools/blast/blastn-output-format-6 for a full listing of available keywords.

The fifth column contains the E-value (evalue). You can sort on this, or use hits.bed directly with set operation and association tools like bedops and bedmap.

The bedops tool will let you build a list of non-overlapping ("disjoint") genomic intervals by applying the --partition operator on hits, for example. Or you can use your own genomic intervals (sorted with sort-bed).

However they are made, the bedmap tool can map hits from hits.bed to your non-overlapping intervals.

A score selection function can be added to get the best hit per interval. Specifically, using the --max-element operator with bedmap will return the maximum-scoring hit per non-overlapping interval. In the simplest case:

$ bedmap --echo --max-element non-overlapping-intervals.bed hits.bed > answer.bed

Or you can use --echo-map and apply your own sort or selection criteria on the E-value or other parameters. There are several other score and element functions that could also be useful; see bedmap --help or the online documentation for more detail.