Question: Find overlapping blast hits and keep the highest evalue
gravatar for maxime.policarpo
6 months ago by
France, Paris
maxime.policarpo110 wrote:

Hi everyone,

I am performing tblastn with a set of >1000 proteins as queries against a genome.

I am trying to keep every regions of my genome that match a query protein (evalue > 1e-10) but in many cases, 1 genome region will have many hits (several queries in the same region). This is mostly due that my proteins are all similar (same gene family)

For example :

query1 hit scaffold 1 from coordinates 60 to 120 (E = 1e-5)

query2 hit scaffold1 from coordinates 70 to 110 (E = 1e-3)

To filter those results, i would like to find a way to : 1) Find regions with overlapping queries 2) Keep only the best-hit on these regions (based on e-value)

(here, i would keep coordinates 60 to 120 on scaffold 1)

I have a tabular output from blast (outfmt 6) but i can't find an efficient way to apply such filters.

I would prefer something in R or bash but i could try to understand other languages.

Thanks for your help,


blast R genome • 321 views
ADD COMMENTlink modified 6 months ago • written 6 months ago by maxime.policarpo110
gravatar for maxime.policarpo
6 months ago by
France, Paris
maxime.policarpo110 wrote:

Thanks for your answers !

I finally did it using R and the package GenomicRanges as suggested by Macolm.Cook.

I provide the code below :

blast_rslt <- read.table("blast_results.tblastn", header=FALSE, sep="\t")

colnames(blast_rslt) <- c("query", "sseqid", "pident", "length", "mismatch", 
                          "gapopen", "qstart", "qend", "sstart", "send",
                          "evalue", "bitscore")

test <- blast_rslt %>% dplyr::select(sseqid, sstart, send, evalue) 

test <- test %>% mutate(i_start = case_when(
  sstart < send ~ sstart,
  send < sstart ~ send

test <- test %>% mutate(i_end = case_when(
  sstart < send ~ send,
  send < sstart ~ sstart

test <- test %>% mutate(strand = case_when(
  sstart < send ~ "+",
  send < sstart ~ "-"

test <- test %>% select(sseqid, i_start, i_end, evalue, strand)

colnames(test) <- c("seqnames", "start", "end", "evalue", "strand")

test_irange <- test %>% as_granges()

test_disjoin <- reduce(test_irange,with.revmap=TRUE)

list_revmap <-

filtered_data <- c()
for(i in 1:nrow(list_revmap)){
  filtered_data <- c(filtered_data, (slice(list_revmap, i) %>% unlist(use.names=FALSE))[which.min(slice(test,  slice(list_revmap, i) %>% unlist(use.names=FALSE))$evalue)])

Best_hits <- slice(test, filtered_data)
ADD COMMENTlink modified 6 months ago by genomax91k • written 6 months ago by maxime.policarpo110

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work. You can also accept your own answer(as an exception) in this case since you provided actual code which was implied in @Malcom's answer.

ADD REPLYlink written 6 months ago by genomax91k
gravatar for Alex Reynolds
6 months ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

If you reformat your BLAST output into sorted BED files, with five columns (i.e. with the score data in the fifth column), you could use BEDOPS bedmap to group regions by scaffold ("chromosome") and then use its --min-element argument to get the lowest-scoring value on overlapping regions.

To get the outfmt 6-formatted BLAST hit data into BED, you could use awk to select columns from the one-indexed BLAST output file (say, a file called blastHits.mtx), and write them to a zero-indexed, sorted BED file:

$ awk -v FS="\t" -v OFS="\t" '{print($2, ($7-1), $8, $1, $11)}' blastHits.mtx | sort-bed - > blastHits.bed

The blastHits.bed file will be sorted and contain the expectation score in the fifth column.

Let's also say the regions of interest over your genome are collected in a sorted BED file called roi.bed.

You could use the following command in a bash shell to get the lowest-scored element over regions-of-interest:

$ bedmap --echo --min-element --skip-unmapped roi.bed blastHits.bed > answer.bed

The file answer.bed will contain, for each region-of-interest that has an overlapping BLAST hit, the region of interest, and its lowest-scored BLAST hit that overlaps by one or more bases.

If a region-of-interest has no overlapping BLAST hits, it is not included in the output file.

The key is that the names of the scaffolds or chromosomes in both blastHits.bed and roi.bed must be identical, in order for the map step to work correctly. So just check the first column of both files to make sure you're using the same naming scheme.

This gives results for overlaps of one or more bases. If you want more stringency, there are options for that. Take a look at bedmap --help for the overlap options, or read the online documentation for more detail.

ADD COMMENTlink modified 6 months ago • written 6 months ago by Alex Reynolds31k

Just a note that there has been a *-float128 version of BEDOPS tools since v2.4.32, which offers greater precision than the default set of tools:

This "float128" version of the toolkit would be useful if you need to distinguish hits with very low E-values (anything lower than 1e-323).

It's been a while, but I think you could get very low E-values with sufficiently long BLAST queries, so it could be something to be aware of, depending on the makeup of your data and what kinds of E-values you are seeing in the hits.

Another option is to do a -log10 transformation on the E-value, and swap out --min-element for --max-element in the bedmap statement, to instead get the highest scoring element.

Additional precision should not be needed, in this case, because the negative-log transformation brings values into ranges that are easy to work with, with the standard precision toolkit.

ADD REPLYlink modified 6 months ago • written 6 months ago by Alex Reynolds31k

On reading the edit to your question, it looks like all your E-values are greater than 1e-10, so you would not need to use the float128 binaries. The regular binaries would work fine here.

ADD REPLYlink written 6 months ago by Alex Reynolds31k
gravatar for Malcolm.Cook
6 months ago by
kansas, usa
Malcolm.Cook1.1k wrote:

Here's an approach using R:

The GenomicRanges package provides function disjoin with which you can compute all disjoint (non-overlapping) ranges of your hits. When passed optional argument with.revmap=TRUE, the results not only include the desired disjoint ranges, but also the corresponding indices of the hits which overlap them, which you can use to index back into your hits, and minimize on e-value (probably using which.min).

ADD COMMENTlink modified 6 months ago • written 6 months ago by Malcolm.Cook1.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1979 users visited in the last hour