Question

Extracting the Best sequence per genomic region in a genome based on the Best e-value

1

Entering edit mode

4.4 years ago

johnny.sf ▴ 20

Hi everyone,

I'm having trouble trying to filter blast result outputs.

So, I'm using a huge amount of sequences as queries against a certain genome in a local tblastn, which gives me an .txt output. The thing is, I need to extract the best hits, that I've defined as the lowest e-value, for each genomic region that the genome is divided.

I tried sorting in excel with the Filter command, but as the e-value is presented like '1.08e-108', the excel only considers the numbers before the 'e'. Then, in a hypothetical list containing e-values with 1.08e-108, 2.34e-10 and 1.03e-03 values, excel always choose 1.03e-03.

The next thing I tried to do was sorting each genomic region using Pandas, which I transformed the .txt output from blast in a dataframe for better manipulation, but the same thing that happened like in excel.

This way, I'm selecting manually each best hit, but it is taking too much time.

Here's an example of the output:

BrflORs150.1    KN907735.1  23.616  271 186 6   40  299 80310   81092   1.41e-12    75.1
BrflORs150.2    KN907735.1  24.242  264 178 6   41  296 80313   81062   7.55e-09    63.5
BrflORs155.1    KN907735.1  24.825  286 204 4   23  303 80253   81092   1.29e-17    92.4
BrflORs155.1    KN907735.1  22.388  268 188 7   33  290 181025  181798  1.24e-10    70.1
BrflORs155.1    KN907735.1  24.908  273 181 5   41  302 32141   32920   1.84e-10    69.7
BrflORs155.1    KN907735.1  24.254  268 187 7   39  298 191353  192132  2.81e-10    68.9**
BrflORs155.1    KN907685.1  24.739  287 199 8   25  303 37370   38203   9.68e-13    77.0
BrflORs155.1    KN907685.1  25.926  297 189 12  20  301 14077   14919   9.72e-09    63.9
BrflORs155.1    KN909062.1  21.379  290 204 6   23  300 50032   49199   3.01e-12    75.5
BrflORs155.1    KN909062.1  23.132  281 198 5   27  298 33061   32246   7.06e-11    70.9
BrflORs155.1    KN907432.1  25.862  290 181 8   28  300 166293  165475  2.98e-11    72.0
BrflORs155.1    KN906695.1  26.102  295 191 9   22  303 463829  464671  1.27e-10    70.1
BrflORs155.1    KN906695.1  26.689  296 188 8   22  303 485691  486533  3.83e-10    68.6

From those, for example, for the KN907735.1 region, I'd need to select only the query presenting e-value of 1.29e-17, because is the lowest one.

genome e-value parsing • 1.8k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 4.4 years ago by johnny.sf ▴ 20

0

Entering edit mode

see if this suffices (works with OP data):

$ datamash  -sg 2 min 11 -f < test.txt

BrflORs155.1    KN906695.1      26.102  295     191     9       22      303     463829  464671  1.27e-10        70.1    1.27e-10
BrflORs155.1    KN907432.1      25.862  290     181     8       28      300     166293  165475  2.98e-11        72.0    2.98e-11
BrflORs155.1    KN907685.1      24.739  287     199     8       25      303     37370   38203   9.68e-13        77.0    9.68e-13
BrflORs155.1    KN907735.1      24.825  286     204     4       23      303     80253   81092   1.29e-17        92.4    1.29e-17
BrflORs155.1    KN909062.1      21.379  290     204     6       23      300     50032   49199   3.01e-12        75.5    3.01e-12

ADD REPLY • link 4.4 years ago by cpad0112 21k

0

Entering edit mode

On a different note have you tried the option

 -subject_besthit
   Turn on best hit per subject sequence

That may be all you need if you want to keep one good hit.

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

Thank you very much! It worked! You are a life savior :')

ADD REPLY • link 4.4 years ago by johnny.sf ▴ 20

score 2 · Accepted Answer · 2021-02-16

2

Entering edit mode

4.4 years ago

Medhat 9.8k

Did you try to use the shell command sort -g.
For your file it will be sort -k 11,11 -g yourFile.txt | sort -u -k2,2

ADD COMMENT • link 4.4 years ago by Medhat 9.8k

score 1 · Accepted Answer · 2021-02-17

Yes! It worked perfectly, thank you guys! I'm very happy right now hahahaa, it has been consuming me for months!

Can I ask just one more question?? For each genomic region (column 2), there are more than one candidate for Best hit, because a genomic region is composed of thousands of nucleotide bases and the candidates I'm looking for have around 750bp. So, for example, in a genomic region KV926207.1, the following result is presented:

query acc.ver subject acc.ver % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score

BrflORs155.1    KV926207.1  25.510  294 190 9   28  304 130633  131478  8.49e-14    81.6

BrflORs155.1    KV926207.1  24.014  279 191 6   32  298 142379  143188  7.75e-11    72.4

BrflORs157.1    KV926207.1  27.113  284 179 8   46  318 130675  131475  1.70e-17    92.8

BrflORs157.1    KV926207.1  26.259  278 164 7   51  310 173618  174382  1.89e-13    80.1

BrflORs157.1    KV926207.1  23.711  291 195 8   37  317 101835  102656  2.25e-13    80.1

BrflORs157.1    KV926207.1  25.000  296 182 7   34  310 142373  143197  2.31e-12    76.6

From those, the best hits may be the ones below. Because, taking for example queries BrflORs155.1 and BrflORs157.1: both hit the same genomic region that is arround 130600 to 131400, but BrflORs157.1 is chosen as the best hit because presented the lowest evalue.

The other regions comprising 101800 to 102600, 142300 to 143000 and 173000 to 174000, may contain other best hits because they are very separated from each other.

BrflORs157.1    KV926207.1  27.113  284 179 8   46  318 130675  131475  1.70e-17    92.8

BrflORs157.1    KV926207.1  25.000  296 182 7   34  310 142373  143197  2.31e-12    76.6 

BrflORs157.1    KV926207.1  26.259  278 164 7   51  310 173618  174382  1.89e-13    80.1

BrflORs157.1    KV926207.1  23.711  291 195 8   37  317 101835  102656  2.25e-13    80.1

I know now how to get the lowest evalue for a certain evalue, but is there a way I can get other best hits for a certain genomic region (for example, KV926207.1), putting some kind of filter saying something like 'Take the lowest evalue in a genomic region if the subject start and end are higher than 1000bp'?

I really appreciated the help. I'm studying right now the sort command on linux so I can get a way to do that :)