Extracting the Best sequence per genomic region in a genome based on the Best e-value
2
1
Entering edit mode
8 months ago
johnny.sf ▴ 20

Hi everyone,

I'm having trouble trying to filter blast result outputs.

So, I'm using a huge amount of sequences as queries against a certain genome in a local tblastn, which gives me an .txt output. The thing is, I need to extract the best hits, that I've defined as the lowest e-value, for each genomic region that the genome is divided.

I tried sorting in excel with the Filter command, but as the e-value is presented like '1.08e-108', the excel only considers the numbers before the 'e'. Then, in a hypothetical list containing e-values with 1.08e-108, 2.34e-10 and 1.03e-03 values, excel always choose 1.03e-03.

The next thing I tried to do was sorting each genomic region using Pandas, which I transformed the .txt output from blast in a dataframe for better manipulation, but the same thing that happened like in excel.

This way, I'm selecting manually each best hit, but it is taking too much time.

Here's an example of the output:

BrflORs150.1    KN907735.1  23.616  271 186 6   40  299 80310   81092   1.41e-12    75.1

BrflORs150.2    KN907735.1  24.242  264 178 6   41  296 80313   81062   7.55e-09    63.5

BrflORs155.1    KN907735.1  24.825  286 204 4   23  303 80253   81092   1.29e-17    92.4

BrflORs155.1    KN907735.1  22.388  268 188 7   33  290 181025  181798  1.24e-10    70.1

BrflORs155.1    KN907735.1  24.908  273 181 5   41  302 32141   32920   1.84e-10    69.7

BrflORs155.1    KN907735.1  24.254  268 187 7   39  298 191353  192132  2.81e-10    68.9**

BrflORs155.1    KN907685.1  24.739  287 199 8   25  303 37370   38203   9.68e-13    77.0

BrflORs155.1    KN907685.1  25.926  297 189 12  20  301 14077   14919   9.72e-09    63.9

BrflORs155.1    KN909062.1  21.379  290 204 6   23  300 50032   49199   3.01e-12    75.5

BrflORs155.1    KN909062.1  23.132  281 198 5   27  298 33061   32246   7.06e-11    70.9

BrflORs155.1    KN907432.1  25.862  290 181 8   28  300 166293  165475  2.98e-11    72.0

BrflORs155.1    KN906695.1  26.102  295 191 9   22  303 463829  464671  1.27e-10    70.1

BrflORs155.1    KN906695.1  26.689  296 188 8   22  303 485691  486533  3.83e-10    68.6


From those, for example, for the KN907735.1 region, I'd need to select only the query presenting e-value of 1.29e-17, because is the lowest one.

genome e-value parsing Forum • 419 views
0
Entering edit mode

see if this suffices (works with OP data):

datamash -sg 2 min 11 -f < test.txt BrflORs155.1 KN906695.1 26.102 295 191 9 22 303 463829 464671 1.27e-10 70.1 1.27e-10 BrflORs155.1 KN907432.1 25.862 290 181 8 28 300 166293 165475 2.98e-11 72.0 2.98e-11 BrflORs155.1 KN907685.1 24.739 287 199 8 25 303 37370 38203 9.68e-13 77.0 9.68e-13 BrflORs155.1 KN907735.1 24.825 286 204 4 23 303 80253 81092 1.29e-17 92.4 1.29e-17 BrflORs155.1 KN909062.1 21.379 290 204 6 23 300 50032 49199 3.01e-12 75.5 3.01e-12  ADD REPLY 0 Entering edit mode On a different note have you tried the option  -subject_besthit Turn on best hit per subject sequence  That may be all you need if you want to keep one good hit. ADD REPLY 0 Entering edit mode Thank you very much! It worked! You are a life savior :') ADD REPLY 2 Entering edit mode 8 months ago Medhat 9.0k Did you try to use the shell command sort -g. For your file it will be sort -k 11,11 -g yourFile.txt | sort -u -k2,2 ADD COMMENT 1 Entering edit mode 8 months ago johnny.sf ▴ 20 Yes! It worked perfectly, thank you guys! I'm very happy right now hahahaa, it has been consuming me for months! Can I ask just one more question?? For each genomic region (column 2), there are more than one candidate for Best hit, because a genomic region is composed of thousands of nucleotide bases and the candidates I'm looking for have around 750bp. So, for example, in a genomic region KV926207.1, the following result is presented: query acc.ver subject acc.ver % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score BrflORs155.1 KV926207.1 25.510 294 190 9 28 304 130633 131478 8.49e-14 81.6 BrflORs155.1 KV926207.1 24.014 279 191 6 32 298 142379 143188 7.75e-11 72.4 BrflORs157.1 KV926207.1 27.113 284 179 8 46 318 130675 131475 1.70e-17 92.8 BrflORs157.1 KV926207.1 26.259 278 164 7 51 310 173618 174382 1.89e-13 80.1 BrflORs157.1 KV926207.1 23.711 291 195 8 37 317 101835 102656 2.25e-13 80.1 BrflORs157.1 KV926207.1 25.000 296 182 7 34 310 142373 143197 2.31e-12 76.6  From those, the best hits may be the ones below. Because, taking for example queries BrflORs155.1 and BrflORs157.1: both hit the same genomic region that is arround 130600 to 131400, but BrflORs157.1 is chosen as the best hit because presented the lowest evalue. The other regions comprising 101800 to 102600, 142300 to 143000 and 173000 to 174000, may contain other best hits because they are very separated from each other. BrflORs157.1 KV926207.1 27.113 284 179 8 46 318 130675 131475 1.70e-17 92.8 BrflORs157.1 KV926207.1 25.000 296 182 7 34 310 142373 143197 2.31e-12 76.6 BrflORs157.1 KV926207.1 26.259 278 164 7 51 310 173618 174382 1.89e-13 80.1 BrflORs157.1 KV926207.1 23.711 291 195 8 37 317 101835 102656 2.25e-13 80.1  I know now how to get the lowest evalue for a certain evalue, but is there a way I can get other best hits for a certain genomic region (for example, KV926207.1), putting some kind of filter saying something like 'Take the lowest evalue in a genomic region if the subject start and end are higher than 1000bp'? I really appreciated the help. I'm studying right now the sort command on linux so I can get a way to do that :) ADD COMMENT 0 Entering edit mode You should accept @Medhat's comment (which I moved to an answer) then to provide closure (green check mark). ADD REPLY 0 Entering edit mode Yes, we can do so: sort -k 11,11 -g test.tsv | awk 'function abs(v) {return v < 0 ? -v : v}; {if (abs(10-\$9) >= 500) {print}}' I am using 500bp because of the difference between s. start s. end on average is 800, but you can change it as you want.

If we used 800bp the results from your example will be:

 BrflORs155.1   KN907735.1  24.825  286 204 4   23  303 80253   81092   1.29e-17    92.4.
BrflORs155.1    KN907685.1  24.739  287 199 8   25  303 37370   38203   9.68e-13    77.0.
BrflORs155.1    KN909062.1  21.379  290 204 6   23  300 50032   49199   3.01e-12    75.5.
BrflORs155.1    KN907432.1  25.862  290 181 8   28  300 166293  165475  2.98e-11    72.0.
BrflORs155.1    KN909062.1  23.132  281 198 5   27  298 33061   32246   7.06e-11    70.9.
BrflORs155.1    KN906695.1  26.102  295 191 9   22  303 463829  464671  1.27e-10    70.1.
BrflORs155.1    KN906695.1  26.689  296 188 8   22  303 485691  486533  3.83e-10    68.6.
BrflORs155.1    KN907685.1  25.926  297 189 12  20  301 14077   14919   9.72e-09    63.9

0
Entering edit mode

Yes, this way got really close to what I've been doing manually. Thank you for the help! I really appreciate it.