Question: Extracting the Best sequence per genomic region in a genome based on the Best e-value
1
gravatar for johnny.sf
9 days ago by
johnny.sf10
Brazil, Curitiba, Universidade Federal do Paraná
johnny.sf10 wrote:

Hi everyone,

I'm having trouble trying to filter blast result outputs.

So, I'm using a huge amount of sequences as queries against a certain genome in a local tblastn, which gives me an .txt output. The thing is, I need to extract the best hits, that I've defined as the lowest e-value, for each genomic region that the genome is divided.

I tried sorting in excel with the Filter command, but as the e-value is presented like '1.08e-108', the excel only considers the numbers before the 'e'. Then, in a hypothetical list containing e-values with 1.08e-108, 2.34e-10 and 1.03e-03 values, excel always choose 1.03e-03.

The next thing I tried to do was sorting each genomic region using Pandas, which I transformed the .txt output from blast in a dataframe for better manipulation, but the same thing that happened like in excel.

This way, I'm selecting manually each best hit, but it is taking too much time.

Here's an example of the output:

BrflORs150.1    KN907735.1  23.616  271 186 6   40  299 80310   81092   1.41e-12    75.1

BrflORs150.2    KN907735.1  24.242  264 178 6   41  296 80313   81062   7.55e-09    63.5

BrflORs155.1    KN907735.1  24.825  286 204 4   23  303 80253   81092   1.29e-17    92.4

BrflORs155.1    KN907735.1  22.388  268 188 7   33  290 181025  181798  1.24e-10    70.1

BrflORs155.1    KN907735.1  24.908  273 181 5   41  302 32141   32920   1.84e-10    69.7

BrflORs155.1    KN907735.1  24.254  268 187 7   39  298 191353  192132  2.81e-10    68.9**

BrflORs155.1    KN907685.1  24.739  287 199 8   25  303 37370   38203   9.68e-13    77.0

BrflORs155.1    KN907685.1  25.926  297 189 12  20  301 14077   14919   9.72e-09    63.9

BrflORs155.1    KN909062.1  21.379  290 204 6   23  300 50032   49199   3.01e-12    75.5

BrflORs155.1    KN909062.1  23.132  281 198 5   27  298 33061   32246   7.06e-11    70.9

BrflORs155.1    KN907432.1  25.862  290 181 8   28  300 166293  165475  2.98e-11    72.0

BrflORs155.1    KN906695.1  26.102  295 191 9   22  303 463829  464671  1.27e-10    70.1

BrflORs155.1    KN906695.1  26.689  296 188 8   22  303 485691  486533  3.83e-10    68.6

From those, for example, for the KN907735.1 region, I'd need to select only the query presenting e-value of 1.29e-17, because is the lowest one.

parsing e-value forum genome • 185 views
ADD COMMENTlink modified 7 days ago • written 9 days ago by johnny.sf10

see if this suffices (works with OP data):

$ datamash  -sg 2 min 11 -f < test.txt

BrflORs155.1    KN906695.1      26.102  295     191     9       22      303     463829  464671  1.27e-10        70.1    1.27e-10
BrflORs155.1    KN907432.1      25.862  290     181     8       28      300     166293  165475  2.98e-11        72.0    2.98e-11
BrflORs155.1    KN907685.1      24.739  287     199     8       25      303     37370   38203   9.68e-13        77.0    9.68e-13
BrflORs155.1    KN907735.1      24.825  286     204     4       23      303     80253   81092   1.29e-17        92.4    1.29e-17
BrflORs155.1    KN909062.1      21.379  290     204     6       23      300     50032   49199   3.01e-12        75.5    3.01e-12
ADD REPLYlink written 9 days ago by cpad011215k

On a different note have you tried the option

 -subject_besthit
   Turn on best hit per subject sequence

That may be all you need if you want to keep one good hit.

ADD REPLYlink written 8 days ago by GenoMax96k

Thank you very much! It worked! You are a life savior :')

ADD REPLYlink written 7 days ago by johnny.sf10
2
gravatar for Medhat
9 days ago by
Medhat8.9k
Texas
Medhat8.9k wrote:

Did you try to use the shell command sort -g.
For your file it will be sort -k 11,11 -g yourFile.txt | sort -u -k2,2

ADD COMMENTlink modified 9 days ago • written 9 days ago by Medhat8.9k
1
gravatar for johnny.sf
8 days ago by
johnny.sf10
Brazil, Curitiba, Universidade Federal do Paraná
johnny.sf10 wrote:

Yes! It worked perfectly, thank you guys! I'm very happy right now hahahaa, it has been consuming me for months!

Can I ask just one more question?? For each genomic region (column 2), there are more than one candidate for Best hit, because a genomic region is composed of thousands of nucleotide bases and the candidates I'm looking for have around 750bp. So, for example, in a genomic region KV926207.1, the following result is presented:

query acc.ver subject acc.ver % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score

BrflORs155.1    KV926207.1  25.510  294 190 9   28  304 130633  131478  8.49e-14    81.6

BrflORs155.1    KV926207.1  24.014  279 191 6   32  298 142379  143188  7.75e-11    72.4

BrflORs157.1    KV926207.1  27.113  284 179 8   46  318 130675  131475  1.70e-17    92.8

BrflORs157.1    KV926207.1  26.259  278 164 7   51  310 173618  174382  1.89e-13    80.1

BrflORs157.1    KV926207.1  23.711  291 195 8   37  317 101835  102656  2.25e-13    80.1

BrflORs157.1    KV926207.1  25.000  296 182 7   34  310 142373  143197  2.31e-12    76.6

From those, the best hits may be the ones below. Because, taking for example queries BrflORs155.1 and BrflORs157.1: both hit the same genomic region that is arround 130600 to 131400, but BrflORs157.1 is chosen as the best hit because presented the lowest evalue.

The other regions comprising 101800 to 102600, 142300 to 143000 and 173000 to 174000, may contain other best hits because they are very separated from each other.

BrflORs157.1    KV926207.1  27.113  284 179 8   46  318 130675  131475  1.70e-17    92.8

BrflORs157.1    KV926207.1  25.000  296 182 7   34  310 142373  143197  2.31e-12    76.6 

BrflORs157.1    KV926207.1  26.259  278 164 7   51  310 173618  174382  1.89e-13    80.1

BrflORs157.1    KV926207.1  23.711  291 195 8   37  317 101835  102656  2.25e-13    80.1

I know now how to get the lowest evalue for a certain evalue, but is there a way I can get other best hits for a certain genomic region (for example, KV926207.1), putting some kind of filter saying something like 'Take the lowest evalue in a genomic region if the subject start and end are higher than 1000bp'?

I really appreciated the help. I'm studying right now the sort command on linux so I can get a way to do that :)

ADD COMMENTlink written 8 days ago by johnny.sf10

You should accept @Medhat's comment (which I moved to an answer) then to provide closure (green check mark).

ADD REPLYlink written 8 days ago by GenoMax96k

Yes, we can do so: sort -k 11,11 -g test.tsv | awk 'function abs(v) {return v < 0 ? -v : v}; {if (abs($10-$9) >= 500) {print}}' I am using 500bp because of the difference between s. start s. end on average is 800, but you can change it as you want.

If we used 800bp the results from your example will be:

 BrflORs155.1   KN907735.1  24.825  286 204 4   23  303 80253   81092   1.29e-17    92.4.
BrflORs155.1    KN907685.1  24.739  287 199 8   25  303 37370   38203   9.68e-13    77.0. 
BrflORs155.1    KN909062.1  21.379  290 204 6   23  300 50032   49199   3.01e-12    75.5.
BrflORs155.1    KN907432.1  25.862  290 181 8   28  300 166293  165475  2.98e-11    72.0.
BrflORs155.1    KN909062.1  23.132  281 198 5   27  298 33061   32246   7.06e-11    70.9.
BrflORs155.1    KN906695.1  26.102  295 191 9   22  303 463829  464671  1.27e-10    70.1.
BrflORs155.1    KN906695.1  26.689  296 188 8   22  303 485691  486533  3.83e-10    68.6.
BrflORs155.1    KN907685.1  25.926  297 189 12  20  301 14077   14919   9.72e-09    63.9
ADD REPLYlink modified 8 days ago • written 8 days ago by Medhat8.9k

Yes, this way got really close to what I've been doing manually. Thank you for the help! I really appreciate it.

ADD REPLYlink written 7 days ago by johnny.sf10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1527 users visited in the last hour
_