How to explain the blastx output?
1
0
Entering edit mode
5.2 years ago
Seq225 ▴ 110

Hi:

I am running some blastx. I am yet not 100% sure how to explain the results. For example: suppose, the query seq is 5000 nt long. It has 3 ORF (also suppose the AA lengths of the ORFs are 300, 400, and 500) in different frames. Now, how does blastx report the best hit from all these different ORF?

I am sort of intrigued because I am running blastx with these options (to get the best hit in output format 6):

-max_hsps 1 -max_target_seqs 1

So, what might be happening that I am loosing some data. Probably I am getting the best hit (based on bit-score) from only one ORF, and loosing the others.

Am I thinking it right (the result part)? Any idea how to break down the result for the individual ORF?

Thanks!!!

alignment Assembly rna-seq gene genome • 3.8k views
ADD COMMENT
2
Entering edit mode
5.2 years ago

Exactly, your hunch is correct, the parameter settings you apply are not suited for what you want to obtain. The setting you are using now will indeed only report the best match (and only a single one, == in your case only for one single ORF) and not all valid hits on hte query sequence.

Simply drop the two parameters you mention and filter in post-processing for the best match (per locus) will likely give what you are looking for

ADD COMMENT
0
Entering edit mode

Thank you very much! Is there a way to pipe the sequences of the ORFs (of the query) that blastx uses to output? Also, do you know what is the minimum size of the ORF that blastx uses?

Thanks a ton!!

ADD REPLY
1
Entering edit mode

Yes there is actually.

If you are using Blast+ versions you can ask for tabular output eg. -outfmt 6 , for which you can specify the fields to be outputted such as qseq . Have a look at the blast help blastn -help for more details on this.

Keep in mind that it will output the alignable part of the sequence which is not necessarily the same as the biological ORF .

In regard to your min length question : in theory the min length is the word-size , in practice it will be much longer because you will need much larger aligned parts to meet other filtering criteria (such as E-value) and as such it's not actually possible. to put a number on it.

In general I would (in most cases) not advise to use blast to look for ORFs in a sequence. It's good for giving you an indication but it's not capable of providing correct ORFs.

ADD REPLY
0
Entering edit mode

Thank you very much! I have used another program to get all the ORFs. Is have solved my concerns.

Now, I need something similar to qseq. Can I get the entire sequence of the subject? I want the fasta sequences of the top 5 hits for each of my queries. I do not see any options fro that on blast help page.

Thanks again.

ADD REPLY
1
Entering edit mode

No, that you can't get directly from the blast output file.

However, getting those is not much more difficult. if you can get the IDs from the top5 hits, you can use the blast utility blastdbcmd which will extract the full sequences fro those iDs from the blastDB

ADD REPLY
1
Entering edit mode

Ya!!!!! Thanks. Great that I can run command line to get all the sequences. I actually need to write a script with a loop to get sequences against my individual queries. Thanks!!

ADD REPLY
0
Entering edit mode

I have another question:

I want to run blastp against multiple db, and output 5 best hits (-max_target_seqs 5) from each of them (each db hits) . Suppose, I am running blastp against 5 db; I want all 25 hits into one output file.

I see that people use blastdb_aliastool. I just want your opinion. What is the best way?

Thanks a ton, like always!!

ADD REPLY
1
Entering edit mode

Yes, you can use blastdb_alias tool (or even simply specify all the DBs on the cmdline -db db1 db2 db3 ).

However with the -max_target_seqs 5 you can NOT specify that there will be 5 matches from each DB, you will only get the 5 matches over all DBs (the DBs are then considered as a single DB).

If you want to have control per DB you will have to run the blast once for each DB and concatenate the results (tab blastoutput you can simply cat together)

ADD REPLY
0
Entering edit mode

Yes, it works! However, my problem has an extra step. I have multiple queries (1067 to be exact). I want to run them in a loop. Like this:

**
for i in seq_*
do
  for j in *.dmnd
  do
diamond  blastp -p 30   -v    --outfmt 6   --db "$j"     --query "$i"   -e 5     -o "$i".txt  -k 5 --sensitive
  done
  cat *.txt > All.txt | awk '{print $2}'  >  "i".Name.txt
done**

Seems to be not working ( I have pretty low level of bash scripting knowledge too!)

ADD REPLY
1
Entering edit mode

Seems to be not working

Not working as in you get nothing or are there errors?

To debug your code use an echo command. Look t echo $i and echo $j to make sure those file names are coorectly expanded. Then add an echo before diamond and cat in code above (this will only cause the loop to print command lines) to see what command lines are being produced. This will allow you to find out what is wrong with your loop.

ADD REPLY
0
Entering edit mode

Thanks. I think I am putting the two variables "i", and "j" in a wrong way. I have added echo in my commands:

for i in seq_*
do
  for j in *.dmnd
  do
echo diamond  blastp -p 30   -v    --outfmt 6    --db     echo "$j"     --query     echo "$i"   -e 5     -o "$i".txt  -k 5 --sensitive
  done
  echo cat *.txt > All.txt | awk '{print $2}'  >  "i".Name.txt
done

No output files. And, got this in terminal.

*diamond blastp -p 30 -v --outfmt 6 --db echo db1.dmnd --query echo seq_aaab -e 5 -o seq_aaab.txt -k 5 --sensitive

diamond blastp -p 30 -v --outfmt 6 --db echo db2.dmnd --query echo seq_aaab -e 5 -o seq_aaab.txt -k 5 --sensitive

diamond blastp -p 30 -v --outfmt 6 --db echo db1.dmnd --query echo seq_aaac -e 5 -o seq_aaac.txt -k 5 --sensitive

diamond blastp -p 30 -v --outfmt 6 --db echo db2.dmnd --query echo seq_aaac -e 5 -o seq_aaac.txt -k 5 --sensitive

diamond blastp -p 30 -v --outfmt 6 --db echo db1.dmnd --query echo seq_aaad -e 5 -o seq_aaad.txt -k 5 --sensitive

diamond blastp -p 30 -v --outfmt 6 --db echo db2.dmnd --query echo seq_aaad -e 5 -o seq_aaad.txt -k 5 --sensitive

....

.......*

ADD REPLY
1
Entering edit mode

That is to be expected. What genomax suggested was to add the echo command for debugging your script, if you now want to really run it, you have to remove the echo commands again. As you can see it now simply prints the commands it would have executed. Otherwise your script looks OK I think.

I notice you now switched to diamond (in stead of 'normal' blast), any reason for that?

For the -e parameter: I think you need to add it as -e 1e-5 if you want to set an e-value cutoff of 1e-5 (now you have set it to 5 )

ADD REPLY
0
Entering edit mode

Thanks. I got the echo part. I have been using diamond blastp (my db files are huge). However, I can now switch to regular blastp.

Thanks a ton for pointing out the e-value cut-off.

But, when I run the above loop, the output files are empty.

ADD REPLY
1
Entering edit mode

ok.

are all possible output files empty (so also the intermediate ones) or are there that are not empty?

a first problem I see with the loop is that you output all blast results from the different DBs to the same file, so in the best case you will only get the result of the last blast in the loop. I suggest to output to a file with $i and $j combination (eg. ${i}_${j}.txt ) , I would also omit the use of the " (should work without as well).

ADD REPLY
1
Entering edit mode

your final cat is also not correct. now you create one output file for each of the input seq combining all the DB results. if that is what you want, then it's OK.

If you want to combine all input files over all DBs you need to place the cat statement outside the loops. Moreover: that part will never work as you output the cat to a file called All.txt , the awk part behind it will not do anything. If you want the awk statement to work you'll need to go for this:

cat *.txt | awk '{print $2}'  >  "$i".Name.txt

also note I added a $ to the file I.Name.txt

ADD REPLY
0
Entering edit mode

Wow! Great. Thanks a ton. Almost there!!

I only know pieces about bash. So, confuse things.....

What I am using now is:

for i in seq_*
do
  for j in *.dmnd
  do
 diamond  blastp -p 30       --outfmt 6    --db  $j    --query    $i  -e 0.00001  -o  ${i}_${j}.txt  -k 5 --sensitive
  done
    cat *.txt | awk '{print $2}'  | sort  | uniq >  "$i".Name.txt
done

It's working the way I wanted except the cat part. I am probably putting it in a wrong place. What I want is: cat all the .txt output (from each of the db) for each of my query separately. So that I would get the top 5 hits in separate files for each of the query sequence from all the databases.

Can You help me with that?

Thanks!!

ADD REPLY
1
Entering edit mode

the problem there is that for each iteration in your loop the *.txt part will match all files txt files that are present in the current working folder. To avoid that I propose to do something like this:

cat ${i}*.txt | awk '{print $2}'  | sort  | uniq >  "$i".Name.txt

this will limit cat to only use files of each input seq and not all txt files. This will result in a file per input seq file with all the blast results against all DBs.

ADD REPLY
0
Entering edit mode

Thank you so much!!!!!!!!!! It's now working!!

ADD REPLY

Login before adding your answer.

Traffic: 2601 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6