How do you get the result of BLAST like this paper?
1
1
Entering edit mode
2.6 years ago
Riku ▴ 80

Hi all.

Now, I’m trying to get an assembly stat with reference to table S1 and figure S1 in this paper.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7266049/

I used the following command to get the result of BLAST for my fasta.

blastx -query /home/nkarim/avenae/trinity_even_out_dir/Trinity.300.longest.fasta \
-db /home/nkarim/blast/db/nr \
-outfmt 6 \
-evalue 1e-3 \
-out /home/nkarim/blast/output/avenae.out

But I don't know how do I get the items;"Top BLASTx-hit species", "Percent of gene with at least one BLASTx hit" and "Percent of gene with at least one GO annotation". Can I get each items using the result of BLAST?

And could you tell me how to get a table like table S1 if you know it?

Thank you very much for your help in advance!

R BLAST RNAseq Trinity • 1.5k views
ADD COMMENT
3
Entering edit mode
2.6 years ago

Since you're blasting with ncbi-NR you're almost there. By changing the blastx output options to outfmt 6 you're receiving tabular output in avenae.out, you can customise the output columns to also receive the species names.

For example, you could run:

blastx -query /home/nkarim/avenae/trinity_even_out_dir/Trinity.300.longest.fasta \
-db /home/nkarim/blast/db/nr \
-outfmt "6 qseqid sseqid sscinames scomnames staxid pident length mismatch gapopen qstart qend sstart send ppos evalue bitscore" \
-evalue 1e-3 \
-out /home/nkarim/blast/output/avenae.out

Now you have all these extra columns, sscinames and scomnames are useful for your end as these contain the scientific name and the common name of your hits.

The laziest way to get the 'top' hit of your queries is to keep only the first hit seen, that normally has the lowest e-value or the highest score :

awk '! a[$1]++' avenae.out > avenae_first_hit_only.out

This falls apart when you have several species with equally good hits, you should still check whether the first hits reported are actually the best hits (manually?). From this out file you also get the percentage of genes with at least one BLAST hit:

wc -l avenae_first_hit_only.out

That will report the number of genes with at least one hit. For GO terms, I'm not sure - you could run GO annotation for your own genes, there are a few online tools for that like PANNZER2 or eggnog-mapper where you can upload your proteins.

For Table S1 (Assembly stats), Trinity has scripts for that, see this older thread Where to find the Trinity assembly statistics?

ADD COMMENT
0
Entering edit mode

I appreciate your help!

Firstly, I had already ran Trinity_Stats.pl. After reading various papers, I thought that these item was not enough in it. So I'm collecting additional data.

If I try it as you say, do I have to re-run BLAST? I would like to avoid re-running if I can, because it takes a long time.

And I tried it as you, an error message was outputed as follows. Was "qseqid" an error option? Or does it involve my environment? I use Anaconda.

blastx -query /home/nkarim/avenae/trinity_even_out_dir/Trinity.300.longest.fasta \
-db /home/nkarim/blast/db/nr \
-outfmt "6 qseqid sseqid sscinames scomnames staxid pident length mismatch gapopen qstart qend sstart send ppos evalue bitscore" \
-evalue 1e-3 \
-out /home/nkarim/blast/output/re_avenae.out

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Error: Too many positional arguments (1), the offending value: qseqid
Error:  (CArgException::eSynopsis) Too many positional arguments (1), the offending value: qseqid 

I'm begginer at analysis of RNAseq, so sorry if these are an inappropriate question. Thank you.

ADD REPLY
0
Entering edit mode

Hm it looks like your job is removing the quotation marks around the blastx command?

blastx -db test.fasta -query test.fasta -outfmt "6 qseqid sseqid sscinames scomnames staxid pident length mismatch gapopen qstart qend sstart send ppos evalue bitscore"

works fine,

 blastx -db test.fasta -query test.fasta -outfmt 6 qseqid sseqid sscinames scomnames staxid pident length mismatch gapopen qstart qend sstart send ppos evalue bitscore

results in the error message you're seeing. Are you running this command via some job submission script? In which case you might want to use escapes (\") to keep the quotation marks.

You could pull out the common names with a script from the efetch API using your NR hits, might be a bit fiddly thoguh! I hven't done that in a while

ADD REPLY
0
Entering edit mode

Sorry for my late. It took a time for finisheing BLAST.

I forgot using escape(\"), you're right.

I got another problem, but I try to think I will ask question in difference place because the content is different.

Thank you very much for your cooperation! I was helped by your kind advice everytime!

ADD REPLY

Login before adding your answer.

Traffic: 2640 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6