To get number of hits from blastp output file
3.5 years ago
fec2 ▴ 40

Hi all,

I have multiple blastp output files (format 6) in a directory, and for each file I want to count the number of hits with a sequence identity of more than 40%. I tried:

for i in *.tsv; do awk '$3>=40' "$i" | wc -l; done

However, this command only prints a list of numbers in the terminal, without showing which blastp output file each number belongs to. Is there a modification I can make so that each number is matched to its file? Thank you.

3.5 years ago
AK ★ 2.1k

Hi fec2,

Try:

for i in *.tsv; do echo -ne "${i}\t" && awk '$3>=40{print $2}' "${i}" | sort -u | wc -l; done


It should return something like:

blast_out1.tsv  217
blast_out2.tsv  172
blast_out3.tsv  215
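
One thing worth noting: because of the `sort -u`, the command above counts distinct subject IDs (column 2) rather than every line that passes the identity filter, so the numbers can be lower than what `awk '$3>=40' file | wc -l` gives. Here is a minimal sketch of the difference. The file `demo.tsv` and its four rows are made up purely for illustration, and `tr` is only there to strip the padding some `wc` implementations add.

```shell
# Build a tiny, made-up format-6 file (hypothetical data) in a temp directory.
tmpdir=$(mktemp -d)
printf 'q1\ts1\t55.0\nq1\ts1\t45.0\nq1\ts2\t41.2\nq2\ts3\t30.0\n' > "$tmpdir/demo.tsv"

# Count every row whose identity (column 3) is >= 40.
rows=$(awk -F'\t' '$3>=40' "$tmpdir/demo.tsv" | wc -l | tr -d '[:space:]')

# Count distinct subject IDs (column 2) among those rows.
subjects=$(awk -F'\t' '$3>=40{print $2}' "$tmpdir/demo.tsv" | sort -u | wc -l | tr -d '[:space:]')

# Prints: rows=3 subjects=2
printf 'rows=%s subjects=%s\n' "$rows" "$subjects"
```

If you want the number of hit lines rather than the number of distinct subjects, drop the `{print $2}` and the `sort -u`.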


Thank you very much! Is it possible to get the result in an output file?


You're welcome. Try this:

(for i in *.tsv; do echo -ne "${i}\t" && awk '$3>=40{print $2}' "${i}" | sort -u | wc -l; done) > output.txt
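
A possible variant, in case `echo -ne` behaves differently in your shell: the redirection can go directly after `done`, and `printf` can build each line. This is only a sketch run on two made-up files (`blast_a.tsv` and `blast_b.tsv` are hypothetical names and data) inside a temporary directory, so it will not touch your real results.

```shell
# Create two small, made-up format-6 files in a temp directory.
workdir=$(mktemp -d)
printf 'q1\ts1\t55.0\nq1\ts2\t41.2\nq2\ts3\t30.0\n' > "$workdir/blast_a.tsv"
printf 'q1\ts9\t39.9\n' > "$workdir/blast_b.tsv"

# For each file, count distinct subject IDs with identity >= 40,
# then write "filename<TAB>count" lines to one output file.
for f in "$workdir"/*.tsv; do
    n=$(awk -F'\t' '$3>=40{print $2}' "$f" | sort -u | wc -l | tr -d '[:space:]')
    printf '%s\t%s\n' "$(basename "$f")" "$n"
done > "$workdir/output.txt"

# Show the result: blast_a.tsv has 2 qualifying subjects, blast_b.tsv has 0.
cat "$workdir/output.txt"
```

The loop's whole output goes to the file in one redirection, so the surrounding parentheses are not needed in this form.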


The command works well, thank you again!


Hi fec2,


Hi, thanks for your comment. I have accepted the answer. Thank you.


Hi,

May I know where I can find a manual that explains what all these commands mean? I am new to this field and have no clue where to find this information. I really appreciate your help.


Hello fec2,

You can read the manual pages with man echo, man awk, man sort, and man wc. I'd also recommend "Ch3. Remedial Unix Shell" and "Ch7. Unix Data Tools" in the book Bioinformatics Data Skills by Vince Buffalo. :-)