Question: each protein with each protein
0
gravatar for natasha.sernova
6.1 years ago by
natasha.sernova3.7k
natasha.sernova3.7k wrote:

Dear colleagues,

I have a lot of proteins. What is the

simplest way to compare each one

of them with all the other proteins?

Probably I have to make a database

and use a program like BlastP.

But how to make sure that I've checked

all the protein pairs I have?

I feel that I have to use "for" in Linux,

but I've failed to finish the cycle expression.

Please. help me!

Many thanks!

N.

blast • 1.5k views
ADD COMMENTlink modified 6.1 years ago by 5heikki8.9k • written 6.1 years ago by natasha.sernova3.7k

You will need to learn how to compare one against one before attempting all vs all. Yes using "for" in linux will be important, but what do you hope to achieve? What kind of comparison? I feel this task is too large, all vs all is not going to be meaningful.

ADD REPLYlink written 6.1 years ago by karl.stamm3.6k
2
gravatar for 5heikki
6.1 years ago by
5heikki8.9k
Finland
5heikki8.9k wrote:

The simplest:

blastp -query all.fasta -subject all.fasta

Better:

makeblastdb -in all.fasta -dbtype prot -title all -out all -hash_index
blastp -query all.fasta -db all -out all-vs-all -outfmt 6

Better still:

makeblastdb -in all.fasta -dbtype prot -title all -out all -hash_index
blastp -query all.fasta -db all -out all-vs-all -outfmt 6 -use_sw_tback -seg yes -soft_masking true -max_target_seqs 20000 -num_threads "MAXonYourSystem"
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
for i in {4..6}
do
    awk '$11 <= 1e-"$i"' all-vs-all > all-vs-all.1e-"$i"
done

And all the other stuff like e.g. sorting for best hits, but really, your question was way too vague to say what should be done. E.g., if the seqs are from NCBI, you could include taxid_map in makeblastdb arguments and make use of that in blastp..

ADD COMMENTlink modified 5 months ago by RamRS27k • written 6.1 years ago by 5heikki8.9k

Dear 5heikki,

That's really great, thank you very much! My goal is to find a correct way to search for the best hit.

It turned out to be rather complicated. "Sorting for best hit" - could you, please, explain to me, how to do it correctly!

My seqs are from NCBI. Honestly, I don't know how to do this second part: "you could include taxid_map in makeblastdb arguments and make use of that in blastp". It would be extremely useful to me to learn this. Please-please-please, give me more details! Many-many thanks!

ADD REPLYlink modified 5 months ago by RamRS27k • written 6.1 years ago by natasha.sernova3.7k

Dear 5heikki,

That's really great, thank you very much! My goal is to find a correct way to search for the best hit.

It turned out to be rather complicated. "Sorting for best hit" - could you, please, explain to me, how to do it correctly!

My seqs are from NCBI. Honestly, I don't know how to do this second part: "you could include taxid_map in makeblastdb arguments and make use of that in blastp". It would be extremely useful to me to learn this. Please-please-please, give me more details!

Many-many thanks!

P.S. I'm not sure where I have to place my answer, so I've just repeat it here.

ADD REPLYlink modified 5 months ago by RamRS27k • written 6.1 years ago by natasha.sernova3.7k

This sorts tabular blast output (-outfmt 6) for best hit based on 1) bitscore, 2) evalue, 3) % identity:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
sort -k1,1 -k12,12gr -k11,11g -k3,3gr blastOut.tsv | sort -u -k1,1 --merge > bestHits

But before that you might have to exclude self hits, something like:

awk '$1 != $2' blastOut.tsv > blastOut.tsv.noSelfHits

As to taxid_map, this file links all protein GIs to NCBI taxon IDs, so you can use it with the taxid_map argument. But then you also need to use the '-parse_seqids' flag in makeblastdb. Then you also need NCBI tax_db in your db dir. I believe all this is documented and explained further in blast manual. Anyway, then when you have this stuff working you can do e.g. -outfmt '6 std sscinames' to have 13 column output with species name as the last column..

ADD REPLYlink modified 5 months ago by RamRS27k • written 6.1 years ago by 5heikki8.9k

5heikki <http: www.biostars.org="" u="" 9490=""/>

2014-06-14 16:56 GMT+04:00 5heikki on Biostar <notifications@biostars.org>:

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by natasha.sernova3.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1188 users visited in the last hour