Question: Presence/absence of proteins blastp nr database
0
gravatar for freekwilly
2.7 years ago by
freekwilly0
freekwilly0 wrote:

Hi all,

I am a molecular biologist in the first place, but came into touch with bioinformatic challenges more often recently. For simple tasks I've been using command line blast+ already and now I am wondering whether it is possible to do the following with blast+:

Let's say I have a random protein sequence in a file named RP1.faa and I want to check each bacterial strain in the refseq nr database on whether RP1 ist present or absent by an e-value threshold of e.g. 1e-05. And I want a table which shows me only the best hit for each strain in the first column and e.g. e-value, bit-score and so on in the following columns. And in each row I want the next strain to be shown, even if no blast hit was found In the end I am imagining a matrix which shows me whether a certain protein is present or absent.

Do you know if it is possible? Thanks a lot!

blast alignment genome • 894 views
ADD COMMENTlink written 2.7 years ago by freekwilly0
1

You can show the presence of each straing by, doing a tab output with -outfmt 6, and provide the e-value threshold to the blast command, as wells as use -max_target_seqs 1 to show the best alignment for each strain. What you will have to do is, conclude how against many/which strains RP1 is absent. Someone may have a better approach.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by st.ph.n2.5k

In addition you will have to filter/restrict the results for bacteria (in theory, taxid:2).

ADD REPLYlink written 2.7 years ago by genomax80k

I am trying to obtain a similar output but using the rpsblast function in the command line. Do you know if there is a way to not only display just the best hit but also only "specific hits" as is given as a result in the web-based version?

ADD REPLYlink written 2.7 years ago by lauren.manck10

Hey, thanks for the fast answers. But unfortunately it is not yet what I was looking for. When using -max_target_seqs 1 I'll get only one hit in total, not one hit for every entry in my database.

ADD REPLYlink written 2.7 years ago by freekwilly0

not one hit for every entry in my database

Is the reference to nr or to your own sequence data? You can always flip the search around. It is not necessary that there will be a hit found for every sequence being searched in both of those cases.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by genomax80k

For now it still refers to the "nr" database. I'm am currently trying to subset it into a more precise and useful database

ADD REPLYlink written 2.7 years ago by freekwilly0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1220 users visited in the last hour