Too many blast hits.
2
0
4.7 years ago

When I run standalone BLAST with my genome against the phage database, I get an output file with 5000 hits. I filtered the Excel sheet to keep hits with an E-value lower than 0.01, query cover >= 70%, and % identity >= 30. After all this, my output is reduced to 289 hits. I don't think my genome has 289 phage genes. How do I go further from here? Please help me.

blastall phage • 1.6k views
2

Which BLAST program did you use? Just a normal blastn? You seem to be under the impression that each BLAST hit corresponds to a gene match. This isn't necessarily the case. What you actually have is 289 aligned sequences with an identity >= 30% (which is really low if you ran a nucleotide BLAST, FYI) and a query cover >= 70%. The coverage cutoff might be fair, but if I were looking for genes I'd probably go higher. An E-value of 0.01 is still pretty high as well, I'd say.

Additionally, you may have duplicate matches if a given genome is in NCBI multiple times (which is often the case), so you may also want to filter the results down to unique hits.

It's not very clear what you're actually trying to achieve, so maybe you could expand your question further.
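
To make the filtering and de-duplication suggested above concrete, here is a minimal Python sketch. It assumes the hits were written as tab-separated text with the columns `qseqid sseqid pident evalue bitscore qcovs` (that layout, the stricter E-value default, and the example rows are all assumptions for illustration, not the OP's actual data).

```python
import csv
import io

# Assumed column layout of the tab-separated BLAST output.
COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "bitscore", "qcovs"]


def filter_hits(handle, max_evalue=1e-5, min_pident=30.0, min_qcov=70.0):
    """Yield hit rows that pass all three thresholds."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if (float(row["evalue"]) <= max_evalue
                and float(row["pident"]) >= min_pident
                and float(row["qcovs"]) >= min_qcov):
            yield row


def unique_pairs(hits):
    """Keep only the highest-bitscore row for each (query, subject) pair."""
    best = {}
    for hit in hits:
        key = (hit["qseqid"], hit["sseqid"])
        if key not in best or float(hit["bitscore"]) > float(best[key]["bitscore"]):
            best[key] = hit
    return list(best.values())


# Made-up rows standing in for a real output file.
example = io.StringIO(
    "protA\tphage1_cap\t45.0\t1e-30\t120\t90\n"
    "protA\tphage1_cap\t40.0\t1e-10\t60\t75\n"   # second alignment, same pair
    "protB\tphage2_tail\t25.0\t1e-3\t35\t80\n"   # fails identity and E-value
)
kept = unique_pairs(filter_hits(example))
print(len(kept), kept[0]["qseqid"], kept[0]["bitscore"])
```

With a real file you would pass `open("hits.tsv")` instead of the `io.StringIO` example; the thresholds are only defaults to adjust.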

0

A few hints:

• You really need to reveal what organism you are dealing with, at least the domain.
• You are jumping to conclusions. You seem convinced that there are "too many" hits, and your question then aims at rectifying the observation a posteriori to fit subjective assumptions. That can push an answer in a certain direction. Instead, try to explain the observation with a hypothesis derived from it, e.g.:
  • It would be no surprise for bacteria to contain phage-like genes, as bacteria are hosts of phages.
  • There are many phage genomes in the database, so a single conserved protein can generate many hits.
  • My genome sequence is 'contaminated' with phage sequences, e.g. phiX, which is often used as an internal quality control.
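
The second hypothesis is easy to check: count how many distinct query proteins the hit rows come from. A minimal Python sketch, using made-up (query, subject) pairs as a stand-in for the real hit table:

```python
from collections import Counter


def hits_per_query(rows):
    """Count how many hit rows each query protein produced."""
    return Counter(query for query, _subject in rows)


# Hypothetical (query, subject) pairs; real ones would come from the hit table.
rows = [
    ("protA", "phage1_capsid"),
    ("protA", "phage2_capsid"),
    ("protA", "phage3_capsid"),
    ("protB", "phage1_integrase"),
]
counts = hits_per_query(rows)
print(len(rows), "hit rows from", len(counts), "query proteins")
print(counts.most_common(1))  # the query responsible for the most rows
```

If the number of distinct queries is much smaller than the number of rows, the hit count mostly reflects database redundancy rather than the number of phage-like genes.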
0
4.7 years ago
fishgolden ▴ 450

What you are doing (and what BLAST mainly does) is searching for homologous genes. As I understand it, what you want to do is gene prediction. I recommend searching for terms such as "gene prediction".

If you want to do something with your current result: I think many genes are mapped to the same region of your genome, so filter out such overlapping hits.
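
One way to spot overlapping hits is to sort them by start position and compare each hit with the ones before it. The following is only a sketch in Python: it assumes each hit is a `(name, start, end)` tuple on the same sequence with `start <= end`, and the example coordinates are invented.

```python
def find_overlaps(hits):
    """Return (earlier, later) name pairs for hits whose ranges overlap.

    Each hit is compared with the previously seen hit that reaches furthest
    to the right, so a long hit spanning several short ones is still caught.
    """
    ordered = sorted(hits, key=lambda h: h[1])
    overlaps = []
    furthest = None  # hit with the largest end position seen so far
    for hit in ordered:
        # BLAST coordinates are inclusive, so a difference of 0 still means
        # the two hits share one position.
        if furthest is not None and hit[1] - furthest[2] <= 0:
            overlaps.append((furthest[0], hit[0]))
        if furthest is None or hit[2] > furthest[2]:
            furthest = hit
    return overlaps


# Invented coordinates for illustration.
hits = [("h3", 1200, 1500), ("h1", 100, 500), ("h2", 450, 900)]
print(find_overlaps(hits))
```

For hits on several sequences, group them by sequence first and run the check within each group.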

0

OK, so how do I sort it now? Should I delete the duplicated subject IDs?

0

"Duplicated subject id"? Does that mean you searched against a database of phage genomes rather than phage mRNAs, proteins, or genes? Since I don't know what kind of data you have, I cannot give you the best answer. However, the BLAST output has the start and end position of each hit on your genome, and you have an Excel sheet, so sort your hits by position. Then you can check for overlaps: subtract the end position of the previous hit from the start position of the next hit; if the value is negative, the two hits overlap.

0

Sorry for not mentioning this earlier. I ran a protein BLAST: I blasted my proteome against the phage protein sequences.

0

Sorry, then I have misunderstood your purpose & I cannot figure out what you want to do...

0

Sorry for not being clear. I want to check which phage genes are present in my genome. The phage database contains protein sequences, so I ran blastp with the proteome of my genome against the phage database, and I get too many hits. Right now I have around 289 hits, even after filtering by E-value, percent identity, and query coverage. How do I narrow it down further?

0

First, I think there are some proteins (fewer than 100, I think) in "the proteome of my genome" (right?). What are the "289 hits"? Does it mean you merged all of the results? Perform blastp with one protein sequence, and the result tells you what kind of protein it is. For example, if you have 90 proteins in "the proteome of my genome", you should run blastp 90 times and you will get 90 results. Check the results one by one. One protein may have hits with descriptions like "capsid protein", so that protein may be a capsid protein. Another protein may have hits with "protease", so that protein may be.... Do it 90 times. We cannot avoid this step.
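
If the results for all proteins are already in one merged table, the same per-protein check can be done by grouping rows by query and looking at the best hit's description. A rough Python sketch, assuming each row is `(query, subject description, bit score)`; the rows below are invented:

```python
def best_hit_per_query(rows):
    """Map each query protein to the (description, bitscore) of its top hit."""
    best = {}
    for query, title, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (title, bitscore)
    return best


# Invented rows; real descriptions would come from the BLAST subject titles.
rows = [
    ("protA", "major capsid protein [hypothetical phage 1]", 210.0),
    ("protA", "capsid protein [hypothetical phage 2]", 180.0),
    ("protB", "protease [hypothetical phage 3]", 95.0),
]
best = best_hit_per_query(rows)
for query, (title, bitscore) in best.items():
    print(query, title, bitscore, sep="\t")
```

The top description is only a hint at function; the remaining hits for each protein are still worth a look before drawing conclusions.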

0

If you have sequenced the genome of a phage and want to know what genes it has, you need to annotate that genome. BLAST is not the tool for this. You need something like RAST or Prokka.

0
4.7 years ago
theobroma22 ★ 1.2k

From the command line you can type blastp -help. This will show you all of the available options. So, if you just want the top five hits, you can use the option -max_target_seqs 5 in the blastp command line. If you just want the top hit, replace the 5 with a 1. Is this what you were looking for?
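
As a sketch of how that option fits into a full command, the Python snippet below only assembles the argument list and prints it; it does not run BLAST. The file names, database name, E-value cutoff, and output columns are placeholders I have assumed, not values from the question.

```python
import shlex


def build_blastp_command(query, db, out, max_target_seqs=5, evalue=1e-5):
    """Assemble a blastp argument list with tabular output."""
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_target_seqs),
        # Tabular format with an explicit (assumed) choice of columns.
        "-outfmt", "6 qseqid sseqid pident evalue bitscore qcovs stitle",
        "-out", out,
    ]


# Placeholder file and database names.
cmd = build_blastp_command("my_proteome.faa", "phage_proteins", "hits.tsv")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed.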

0

This is also my understanding of what the OP is looking for, and what I would suggest as well.