All vs All blast not self hit? Orthogroup clustering and single copy genome?
1
0
Entering edit mode
2.0 years ago
Charly • 0

Hey guys

  1. Self hit I have this actually a bit weird question about blast.

I’ve been doing some work around single copy genome construction using Reciprocal best blast hit (RBBH) method.

As I have something like 100+ annotated genome, I concatenated all annotated CDS into one fasta and makeblastdb with this fasta. Subsequently I blasted this fasta against the db I just made which resulted in a tabular out put with first column: qseqid, second column: sseqid.

With some python scripts I always (I did all-v-all blast multiple times) found that some qseids and sseqids are only present in one of this 2 columns which means that such ids do not have a “self-hit” in the out put of all-vs-all blast.

This is a bit annoying when later I am trying to construct orthologous gene cluster by “reciprocal best” method. So, I really wonder why this happens.

Blast version: 2.12.0+

  1. Orthogroup clustering and single copy genome The way I did to obtain ortho-group is to fist obtain ortho-pair. e.g., (A, B), (A, C), (B, C),(D, F)…. By using single linkage clustering algorithm, we can get 2 clusters, (A, B, C) and (D, F). However, when I am trying to get single copy core genome, like this: e.g, if we have 3 genomes

normally:

cluster1: strain1-gene1, strain2-gene1, strain3-gene1

cluster2: strain1-gene2, strain2-gene2, strain3-gene2

… Therefore we can construct core genome:

Strain1: strain1-gene1+ strain1-gene2…\

Strain2: strain2-gene1+ strain2-gene2…\

Strain3: strain3-gene1+ strain3-gene2…\

However, sometimes, by single linkage clustering, in one cluster of gene, one strain may have more than one due to linkage extension:

e.g., we have following ortho-pairs:

(strain1-gene111, strain2-gene111)\

(strain2-gene111, strain3-gene111)\

(strain1-gene111, strain3-gene222)\

So you have a cluster: strain1-gene111, strain2-gene111, strain3-gene111, strain3-gene222

In this case, when constructing the the core genome:

Strain1: strain1-gene1+ strain1-gene2…strain1-gene111\

Strain2: strain2-gene1+ strain2-gene2…strain2-gene111\

Strain3: strain3-gene1+ strain3-gene2…strain3-gene111 + strain3-gene222\

OR

Strain1: strain1-gene1+ strain1-gene2…strain1-gene111\

Strain2: strain2-gene1+ strain2-gene2…strain2-gene111\

Strain3: strain3-gene1+ strain3-gene2…strain3-gene222 + strain3-gene111\

And these 2 different concatenation order will put unknown effect when later alignment is performed and when phylogenetic relationship is compared.

So I really would like to consult about where or how should I stop the single linkage extension when such circumstances are met.

Beste Rongyin

orthogroup blast clustering single reciprocal • 992 views
ADD COMMENT
2
Entering edit mode
2.0 years ago

Blast is not a reciprocal analysis by default, which means you can't expect that each query will also have the same hit when the analysis is 'reversed'.

Main reason why this happens is usually the filtering/masking of input sequences. You can increase the number of reciprocal hits if you deactivate the (low-complexity) filtering in the blast command.

However keep in mind that this might also increase the number of false positive hits.

ADD COMMENT
0
Entering edit mode

Hi Thanks a lot.

So simply speaking, the blast algorithm will not return a self hit sometimes, may I ask what's the reason? randome process? or masking?

Also, do you have any suggestion of other software or algorithm for reciprocal analysis?

Beste

Rongyin

ADD REPLY
0
Entering edit mode

as said before ;) reason: masking

an alternative not directly, the best you can do is turn masking/filtering off and see what you get. Since you're most likely mainly interested in the best/top/... hits this will likely not affect the overall result too much.

ADD REPLY
0
Entering edit mode

Thanks a lot!

ADD REPLY

Login before adding your answer.

Traffic: 2564 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6