Statistics problem: Significance of BLAST e-values of two species
2
1
Entering edit mode
5.3 years ago

I did a reciprocal BLAST search to identify homologues genes (orthologs and paralogs) from 2 different species (A, B) in the genome of species C. What I'm trying to find out is whether e-values of homologs from species A are significantly higher then from species B, and whether A is more likely to have orthologes/paralogs in C.

Is that even possible or does it make any sense? How do I calculate that? Another problem is that I also have E-values of 0 (< 2.225074E-308 I believe) and I don't know how to deal with them in a statistical evaluation (obviously the best hits).

I am rather new to bioinformatics ... and please excuse my english.

Thanks

blast gene rbh homolog • 1.6k views
0
Entering edit mode

Have you looked at this link?

0
Entering edit mode

Thanks, I assume I should use scores rather than evlaues?

6
Entering edit mode
5.3 years ago

As Jean-Karim pointed, e-values are not comparable between BLAST searches of two different databases. Bitscore won't help you either, because this score also depends on the size of the database.

Solution 1 (more accurate):

Instead of playing with BLAST scores by yourself, just use the InParanoid software, which finds orthologs and paralogs between two FASTA files (A and C). InParanoid is based on Reciprocal Best BLAST Hit, but it does a lot of things for you. As a result, you get a list of all orthologs and paralogs and, most importantly, confidence value for each ortologous/paralogous pair. In this way, you can apply t-test (if values are normally distributed) or Mann-Whitney test (if they are not).

Solution 2 (primitive, questionable):

If you don't want to use InParanoid, just simple RBH then you need to tell BLAST that the search space is equal among BLAST of the same query to multiple databases. Then, both, E-values and bitscores will be comparable among all results because the search space will be consistent. You do this by simply using -dbsize flag when running BLAST locally. In my opinion, you can use the average of the sequences in A, B and C. For example, if A has 4000 sequences, B has 3000 and C has 5000 sequences, I would set -dbsize flag to 4000.

For example:

blastp -query A.fasta -db C.fasta -dbsize 4000

0
Entering edit mode

Yes, I know inparanoid but my professor wanted me to use the RBH method "by hand".

My DB size is for all BLASTs of the forward search the same though (size of genome C). I guess I could use bit scores then?

0
Entering edit mode

I edited my answer. I hope this is what you need.

0
Entering edit mode

Solution 2 also doesn't work in my case. I've got ~100 genes on one side (small fasta file; A, B) and a genome (large fasta; C) on the other. So averaging is probably not a good idea.

So I now took the bitscores of the forward search (equal DB size anyway); Welch test (R) results in Ha: unequal means of these bitscores. So apparently one species had better results, which makes sense since its closer related.

1
Entering edit mode

I understand. You did the right thing taking bitscores only from forward BLAST search. RBH approach doesn't work with partial sets of sequences (e.g. A and B in your case).

0
Entering edit mode

Why not? I know about the high false negative rate, is there something else I missed?

2
Entering edit mode
5.3 years ago

E-values are generally not comparable because they depend on the database size and the sequence length. Just like p-values, e-values are only an indication of statistical significance and do not imply any strength of relationship between the two sequences. If you need to use a BLAST statistics for comparing alignments then use the bit score.

0
Entering edit mode

OK. As far as I understand these bit scores are normalized. I assume that means they are normally distributed so an unpaired t-test would be the right choice?