Question: Statistics problem: Significance of BLAST e-values of two species
1
4.2 years ago by
Gene Mustermann10 wrote:

I did a reciprocal BLAST search to identify homologues genes (orthologs and paralogs) from 2 different species (A, B) in the genome of species C. What I'm trying to find out is whether e-values of homologs from species A are significantly higher then from species B, and whether A is more likely to have orthologes/paralogs in C.

Is that even possible or does it make any sense? How do I calculate that? Another problem is that I also have E-values of 0 (< 2.225074E-308 I believe) and I don't know how to deal with them in a statistical evaluation (obviously the best hits).

I am rather new to bioinformatics ... and please excuse my english.

Thanks

blast rbh homolog gene • 1.3k views
modified 4.2 years ago by a.zielezinski9.3k • written 4.2 years ago by Gene Mustermann10

Have you looked at this link?

Thanks, I assume I should use scores rather than evlaues?

6
4.2 years ago by
a.zielezinski9.3k
a.zielezinski9.3k wrote:

As Jean-Karim pointed, e-values are not comparable between BLAST searches of two different databases. Bitscore won't help you either, because this score also depends on the size of the database.

Solution 1 (more accurate):

Instead of playing with BLAST scores by yourself, just use the InParanoid software, which finds orthologs and paralogs between two FASTA files (A and C). InParanoid is based on Reciprocal Best BLAST Hit, but it does a lot of things for you. As a result, you get a list of all orthologs and paralogs and, most importantly, confidence value for each ortologous/paralogous pair. In this way, you can apply t-test (if values are normally distributed) or Mann-Whitney test (if they are not).

Solution 2 (primitive, questionable):

If you don't want to use InParanoid, just simple RBH then you need to tell BLAST that the search space is equal among BLAST of the same query to multiple databases. Then, both, E-values and bitscores will be comparable among all results because the search space will be consistent. You do this by simply using `-dbsize` flag when running BLAST locally. In my opinion, you can use the average of the sequences in A, B and C. For example, if A has 4000 sequences, B has 3000 and C has 5000 sequences, I would set -dbsize flag to 4000.

For example:

``````blastp -query A.fasta -db C.fasta -dbsize 4000
``````

Yes, I know inparanoid but my professor wanted me to use the RBH method "by hand".

My DB size is for all BLASTs of the forward search the same though (size of genome C). I guess I could use bit scores then?

I edited my answer. I hope this is what you need.

Solution 2 also doesn't work in my case. I've got ~100 genes on one side (small fasta file; A, B) and a genome (large fasta; C) on the other. So averaging is probably not a good idea.

So I now took the bitscores of the forward search (equal DB size anyway); Welch test (R) results in Ha: unequal means of these bitscores. So apparently one species had better results, which makes sense since its closer related.

1

I understand. You did the right thing taking bitscores only from forward BLAST search. RBH approach doesn't work with partial sets of sequences (e.g. A and B in your case).

Why not? I know about the high false negative rate, is there something else I missed?

2
4.2 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche23k wrote:

E-values are generally not comparable because they depend on the database size and the sequence length. Just like p-values, e-values are only an indication of statistical significance and do not imply any strength of relationship between the two sequences. If you need to use a BLAST statistics for comparing alignments then use the bit score.