Question: Blast Database Size Influence On Number Of Significant Hits
0
gravatar for PoGibas
6.6 years ago by
PoGibas4.8k
Vilnius
PoGibas4.8k wrote:

I have a set of gene sequences and specific sequence.

cat genes.fa
>Gene_1_chr1_1000_1200  
ACGT...
>Gene_2_chr2_3000_3400
TTAT...  

cat sequence.fa
>Searchable_sequence
ACGG...

I want to search for this specific sequence in 1. gene sequences; 2. gene flanking sequences.

Gene flanking sequence = Gene coordinates +/- gene size -- Gene flanking sequence is gene locus plus/minus gene locus size (flanking sequence fasta file is two times bigger than original gene sequences fasta).

cat gene_flanks.fa
>Gene_1_Flank5_chr1_800_1000  
CAGT...
>Gene_1_Flank3_chr1_1200_1400  
AAGT...
>Gene_2_Flank5_chr2_2600_3000
TTAT...  
>Gene_2_Flank3_chr2_3400_3800
ACAT...  

Gene database size: 2 sequences - 600 nucleotides
Flanks database size: 4 sequences - 1200 nucleotides

I use BLAST for search:

blastn  -task blastn -db genes -query $Sequence -outfmt 6 -out - | wc -l   
6
blastn  -task blastn -db flanks -query $Sequence -outfmt 6 -out - | wc -l   
2

Number of hits between original gene set and flanks set differ. My questions are:
1. Am going to use number of hits for enrichment analysis - how accurate is it to compare number of hits between databases that have different size? (evalue depends of database size and I might be getting bias because of smaller/bigger database size).
2. I want to filter BLAST hits using evalue - can I use same evalue for databases that have different size?

blast • 3.1k views
ADD COMMENTlink modified 6.6 years ago by lelle820 • written 6.6 years ago by PoGibas4.8k
1
gravatar for lelle
6.6 years ago by
lelle820
Berlin
lelle820 wrote:

You might want to have a look at the math behind blast bitscores and e-values: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

In general I think what you see makes sens, because in a longer sequence the possibility to get a random hit is higher. So the same hit is not as significant in the bigger database. If you do not want to follow this reasoning you should probably work with the bit scores instead of the e-values.

ADD COMMENTlink written 6.6 years ago by lelle820

or switch db and query ;-)

ADD REPLYlink written 6.6 years ago by Manu Prestat4.0k
1

Query length should also be taken into acount for e-values if I am not mistaken. You can not outsmart Blast I'm afraid.

ADD REPLYlink written 6.6 years ago by lelle820
1

Query sequence length is taken into account indeed, but not the number of sequences included in query file. So the trick still works :-)

ADD REPLYlink written 6.6 years ago by Manu Prestat4.0k

That is true and also the reason why you should not forget to do multiple testing correction on your blast results... unless you are deliberately trying to ignore this issue.

ADD REPLYlink written 6.6 years ago by lelle820

What about merging both fasta files into one database -> single search -> splitting BLAST output for every different data-set?

ADD REPLYlink modified 6.6 years ago • written 6.6 years ago by PoGibas4.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1257 users visited in the last hour