I have a set of gene sequences and a specific query sequence.
    cat genes.fa
    >Gene_1_chr1_1000_1200
    ACGT...
    >Gene_2_chr2_3000_3400
    TTAT...

    cat sequence.fa
    >Searchable_sequence
    ACGG...
I want to search for this specific sequence in (1) the gene sequences and (2) the gene flanking sequences.
Gene flanking sequence = gene coordinates +/- gene size. That is, each gene has two flanks, each as long as the gene itself, immediately upstream and downstream of the gene locus, so the flanking sequence FASTA file is twice as big as the original gene sequence FASTA file.
    cat gene_flanks.fa
    >Gene_1_Flank5_chr1_800_1000
    CAGT...
    >Gene_1_Flank3_chr1_1200_1400
    AAGT...
    >Gene_2_Flank5_chr2_2600_3000
    TTAT...
    >Gene_2_Flank3_chr2_3400_3800
    ACAT...

Gene database size: 2 sequences - 600 nucleotides
Flanks database size: 4 sequences - 1200 nucleotides
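To make the flank definition (gene coordinates +/- gene size) concrete, here is a minimal awk sketch that derives the flank headers from the gene headers. It assumes every header ends in `_chrom_start_end`, as in the genes.fa example; `flank_headers` is a made-up helper name, and the sketch does not handle a 5' flank that would run past the start of the chromosome.

```shell
# Print the 5' and 3' flank headers for every gene header in a FASTA file.
# Assumes headers look like >Name_chrom_start_end (hypothetical helper).
flank_headers() {
  awk -F'_' '/^>/ {
    chrom = $(NF-2); start = $(NF-1); stop = $NF
    size = stop - start                      # gene length
    name = substr($0, 2)                     # drop the leading ">"
    sub("_" chrom "_" start "_" stop "$", "", name)   # keep only the gene name
    printf ">%s_Flank5_%s_%d_%d\n", name, chrom, start - size, start
    printf ">%s_Flank3_%s_%d_%d\n", name, chrom, stop, stop + size
  }' "$1"
}
```

Running `flank_headers genes.fa` on the two example genes should give the four headers shown in gene_flanks.fa.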
I use BLAST for the search:
    blastn -task blastn -db genes -query $Sequence -outfmt 6 -out - | wc -l
    6
    blastn -task blastn -db flanks -query $Sequence -outfmt 6 -out - | wc -l
    2
The number of hits differs between the original gene set and the flanks set. My questions are:
1. I am going to use the number of hits for enrichment analysis. How accurate is it to compare the number of hits between databases of different sizes? (The evalue depends on database size, so I might be getting a bias because of the smaller/bigger database.)
2. I want to filter BLAST hits using evalue - can I use the same evalue threshold for databases of different sizes?
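To illustrate the size dependence I am worried about: if I assume the evalue is simply proportional to the total database length (ignoring BLAST's length corrections, which is an approximation on my part), the evalue column of the tabular output could be rescaled to a common reference size before filtering. `rescale_evalue` below is a made-up helper, not part of BLAST.

```shell
# Rescale the evalue column (column 11 of -outfmt 6) as if the hits came from
# a database of a different total length.
# Assumption: evalue scales linearly with database length.
# Usage: ... | rescale_evalue ACTUAL_DB_LENGTH REFERENCE_DB_LENGTH
rescale_evalue() {
  awk -F'\t' -v OFS='\t' -v from="$1" -v to="$2" '{
    $11 = sprintf("%.2e", $11 * to / from)   # overwrite evalue with rescaled value
    print
  }'
}
```

For example, hits against the 600-nucleotide gene database could be put on the scale of the 1200-nucleotide flanks database with `blastn -task blastn -db genes -query $Sequence -outfmt 6 -out - | rescale_evalue 600 1200`. As far as I understand, blastn also has a `-dbsize` option that sets the effective database length used for the statistics, which might achieve the same thing at search time.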