I am attempting to use BLAST to identify UniGene ESTs that align with mapped SSR and SNP marker sequences (i.e. their sequences and positions are known), thereby allowing me to infer the position of the EST. The protocol I currently have for doing so is not very detailed, so I have been using marker and EST datasets that my PI had used in a previous study in an effort to replicate the results of that study (specifically, the number of ESTs that mapped to each of the chromosomes in our study organism) so that I know what cutoffs and parameters to apply to future BLAST runs. As far as I know, I am in possession of the files used in the initial study. The protocol proceeds as such (I have omitted unnecessary details):
- BLAST EST sequences against the mapped marker sequences (blastn, evalue cutoff e-5). For SSRs:
cutoff at evalue > e-20
cutoff at alignment length < 76 bp
cutoff when [(% identity < 85%) & (bit score < 150) & (e-value > e-30)]
cutoff at % identity <97
- Plug in marker map location (based on the given map).
My BLAST runs have been input accordingly:
blastn -task blastn -query EST_seq.fas -db marker_sequences -evalue 1e-5 -outfmt 6 -out EST_markers.txt
Although not indicated by the protocol, I had it explained to me that the reciprocal BLAST should be run and hits from each should be cross-referenced (i.e. if an EST-marker hit occurred on the EST-to-marker BLAST run and the marker-to-EST run, then it would be kept--this seems sound enough). The accompanying BLAST run looks like this:
blastn -task blastn -query marker_seq.fas -db EST_sequences -evalue 1e-5 -outfmt 6 -out markers_EST.txt
Sequence databases were created using the following:
makeblastdb –dbtype nucl –in marker_seq.fas –out marker_sequences –max_file_sz 3GB makeblastdb –dbtype nucl –in EST_seq.fas –out EST_sequences –max_file_sz 3GB
The application of the additional cutoffs and the calculation of ESTs per chromosome has been via Excel (not optimal, of course) Duplicate hits are often removed before per-chromosome calculations are made, and markers that have been mapped to multiple locations (even on different locations within the same chromosome) are accounted for when assigning ESTs to chromosomes. I have been unable to replicate the per-chromosome EST totals found in the study (I am usually obtaining genome-level surpluses compared to the EST counts of the earlier study). Given what I've described so far, I'm wondering if a glaring error is obvious to some of the more experienced folks out there. There are some problem areas that I think might potentially troublesome:
- The protocol mentions the use of blastn (obviously), but it does not go on to specify which task to use. For the longest time, I had not been specifying the task (I was unaware I could/had to). Apparently the default task of blastn is megablast--this brought me overall genomic surpluses, with chromosomes demonstrating surpluses and deficits. I then tried specifying the task as blastn (as written in the above code), and I have gotten rid of the problem of deficits; surpluses now exist for each chromosome. The surpluses are not extreme and the totals are proportionally similar to the values obtained in the original study--they are just too high. I am not sure whether megablast or blastn was the intended task.
- I am not sure whether I should restrict my searches to the plus strand only.
- I am not sure whether I should remove duplicate EST-marker hits. I am only questioning this because I had tried restricting a BLAST search to the plus strand and obtained a total of 1363 ESTs acrossthe genome before duplicate removal--the original study obtained 1362. However, some chromosomes had deficits and others (necessarily) had surpluses compared to the original study's findings. It seems too good to be true, and the deficits and surpluses may be due to some mismatching when cross-referencing the marker map that I'll have to resolve...
- I am not sure how commonplace reciprocal verification (applying BLAST both ways) is, and the step is not listed in the protocol I am working with. I'm wondering if it would be an implied step.
- The third SSR cutoff seems to be oddly specific and I am not sure if this cutoff was adhered to when final calculations were made.
I apologize for the long question, but I am very troubled by this problem--which largely originates from the protocol being nonspecific.
Update: I have downloaded a legacy version of BLAST, using blastall and specifying the blastn program, and while fewer hits were obtained, there is still of surplus of ESTs across the genome compared to the results of the earlier study, and some chromosomes possess EST deficits compared to the original, as well as accompanying large per chromosome surpluses.