Hi, any advice is appreciated!
I am trying to write a Perl program which uses Bioperl to parse a tblastn report of gene protein query sequences against a nucleotide genome sequence. I am using a sequence length of 80% and a fraction identical measure of 80% relative to the query as cut-offs. The program reports '1' if present, '0' if absent next to gene name in the output. Also I used tiling methods in the program as the nucleotide sequences are made up of draft contig sequences.
However, when I do a positive control and do a tblastn of the protein sequences against the same genome sequence that the proteins have originated from, many genes are reported as absent ("gene" 0 is the ouput). Any help as to this problem?
Here is the program:
#!/usr/bin/perl
use strict;
use Bio::SearchIO;
use Bio::Search::Tiling::MapTiling;
my $in = new Bio::SearchIO(-format => 'blast',
-file => $ARGV[0]);
while( my $result = $in->next_result ) {
## $result is a Bio::Search::Result::ResultI compliant object
while( my $hit = $result->next_hit ) {
## $hit is a Bio::Search::Hit::HitI compliant object
my $tiling =
Bio::Search::Tiling::MapTiling->new($hit);
if( $tiling->frac_identical('query') >= 0.8 && (100 / $result->query_length) * $tiling->length('query') >= 80) {
print $result->query_name, " 1", "\n"; }
else {
print $result->query_name, " 0", "\n";}
}
}
Sample output of positive control run (all genes should have a 1 next to them in the positive control):
yaaF 0
UTI89_C0033 0
dapB 1
UTI89_C0035 1
carA 1
carB 1
UTI89_C0038 1
UTI89_C0038 0
UTI89_C0039 0