Question

Perl: How to extract gene entries, CDS entries and gene sequences within a specified range from GenBank files?

2.1 years ago

Fungi-Beware! ▴ 10

I'm a complete beginner at Perl and don't have the slightest idea how to approach this task. I feel completely defeated.

The script I'm supposed to make has to:

Work with any GenBank file provided (.dat) and save the output in a new .txt file.
Search a given range of nucleotides in a GenBank file (given as $ARGV[0], for example 100000-12000) and identify any gene entries that fit inside it.
Select the CDS entry or entries (from FEATURES) corresponding to the identified genes and extract them into a .txt file.

and

Select the nucleotide sequence corresponding to the identified gene entries (from ORIGIN). The nucleotide sequence has to be extracted alongside the gene and CDS data, written as a continuous, uninterrupted string of characters (5'-3') and included in the same .txt file. Here, interpreting the GenBank location operators (join, complement, order) is essential for proper concatenation.
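As a rough illustration of that last requirement, here is a minimal module-free sketch of how the ORIGIN block might be flattened into one string and how a location with operators might be turned into a 5'-3' sequence. The sub names (origin_to_string, extract_location, split_top_level, revcomp) are hypothetical, and treating order() like join() is an assumption made only for concatenation. Remote references (accession:start..end) and the ^ operator are not handled.

```perl
use strict;
use warnings;

# Flatten the ORIGIN block into one continuous string
# (position numbers and spaces are stripped).
sub origin_to_string {
    my (@lines) = @_;
    my ($in_origin, $seq) = (0, '');
    for my $line (@lines) {
        if ($line =~ /^ORIGIN/) { $in_origin = 1; next }
        next unless $in_origin;
        last if $line =~ m{^//};      # end-of-record marker
        $line =~ s/[^a-zA-Z]//g;      # keep only the bases
        $seq .= $line;
    }
    return $seq;
}

# Reverse-complement a DNA string.
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/acgtACGT/tgcaTGCA/;
    return $rc;
}

# Split "a,b(c,d),e" on commas that are not inside parentheses.
sub split_top_level {
    my ($text) = @_;
    my ($depth, $current, @parts) = (0, '');
    for my $char (split //, $text) {
        if    ($char eq '(') { $depth++ }
        elsif ($char eq ')') { $depth-- }
        if ($char eq ',' && $depth == 0) {
            push @parts, $current;
            $current = '';
        } else {
            $current .= $char;
        }
    }
    push @parts, $current if length $current;
    return @parts;
}

# Recursively interpret a location string (whitespace already removed)
# against the full sequence and return the 5'-3' result.
sub extract_location {
    my ($loc, $seq) = @_;
    if ($loc =~ /^complement\((.*)\)$/) {
        return revcomp(extract_location($1, $seq));
    }
    if ($loc =~ /^(?:join|order)\((.*)\)$/) {   # assumption: order() concatenated like join()
        return join '', map { extract_location($_, $seq) } split_top_level($1);
    }
    if ($loc =~ /^<?(\d+)\.\.>?(\d+)$/) {        # simple start..end (1-based, inclusive)
        return substr($seq, $1 - 1, $2 - $1 + 1);
    }
    if ($loc =~ /^(\d+)$/) {                     # single base
        return substr($seq, $1 - 1, 1);
    }
    die "Unsupported location: $loc\n";
}

# Small demonstration on a made-up 12-base record.
my $demo_seq = origin_to_string("ORIGIN\n", "        1 aaccggttac gt\n", "//\n");
print extract_location('join(1..4,9..12)', $demo_seq), "\n";
print extract_location('complement(join(1..4,9..12))', $demo_seq), "\n";
```

The recursion mirrors the nesting of the operators: complement() reverse-complements whatever its argument yields, and join() concatenates its parts in the order given.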

The script is supposed to be run locally. No modules.

I'm currently stuck on how to search for and select the "gene" entries that fit inside a specified range. My idea was to compare the specified range directly with the range given alongside "gene" in the GenBank file. I guess this could work, but I can't even imagine how it could be applied to "gene" entries that use GenBank operators. For starters, I'd be satisfied with just figuring out how to apply it to the "simple" entries.

# Usage: perl script.pl x-y (e.g. 1000-2000)
use strict;
use warnings;

my @gene = ();
my @data = ();

unless (@ARGV) {
    die "You haven't specified a command line argument!\n";
}
my $range = $ARGV[0];
# The range must be two integers separated by a hyphen, e.g. 1000-2000.
if ($range !~ /^\d+-\d+$/) {
    die "Specified range value isn't in the right format!\n";
}

print "Input the name of your GenBank file (.dat):\n";
my $genbank = <STDIN>;
chomp($genbank);
# Three-argument open with a lexical filehandle; note the $ on $genbank.
open(my $fh, '<', $genbank) or die "The GenBank file $genbank can't be opened: $!\n";
@data = <$fh>;
close $fh;

my ($min, $max) = split /-/, $range;

for my $line (0 .. $#data) {
    # Only "simple" entries of the form "gene  start..end" are matched here.
    # $1 = the start value of the entry's range, $2 = the end value.
    if ($data[$line] =~ /^\s+gene\s+(\d+)\.\.(\d+)/) {
        my ($start_range, $end_range) = ($1, $2);
        # The gene fits inside the requested range if both of its ends lie within it.
        if ($start_range >= $min && $end_range <= $max) {
            # The gene entry fits inside the specified range; handle it here.
        }
    }
}

...
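For the entries that use operators, one possible way to keep the same comparison idea is to reduce each location to its lowest and highest coordinate and test that span against the requested range. The sketch below assumes the standard feature-table layout (feature key at column 6, location and continuation lines at column 22). The sub names gene_locations, location_span and fits_inside are hypothetical, and remote references such as accession:start..end would need extra handling because their accession digits would be picked up as coordinates.

```perl
use strict;
use warnings;

# Collect the location of every "gene" feature, gluing together
# locations that wrap onto continuation lines.
sub gene_locations {
    my (@lines) = @_;
    my @locs;
    my $collecting = 0;
    for my $line (@lines) {
        if ($line =~ /^ {5}gene\s+(\S+)/) {
            push @locs, $1;
            $collecting = 1;
        } elsif ($collecting && $line =~ /^ {21}([^\/\s]\S*)/) {
            $locs[-1] .= $1;          # continuation of the location
        } else {
            $collecting = 0;          # a qualifier or another feature starts
        }
    }
    return @locs;
}

# Lowest and highest coordinate mentioned anywhere in a location string.
sub location_span {
    my ($loc) = @_;
    my @coords = sort { $a <=> $b } ($loc =~ /(\d+)/g);
    return unless @coords;
    return ($coords[0], $coords[-1]);
}

# True if the whole span of the location lies within $min..$max.
sub fits_inside {
    my ($loc, $min, $max) = @_;
    my ($start, $end) = location_span($loc);
    return 0 unless defined $start;
    return ($start >= $min && $end <= $max) ? 1 : 0;
}

# Made-up feature lines, for illustration only.
my @demo = (
    ' ' x 5 . 'gene' . ' ' x 12 . "1000..2000\n",
    ' ' x 21 . "/gene=\"abc\"\n",
    ' ' x 5 . 'gene' . ' ' x 12 . "complement(join(1500..1600,\n",
    ' ' x 21 . "1700..2500))\n",
    ' ' x 21 . "/gene=\"def\"\n",
);
for my $loc (gene_locations(@demo)) {
    my $verdict = fits_inside($loc, 900, 2100) ? 'inside' : 'outside';
    print "$loc is $verdict 900-2100\n";
}
```

Because only the outermost coordinates matter for the "fits inside" test, the operators themselves can be ignored at this stage and interpreted later, when the sequence is actually extracted.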

I would really appreciate everyone's opinion!

fasta sequence GenBank perl • 799 views
