I would really appreciate any help you can provide with the following coding problem (using Perl). My input is a multiple sequence alignment (FASTA format), along with a file containing the co-ordinates of all open reading frames (ORFs) and the exons that make them up. I want to slice out the ORFs according to those co-ordinates. So far I have code that works well for individual single-exon ORFs (and I have no problem with the reverse complementing etc.), but the problem lies in extracting and concatenating multi-exon ORFs.
Thus far my code reads in the multiple sequence alignment as follows:

    use Bio::SimpleAlign;
    use Bio::AlignIO;
    $str = Bio::AlignIO->new(-file => $inputfilename, -format => 'fasta');
    $aln = $str->next_aln();

and deals with the splicing as follows:

    $mini = $aln->slice($array, $array);
    $out = Bio::AlignIO->new(-file => $array, -format => 'fasta');
    $out->write_aln($mini);
An example of the input file containing co-ordinates looks like this:

    Start  Stop  Strand  Name  Note
    24     89    +       ORF1  exon1
    165    560   -       ORF2  exon1
    680    1004  +       ORF3  exon1
    1240   1760  +       ORF3  exon2
    1790   2360  +       ORF3  exon3
    2600   2900  -       ORF4  exon1
    2850   3100  +       ORF5  exon1
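To make the multi-exon case concrete, here is a minimal sketch of how the rows of a co-ordinate file like the one above might be grouped by ORF name, so that all exons of ORF3 end up together. The helper name group_exons and the hash layout are hypothetical, and the sketch assumes whitespace-separated columns with ORF names that contain no spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: group exon rows by ORF name.
# Assumes whitespace-separated columns: Start Stop Strand Name Note.
sub group_exons {
    my ($text) = @_;
    my %orfs;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:Start\b|$)/;    # skip header and blank lines
        my ($start, $stop, $strand, $name, $note) = split ' ', $line;
        push @{ $orfs{$name} },
            { start => $start, stop => $stop, strand => $strand, note => $note };
    }
    # Keep the exons of each ORF in ascending co-ordinate order.
    # (For minus-strand ORFs the order may need reversing after
    # reverse complementing; that is not handled here.)
    for my $name (keys %orfs) {
        @{ $orfs{$name} } = sort { $a->{start} <=> $b->{start} } @{ $orfs{$name} };
    }
    return \%orfs;
}

my $coords = <<'END';
Start Stop Strand Name Note
680 1004 + ORF3 exon1
1240 1760 + ORF3 exon2
1790 2360 + ORF3 exon3
2600 2900 - ORF4 exon1
END

my $orfs = group_exons($coords);
for my $name (sort keys %$orfs) {
    my @ranges = map { "$_->{start}-$_->{stop}" } @{ $orfs->{$name} };
    print "$name: @ranges\n";
}
```

Each ORF then maps to an ordered list of exon ranges, which could be looped over to slice the alignment once per exon.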
Would anyone know of a clever way to extract the individual exons for ORF3 and then concatenate them (side by side, to ensure that the multiple sequence alignment is not compromised)? My initial thought was to change the co-ordinate file to a GFF-type file and use Bio::Tools::GFF, but I don't think this is compatible with a multiple sequence alignment as the input.
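To illustrate what "side by side" means here, the following is a rough sketch in plain Perl, without BioPerl, that cuts the same column ranges out of every aligned sequence and joins the pieces in order, so that gaps and column positions stay consistent across sequences. The helper splice_columns, the toy alignment and the co-ordinates are all made up for illustration; co-ordinates are assumed to be 1-based and inclusive. With Bio::SimpleAlign objects the equivalent would presumably be one slice call per exon followed by a concatenation of the resulting sub-alignments.

```perl
use strict;
use warnings;

# Hypothetical helper: cut the same 1-based, inclusive column ranges out of
# every aligned sequence and join the pieces in the order given.
sub splice_columns {
    my ($seqs, @exons) = @_;    # $seqs: { id => aligned string }; @exons: [start, stop] pairs
    my %joined;
    for my $id (keys %$seqs) {
        $joined{$id} = join '', map {
            my ($start, $stop) = @$_;
            substr($seqs->{$id}, $start - 1, $stop - $start + 1);
        } @exons;
    }
    return \%joined;
}

# Toy alignment: two sequences of equal aligned length, with gaps.
my %aln = (
    seqA => 'ATGCC-GTTAGC',
    seqB => 'ATGCCAGT-AGC',
);

# Two made-up "exons" covering columns 1-3 and 7-9.
my $orf = splice_columns(\%aln, [1, 3], [7, 9]);
print ">$_\n$orf->{$_}\n" for sort keys %$orf;
```

Because every sequence is cut at identical columns, the concatenated pieces all have the same length and remain aligned to each other.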
Any help would be hugely appreciated!