Question

How can the output be renamed when converting genbank files into GFF3 format with BioPerl?

2.8 years ago

Constanza • 0

Hi everyone.

I would really appreciate a hand with the following:

I am trying to convert 1911 files in GBK format to GFF3 and FASTA. For this purpose I've installed BioPerl (1.7.8) and the module Bio::DB::GFF (1.7.4) via cpan in Strawberry Perl for Windows. When executing perl bp_genbank2gff3.pl --dir path_to_files --noinfer --split --outdir path_to_output [ 1 ] the code runs without errors, but only 1270 new files are created: 635 GFF3 and 635 FASTA files.

1) Is there something that is terminating the process before it has finished running all the files? How can I correct it?

It might be relevant to note that the resulting files do not keep the name of the GenBank input (which is properly indexed, e.g. "gen011_ctg009_region001.gbk", in order to keep the files trackable). Instead, they appear to be named after the LOCUS line of the GenBank files, and many of them are called "contigX" or "scaffo_000000X". This makes me think that the loss of proper indexing has caused 1276 of the input files to be overwritten, since when scrolling through the prompt I see no arbitrary skipping of the data.

2) Is there a parameter that makes the output files take the name of the input file?

Particularly, I need this change to be reflected in the description line of the fastas and in the first column of the GFFs, for which it would make more sense to change the LOCUS line of the GBKs.

3) If there is no solution for question #2, any ideas on how I could achieve this other than manually?

Thanks in advance.

gff3 bioperl rna-seq genbank genbank2gff3.pl • 816 views

score 1 · Accepted Answer · 2021-07-21

Hello,

I found a solution that is probably not the most elegant one; however, it did let me continue with the transcriptomic analyses that required the GFF and FASTA files as input data.

1) The processing of the data was indeed hindered by the problem of many files with the same name.
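A quick way to check for this kind of name collision is to count how often each LOCUS name occurs across the input files. The following is only a minimal sketch, not part of the original workflow: it assumes the .gbk files sit in the current directory and that the record name is the second whitespace-separated field of each LOCUS line. The helper names locus_names and collisions are made up for illustration.

```perl
use strict;
use warnings;

# Return the names found on the LOCUS lines of one GenBank file.
sub locus_names {
    my ($path) = @_;
    open( my $in, '<', $path ) or die "Cannot read $path: $!";
    my @names;
    while ( my $line = <$in> ) {
        # Assumption: the record name is the second field of the LOCUS line.
        push @names, $1 if $line =~ /^LOCUS\s+(\S+)/;
    }
    close($in);
    return @names;
}

# Map every LOCUS name used by more than one record to the files that use it.
sub collisions {
    my (@files) = @_;
    my %seen;
    for my $file (@files) {
        push @{ $seen{$_} }, $file for locus_names($file);
    }
    my %shared;
    for my $name ( keys %seen ) {
        $shared{$name} = $seen{$name} if @{ $seen{$name} } > 1;
    }
    return %shared;
}

# Assumption: the input files are in the current directory.
my %dup = collisions( glob('*.gbk') );
for my $name ( sort keys %dup ) {
    printf "%s is used by %d records: %s\n",
        $name, scalar @{ $dup{$name} }, join( ', ', @{ $dup{$name} } );
}
printf "%d LOCUS names are shared by more than one record\n", scalar keys %dup;
```

Any name listed here would map several inputs onto the same output file name when the outputs are named after the LOCUS line.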

In order to correct this, I merged the GBK files (biosynthetic gene clusters) for each genome, reducing the number of files to manually curate from 1911 to 190 with the following script 1:

#!/usr/bin/perl
use strict;
use warnings;

##name of the merged output file
my $outfile = "ISL021.gbk";

##create array that will store a list of filenames
##(the output file from an earlier run is left out)
my @doclist = grep { $_ ne $outfile } glob( '*ISL021*' );

##create output file, arg: file handle, specify mode (read, write, etc), output file name
open( my $output, ">", $outfile ) or die "Cannot write $outfile: $!";

##copy each of the selected files (one at a time) to the output file
foreach my $filename ( @doclist ){

    open( my $input, "<", $filename ) or die "Cannot read $filename: $!";

    ##copy the content of the input file over to the output file
    print {$output} <$input>;
    close( $input );
}

close( $output ) or die "Cannot close $outfile: $!";

It could be improved by automating the search for each of the genomes (ISL###) from a provided list.
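One possible way to automate this, sketched here under assumptions rather than taken from the original workflow: derive the genome IDs from the file names themselves and merge each group in turn. The sketch assumes that every file name contains an ID of the form "ISL" followed by three digits, that all inputs end in .gbk and sit in the current directory, and that the merged file should be called ISL###.gbk. The helper names group_by_genome and merge_files are hypothetical.

```perl
use strict;
use warnings;

# Group file names by the genome ID they contain.
# Assumption: every ID looks like "ISL" followed by three digits.
sub group_by_genome {
    my (@files) = @_;
    my %group;
    for my $file (@files) {
        next unless $file =~ /(ISL\d{3})/;
        my $id = $1;
        # Skip a merged output left over from a previous run.
        next if $file eq "$id.gbk";
        push @{ $group{$id} }, $file;
    }
    return %group;
}

# Concatenate the listed files, in sorted order, into one output file.
sub merge_files {
    my ( $out_name, @inputs ) = @_;
    open( my $out, '>', $out_name ) or die "Cannot write $out_name: $!";
    for my $input ( sort @inputs ) {
        open( my $in, '<', $input ) or die "Cannot read $input: $!";
        print {$out} $_ while <$in>;
        close($in);
    }
    close($out) or die "Cannot close $out_name: $!";
}

# Assumption: the input files are in the current directory.
my %group = group_by_genome( glob('*.gbk') );
for my $id ( sort keys %group ) {
    merge_files( "$id.gbk", @{ $group{$id} } );
    printf "%s.gbk: merged %d files\n", $id, scalar @{ $group{$id} };
}
```

Grouping by a pattern avoids maintaining a separate list of genomes, at the cost of silently ignoring any file whose name does not match the pattern.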

2) The script above solves the output filename issue.

3) The files had to be manually revised and corrected (with a genome ID prefix), since "LOCUS" and "ACCESSION" in GBK files define the "seqname" (first column of GFFs) and the description line of FASTAs, respectively. Other indexing inaccuracies also had to be fixed (e.g. contig2.1 and contig2.2 were both called contig2).
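Part of this manual correction could in principle be scripted. The sketch below is not what was done here; it only illustrates adding a genome ID prefix to the names on LOCUS and ACCESSION lines. It assumes the name is the second whitespace-separated field on those lines, it leaves VERSION lines untouched, and it shifts the remaining columns of the LOCUS line to the right, so the result should be checked with whatever parser reads the files next. The function name prefix_record_names and the file names in the usage comment are hypothetical.

```perl
use strict;
use warnings;

# Prefix the record name on LOCUS and ACCESSION lines with a genome ID.
# Assumption: the name is the second whitespace-separated field on each line.
# Names that already carry the prefix are left alone, so a second run is harmless.
sub prefix_record_names {
    my ( $text, $prefix ) = @_;
    $text =~ s/^(LOCUS\s+)(?!\Q$prefix\E_)(\S+)/$1${prefix}_$2/mg;
    $text =~ s/^(ACCESSION\s+)(?!\Q$prefix\E_)(\S+)/$1${prefix}_$2/mg;
    return $text;
}

# Usage (hypothetical file names):
#   perl prefix_names.pl ISL021 ISL021.gbk > ISL021_prefixed.gbk
if ( @ARGV == 2 ) {
    my ( $prefix, $path ) = @ARGV;
    open( my $in, '<', $path ) or die "Cannot read $path: $!";
    my $text = do { local $/; <$in> };    # read the whole file at once
    close($in);
    print prefix_record_names( $text, $prefix );
}
```

This does not resolve records that share a name within the same genome (such as contig2.1 and contig2.2 both being called contig2); those still need distinct names assigned by hand or by a separate rule.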

Although my question might have been a bit too specific for the purpose of this forum, I hope it can help someone who finds themselves analysing data processed by someone else while still a beginner in bioinformatics.