Question: Modifying a GenBank file
5 months ago, matt81rd wrote:

Hi, I am trying to search through a file for a specific list of words. If one of those words is found, I want to add a new line underneath it containing the phrase /colour = 1 (I don't want to remove the original word I am searching for).

An extract of the file for context and format:

> LOCUS       contig_2_pilon_pilon 5558986 bp    DNA     linear   BCT
> 16-JUN-2020 DEFINITION  Escherichia coli O157:H7 strain (270078)
> ACCESSION    VERSION KEYWORDS    . SOURCE      Escherichia coli 270078
> ORGANISM  Escherichia coli 270078
>             Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
>             Escherichia. COMMENT     Annotated using prokka 1.14.6 from
>             https://github.com/tseemann/prokka. FEATURES             Location/Qualifiers
>      source          1..5558986
>                      /organism="Escherichia coli 270078"
>                      /mol_type="genomic DNA"
>                      /strain="strain"
>                      /db_xref="taxon:562"
>      CDS             61523..61744
>                      /gene="pspD"
>                      /locus_tag="JCCJNNLA_00057"
>                      /inference="ab initio prediction:Prodigal:002006"
>                      /inference="similar to AA sequence:RefSeq:EG10779-MONOMER"
>                      /codon_start=1
>                      /transl_table=11
>                      /product="peripheral inner membrane heat-shock protein"
>                      /translation="MNTRWQQAGQKVKPGFKLAGKLVLLTALRYGPAGVAGWAIKSVA
>                      RRPLKMLLAVALEPLLSRAANKLAQRYKR"

Here is one of the lists of words I am looking for throughout the file:

regulation_list=["anti-repressor","anti-termination","antirepressor","antitermination","antiterminator","anti-terminator","cold-shock","cold shock","heat-shock","heat shock","regulation","regulator","regulatory","helicase","antibiotic resistance","repressor","zinc","sensor","dipeptidase","deacetylase","5-dehydrogenase","glucosamine kinase","glucosamine-kinase","dna-binding","dna binding","methylase","sulfurtransferase","acetyltransferase","control","ATP-binding","ATP binding","Cro","Ren protein","CII","inhibitor","activator","derepression","protein Sxy","sensing","sensor","Tir chaperone","Tir-cytoskeleton","Tir cytoskeleton","Tir protein","EspD"]

As you can see, that extract contains one of the phrases I am looking for ("heat-shock"), and I want to add a new line underneath it with the phrase /colour = 1

Any help would be great!

genbank python • 162 views

If there are not too many files to process, you can open them in a genome browser (Apollo, Artemis, GenomeView, ...) and change the color of the feature there. Afterwards you can save the file again.

(comment written 5 months ago by lieven.sterck)
5 months ago, JC wrote:

Perl solution:

#!/usr/bin/perl
use strict;
use warnings;

my @regulation_list = ("anti-repressor","anti-termination","antirepressor","antitermination","antiterminator","anti-terminator","cold-shock","cold shock","heat-shock","heat shock","regulation","regulator","regulatory","helicase","antibiotic resistance","repressor","zinc","sensor","dipeptidase","deacetylase","5-dehydrogenase","glucosamine kinase","glucosamine-kinase","dna-binding","dna binding","methylase","sulfurtransferase","acetyltransferase","control","ATP-binding","ATP binding","Cro","Ren protein","CII","inhibitor","activator","derepression","protein Sxy","sensing","sensor","Tir chaperone","Tir-cytoskeleton","Tir cytoskeleton","Tir protein","EspD");

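# Build one alternation from the list; quotemeta keeps any regex
# metacharacters in the terms literal.
my $list_regex = join "|", map { quotemeta } @regulation_list;

while (<>) {
    print;    # always echo the original line
    # Case-insensitive match on /product lines; $1 captures the leading
    # whitespace so the new qualifier lines up with the others.
    if (m|(\s+)/product=.*($list_regex)|i) {
        my $pre = $1;
        print "$pre/colour = 1\n";
    }
}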
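
One thing to keep in mind with a plain alternation like this: it is a case-insensitive substring test, so short terms such as "Cro" or "CII" can also hit inside longer words (for example "microcin"). If that is not wanted, the terms can be anchored so they only match as whole words. A small Python sketch of the difference follows; the helper name `build_term_regex` and the two-term list are made up for illustration and are not part of the script above.

```python
import re


def build_term_regex(terms, whole_words=False):
    """Compile a case-insensitive alternation of literal terms."""
    body = "|".join(re.escape(t) for t in terms)
    if whole_words:
        # Lookarounds require that no word character touches the match
        # on either side, so "Cro" no longer matches inside "microcin".
        body = rf"(?<!\w)(?:{body})(?!\w)"
    return re.compile(body, re.IGNORECASE)


# Illustrative two-term list, not the full list from the question.
terms = ["Cro", "heat-shock"]
loose = build_term_regex(terms)
strict = build_term_regex(terms, whole_words=True)

product = "microcin C7 self-immunity protein"
print(bool(loose.search(product)))   # substring hit on "cro" inside "microcin"
print(bool(strict.search(product)))  # no whole-word hit
```

Whether whole-word matching is the right choice depends on the list: it would also stop "regulator" from matching inside "regulators", so some terms may need plural forms added.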

Testing the Perl script:

$ perl checkWords.pl < file.gbk
LOCUS       contig_2_pilon_pilon 5558986 bp    DNA     linear   BCT
16-JUN-2020 DEFINITION  Escherichia coli O157:H7 strain (270078)
ACCESSION    VERSION KEYWORDS    . SOURCE      Escherichia coli 270078
ORGANISM  Escherichia coli 270078
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia. COMMENT     Annotated using prokka 1.14.6 from
            https://github.com/tseemann/prokka. FEATURES             Location/Qualifiers
     source          1..5558986
                     /organism="Escherichia coli 270078"
                     /mol_type="genomic DNA"
                     /strain="strain"
                     /db_xref="taxon:562"
     CDS             61523..61744
                     /gene="pspD"
                     /locus_tag="JCCJNNLA_00057"
                     /inference="ab initio prediction:Prodigal:002006"
                     /inference="similar to AA sequence:RefSeq:EG10779-MONOMER"
                     /codon_start=1
                     /transl_table=11
                     /product="peripheral inner membrane heat-shock protein"
                     /colour = 1
                     /translation="MNTRWQQAGQKVKPGFKLAGKLVLLTALRYGPAGVAGWAIKSVA
                     RRPLKMLLAVALEPLLSRAANKLAQRYKR"
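
Since the question is tagged python, the same idea can be sketched in Python. This is a rough equivalent under a few assumptions, not a tested replacement for the Perl script: the function name `add_colour` and the shortened `REGULATION_TERMS` list are made up for illustration, and a `/product` value is assumed to be complete at the first line that ends in a double quote. Unlike the line-by-line Perl version, it collects a `/product` value that wraps over several lines before matching, and writes `/colour = 1` after the last line of that value.

```python
import re

# Shortened, illustrative subset of the regulation list from the question.
REGULATION_TERMS = ["heat-shock", "heat shock", "regulator", "repressor", "helicase"]

# One case-insensitive alternation; re.escape keeps the terms literal.
TERM_RE = re.compile("|".join(re.escape(t) for t in REGULATION_TERMS), re.IGNORECASE)
PRODUCT_RE = re.compile(r'^(\s+)/product="')


def add_colour(lines, colour_line="/colour = 1"):
    """Yield every input line, adding a colour qualifier after matching /product values."""
    indent = None  # indentation of the /product qualifier being collected
    value = []     # pieces of the (possibly wrapped) /product value
    for line in lines:
        yield line
        text = line.rstrip("\n")
        if indent is None:
            m = PRODUCT_RE.match(text)
            if m is None:
                continue
            indent = m.group(1)
            value = [text[m.end():]]
        else:
            value.append(text.strip())
        # Assumption: the value is complete once a piece ends with a double quote.
        if value[-1].endswith('"'):
            if TERM_RE.search(" ".join(value)):
                yield f"{indent}{colour_line}\n"
            indent, value = None, []
```

With two open file handles `src` and `dst`, it could be used as `dst.writelines(add_colour(src))`. Wrapped pieces are joined with a single space, so a term that is itself split across a line break at a hyphen would be missed.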