Question: Parsing gbk file
11 months ago, erick_rc93 wrote:

I have multiple GenBank (.gbk) files. Each one is a concatenation of several records (chromosomes and plasmids), and I would like to split each of them into single-record files. I'm trying the following awk code:

awk -v n=1 '/^\/\//{close("out"n);n++;next} {print > "out"n}' filename.gbk

I'd like the output files to be named after the input file:

filename_1.gbk
filename_2.gbk
filename_3.gbk

I would strongly suggest using a proper parser like Biopython for this.

If for some reason you cannot, it should be sufficient to split the files on record boundaries: each record starts at a LOCUS line and ends at a // line.

(comment written 11 months ago by Joe)
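
The LOCUS-to-// splitting described in the comment above could be sketched roughly as follows, producing the numbered names asked for in the question. This is only a sketch under assumptions: demo.gbk and its two toy records are made up for illustration, and real files are assumed to have every record start with LOCUS and end with //.

```shell
# Create a tiny two-record demo file (hypothetical content, for illustration only).
cat > demo.gbk <<'EOF'
LOCUS       REC_A   10 bp    DNA
ORIGIN
        1 acgtacgtac
//
LOCUS       REC_B   8 bp    DNA
ORIGIN
        1 ttgacctg
//
EOF

# A new output file starts at each LOCUS line; the closing "//" line is
# written to it before the file is closed, so each piece stays a full record.
awk '
FNR == 1  { base = FILENAME; sub(/\.gbk$/, "", base); n = 0 }  # reset counter per input file
/^LOCUS/  { n++; out = base "_" n ".gbk" }                     # e.g. demo_1.gbk, demo_2.gbk
out != "" { print > out }                                      # copy the line into the current record
/^\/\//   { close(out); out = "" }                             # end of record
' demo.gbk

ls demo_*.gbk
```

Unlike the command in the question, this keeps the // terminator in each output file. Several input files can be given at once (for example *.gbk), since the counter and base name are reset at the first line of each file.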
11 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:
 wget -O - "ftp://ftp.ncbi.nlm.nih.gov/genbank/gbpln64.seq.gz" | gunzip -c | \
awk 'BEGIN{fname="";} /^LOCUS/ {close(fname);fname=sprintf("%s.gbk",$2);} {if(fname!="") print $0 >> fname; }'

$ ls *.gbk | head
CR354457.gbk
CR354458.gbk
CR354459.gbk
CR354460.gbk
CR354461.gbk
CR354462.gbk
CR354463.gbk
CR354464.gbk
CR354465.gbk
CR354466.gbk

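EDIT: if you want the input filename in the output name, build it like this (FILENAME holds the input file's name only when the file is passed to awk as an argument rather than piped in):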

sprintf("%s.%s.gbk",FILENAME,$2);
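
Put together for a local file, the accession-based naming might look like the sketch below. The file sample.gbk and its accessions are made up for illustration, and stripping the .gbk extension from FILENAME is an extra step that is not part of the answer above.

```shell
# Hypothetical two-record input, just enough to exercise the naming.
printf 'LOCUS       AB000001\n//\nLOCUS       AB000002\n//\n' > sample.gbk

awk '
/^LOCUS/ {
    if (fname != "") close(fname)              # finish the previous record
    base = FILENAME; sub(/\.gbk$/, "", base)   # drop the extension (assumption)
    fname = sprintf("%s.%s.gbk", base, $2)     # e.g. sample.AB000001.gbk
}
fname != "" { print > fname }
' sample.gbk

ls sample.*.gbk
```

Here > is used instead of >>, so re-running the command overwrites earlier output rather than appending to it. This assumes each accession occurs only once per input file.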