Question

Append assembly accession to nucleotide accession number in RefSeq Genbank file

2.1 years ago

genomes_and_MGEs ▴ 10

Hi everyone,

When I want to append the filename to the contig header in a multi-fasta file, I usually use

for F in *.fasta; do N=$(basename "$F" .fasta); bbrename.sh in="$F" out="${N}_mod.fasta" prefix="$F" addprefix=t; done

However, this doesn't work for GenBank files. When I want to split multi-GenBank files, I use

cat > splitgbk.py
from Bio import SeqIO
import sys
# Write each record read from stdin to its own file, named after the record ID
for rec in SeqIO.parse(sys.stdin, "genbank"):
    SeqIO.write(rec, rec.id + ".gbk", "genbank")

for F in *.gbff; do python splitgbk.py < "$F"; done

This generates multiple *.gbk files named "accession_number.gbk". However, I would like the source filename to be prepended to the accession number, so that each split GenBank file is named "filename_accession.gbk". Can you guys help me out? Thanks!


updated 2.1 years ago by cpad0112 21k • written 2.1 years ago by genomes_and_MGEs ▴ 10

How many contigs does each file have? Can you post an example with input and expected output?

If you want to append the file name to the ID/header of a FASTA file, try the following on a small file:

$ awk '/^>/ {sub("$","_"FILENAME,$0)}1' test.fa

2.1 years ago by cpad0112 21k

Thanks for the reply. However, my goal is not to append filename to the header of a fasta file, but to append filename to the accession number of a genbank file. For example: if I split the genome with filename GCF_000007805.1_ASM780v1.gbk, I'll have 3 replicons: NC_004578.1.gbk, NC_004633.1.gbk, NC_004632.1.gbk. My goal is to produce the following output: GCF_000007805.1_ASM780v1_NC_004578.1.gbk, GCF_000007805.1_ASM780v1_NC_004633.1.gbk, GCF_000007805.1_ASM780v1_NC_004632.1.gbk.
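One possible way to get that naming is to pass the file path to the script instead of piping it through stdin, so the script knows the source filename. The following is only a sketch built on the Biopython approach from the question; the function names `output_name` and `split_genbank` and the `out_dir` parameter are made up for illustration. It assumes the input is an uncompressed GenBank file and that `rec.id` holds the accession with version (e.g. NC_004578.1), as in the original script's output.

```python
import sys
from pathlib import Path


def output_name(source_path, record_id):
    """Build '<source file name without extension>_<record id>.gbk'."""
    # Path.stem drops only the final extension (.gbk, .gbff, ...),
    # so the dots inside 'GCF_000007805.1_ASM780v1' are preserved.
    return f"{Path(source_path).stem}_{record_id}.gbk"


def split_genbank(source_path, out_dir="."):
    """Write each record of a multi-record GenBank file to its own file."""
    # Imported here so the naming helper works without Biopython installed.
    from Bio import SeqIO

    written = []
    for rec in SeqIO.parse(str(source_path), "genbank"):
        target = Path(out_dir) / output_name(source_path, rec.id)
        SeqIO.write(rec, str(target), "genbank")
        written.append(target)
    return written


if __name__ == "__main__":
    # Every command-line argument is treated as a GenBank file to split.
    for path in sys.argv[1:]:
        split_genbank(path)
```

If this were saved as, say, split_with_prefix.py (a hypothetical name), it could be run as `python split_with_prefix.py *.gbff`, with no shell loop or stdin redirection needed. Gzipped inputs are not handled in this sketch.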