Append assembly accession to nucleotide accession number in RefSeq Genbank file
2.1 years ago

Hi everyone,

When I want to append the filename to the contig header in a multi-fasta file, I usually use

for F in *.fasta; do N=$(basename "$F" .fasta); bbrename.sh in="$F" out="${N}_mod.fasta" prefix="$F" addprefix=t; done

However, this doesn't work for GenBank files. When I want to split multi-record GenBank files, I use

cat > splitgbk.py
from Bio import SeqIO
import sys
# Write each record to its own file, named after the record ID (accession.version)
for rec in SeqIO.parse(sys.stdin, "genbank"):
    SeqIO.write([rec], rec.id + ".gbk", "genbank")

for F in *.gbff; do python splitgbk.py < "$F"; done

This generates multiple *.gbk files named "accession_number.gbk". However, I would like the source filename prepended to the accession number, so that each split GenBank file is named "filename_accession.gbk". Can you guys help me out? Thanks!

sequence • 734 views

How many contigs does a file have? Can you post an example with input and expected output?

If you want to append the file name to the ID / header lines of a FASTA file, try the following on a small file:

$ awk '/>/ {sub("$","_"FILENAME,$0)}1' test.fa

Thanks for the reply. However, my goal is not to append filename to the header of a fasta file, but to append filename to the accession number of a genbank file. For example: if I split the genome with filename GCF_000007805.1_ASM780v1.gbk, I'll have 3 replicons: NC_004578.1.gbk, NC_004633.1.gbk, NC_004632.1.gbk. My goal is to produce the following output: GCF_000007805.1_ASM780v1_NC_004578.1.gbk, GCF_000007805.1_ASM780v1_NC_004633.1.gbk, GCF_000007805.1_ASM780v1_NC_004632.1.gbk.
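The naming scheme described above could be sketched roughly as follows. This is a minimal sketch, not a tested pipeline: it uses plain Python instead of Biopython, and it assumes that each record ends with a "//" line and carries a VERSION line whose second field is the accession.version (the same value Biopython exposes as rec.id). The function name split_genbank and the script name used in the usage line are made up for illustration.

```python
import sys
from pathlib import Path


def split_genbank(path, out_dir="."):
    """Split a multi-record GenBank file into one file per record.

    Each output file is named <input file stem>_<accession.version>.gbk.
    Assumes every record has a VERSION line and ends with a '//' line.
    """
    path = Path(path)
    out_dir = Path(out_dir)
    prefix = path.stem  # file name without its last extension
    written = []
    record_lines = []
    record_id = None
    with path.open() as handle:
        for line in handle:
            # Skip blank lines that sit between records
            if not record_lines and not line.strip():
                continue
            record_lines.append(line)
            if line.startswith("VERSION") and record_id is None:
                fields = line.split()
                if len(fields) > 1:
                    record_id = fields[1]
            elif line.startswith("//"):
                if record_id is None:
                    raise ValueError("record without VERSION line in " + str(path))
                out_path = out_dir / f"{prefix}_{record_id}.gbk"
                out_path.write_text("".join(record_lines))
                written.append(out_path)
                record_lines, record_id = [], None
    return written


if __name__ == "__main__":
    # Every command-line argument is treated as a GenBank file to split
    for name in sys.argv[1:]:
        for out_path in split_genbank(name):
            print(out_path)
```

Saved under a hypothetical name such as split_gbk_prefixed.py, it could be run as "python split_gbk_prefixed.py *.gbff". Because the record text is copied through unchanged rather than re-written by a parser, the original formatting of each record is kept as-is.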


Can you post a small example and expected output?
