Question

Slicing Genbank File by range

0

Entering edit mode

3.2 years ago

kamel ▴ 70

Dear colleagues, I have a genbank file annotated from a genome draft (contig_1 to contig_89). I'm interested in extracting only a region 20,500 to 21,500 from contig_2 and I would like to keep the gbk format as output. Can someone help me with this.

Thanks in advance !

sequence extract gbk python • 1.0k views

ADD COMMENT • link 3.2 years ago by kamel ▴ 70

1

Entering edit mode

Check this thread Slicing Genbank File by 'gene_id' range

ADD REPLY • link 3.2 years ago by Arup Ghosh 3.2k

1

Entering edit mode

Related to the above, the code that was derived from is here: https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py

The linked thread was an adaptation to slice between features, but if you want to use indices, its actually even easier.

ADD REPLY • link 3.2 years ago by Joe 21k

0

Entering edit mode

Thanks @Joe for your feedback. I used this script on the genbank of the reference strain but got the error below. Please also find the command I used.

./Genbank_slicer.py -g sequence.gbk -s 45948 -e 92378

 Traceback (most recent call last):
  File "./Genbank_slicer.py", line 283, in <module>
    main()
  File "./Genbank_slicer.py", line 271, in main
    print(subRecord.format("genbank"))
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqRecord.py", line 719, in format
    return self.__format__(format)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqRecord.py", line 757, in __format__
    SeqIO.write(self, handle, format_spec)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 530, in write
    count = writer_class(handle).write_file(sequences)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 244, in write_file
    count = self.write_records(records, maxcount)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 218, in write_records
    self.write_record(record)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqIO/InsdcIO.py", line 981, in write_record
    self._write_the_first_line(record)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqIO/InsdcIO.py", line 744, in _write_the_first_line
    raise ValueError("missing molecule_type in annotations")
ValueError: missing molecule_type in annotations

ADD REPLY • link 3.2 years ago by kamel ▴ 70

0

Entering edit mode

I solved this problem by installing biopython: 1.77

pip3 install biopython == 1.77

Do you have any idea how to specify the contig number on which I can extract the part that interests us. For example I want to target contig 2: from 20,500 to 21,500 ??

ADD REPLY • link 3.2 years ago by kamel ▴ 70

1

Entering edit mode

Is your input file a multi-genbank in that case?

If you know which of these records you want (2) that makes it a little easier, but still needs quite a lot of code refactoring. Is it a safe assumption that your contigs are listed in order within the file?

I don't have time right now to write the full code, so there is some pseudo-code below. If I get chance I'll try and work this up properly.

contig_wanted = 2    # pass this in from a dedicated argparse argument
# Iterate each record in order
for i, record in enumerate(SeqIO.parse("/path/to/genbanks.gb", "genbank")):
    #check index against requested (remember python starts at 0 so you need the i-1th index
    # alternatively you can just adjust enumerate to start from 1 instead of 0.
    if i == contig_wanted:
         # do slicing
   #write file etc

ADD REPLY • link 3.2 years ago by Joe 21k