How to extract a specific region from a GenBank file (.gbk)
18 months ago
Bio1421 • 0

In the GenBank record view on NCBI, the "Change region shown" option lets you extract the features in a specific range into a gbk file.

In addition, the positions of the extracted features are renumbered so that they count from 1.

Example: raw.gbk - gene A position: 345612..352112

select.gbk - gene A position: 1..6501
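The renumbering in this example is a plain offset subtraction. Here is a minimal sketch of it, assuming 1-based inclusive coordinates as GenBank uses; the function name `renumber` is made up for illustration.

```python
def renumber(feature_start, feature_end, region_start):
    """Map 1-based inclusive coordinates onto a region starting at region_start."""
    # Everything before the region start is cut away, so shift by that amount.
    offset = region_start - 1
    return feature_start - offset, feature_end - offset


# Gene A from the example, with the region starting at the gene's first base.
print(renumber(345612, 352112, 345612))  # (1, 6501)
```

A span of 345612..352112 covers 6501 bases, so the last position in the extracted file is 6501.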

[Figure: NCBI "Change region shown" panel]

Can I do the same thing to my own gbk file with Python? Or can you point me to a tool that does this? I'm asking because I can't find one no matter how much I look.

Thank you for reading. Have a nice day.

genbank parsing position • 2.9k views
18 months ago
Joe 21k

Yep, you can use Biopython to very conveniently 'slice' a GenBank record.

I have code for this here: https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py

The important bit is here if you don't want to use the code as-is: https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py#L118-L134
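For readers who just want the idea, here is a minimal sketch of slicing with Biopython. It is not the linked script; the function names `slice_record` and `slice_genbank` are made up for illustration, and it assumes 1-based inclusive coordinates on input.

```python
from Bio import SeqIO


def slice_record(record, start, stop):
    """Return the sub-record covering 1-based, inclusive positions start..stop."""
    # Python slices are 0-based and end-exclusive, hence start - 1.
    sub = record[start - 1:stop]
    # Slicing drops most annotations; the GenBank writer needs molecule_type.
    sub.annotations["molecule_type"] = record.annotations.get("molecule_type", "DNA")
    return sub


def slice_genbank(in_path, out_path, start, stop):
    """Slice every record in in_path and write the results to out_path."""
    subs = [slice_record(rec, start, stop) for rec in SeqIO.parse(in_path, "genbank")]
    SeqIO.write(subs, out_path, "genbank")
```

Calling `slice_genbank("raw.gbk", "select.gbk", 345612, 352112)` would write the region to `select.gbk`. Slicing a SeqRecord keeps only features that lie entirely inside the range and shifts their coordinates so the new record starts at 1.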

18 months ago
GenoMax 142k

You can use Entrez Direct (EDirect) to get regions of a sequence.

If you want GenBank format then use

$ efetch -db nuccore -id CP014051 -seq_start 583556 -seq_stop 590090 -format gb > region.gb

If you simply need the sequence then use

$ efetch -db nuccore -id CP014051 -seq_start 583556 -seq_stop 590090 -format fasta > region.fa

Is there a way to do this so that we only need to supply two text files, one with the file's ID/name and the other with the start and stop of each gene? For example, I want to create gbk files for 20 genes of interest from a large set of complete-genome gbk files, each of which contains those genes.


Not by default. You will need some way of feeding the three variables to the efetch command. For example, if you had a file with the three fields (accession, start, stop) separated by tabs, you could do:

awk -F "\t" '{OFS=" "}{print $1,$2,$3}' your_intervals | xargs -n 3 sh -c 'efetch -db nuccore -id "$1" -seq_start "$2" -seq_stop "$3" -format gb > "${1}_${2}_${3}.gb"' _

This will produce a separate file for each interval.
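Since the original question asked about Python, the same loop could be sketched there as well. This is only an illustration: it assumes `efetch` from Entrez Direct is installed and on the PATH, that the interval file is tab-separated as described above, and the function names are made up.

```python
import csv
import subprocess


def read_intervals(lines):
    """Yield (accession, start, stop) from tab-separated lines, skipping short rows."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) >= 3:
            yield row[0], int(row[1]), int(row[2])


def efetch_command(accession, start, stop, fmt="gb"):
    """Build the same efetch call as the one-liner above, as an argument list."""
    return ["efetch", "-db", "nuccore", "-id", accession,
            "-seq_start", str(start), "-seq_stop", str(stop), "-format", fmt]


def fetch_all(interval_path):
    """Run efetch once per interval, writing <accession>_<start>_<stop>.gb."""
    with open(interval_path, newline="") as handle:
        for acc, start, stop in read_intervals(handle):
            with open(f"{acc}_{start}_{stop}.gb", "w") as out:
                subprocess.run(efetch_command(acc, start, stop), stdout=out, check=True)
```

`fetch_all("your_intervals")` would then produce one file per line of the interval file, just like the shell version.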


Hi GenoMax Sir, I have 60 genes and I want to create gbk files for only those genes. For example, the desired genes are present in every genome, and I have 300 E. coli genomes, so I want to create 300 gbk files containing only those 60 genes; the remaining genes are not required. Is it possible to do that? Please reply, it would be very helpful.

Thanks a lot for your time!


It is possible, as I show above. You will need three pieces of information per interval you want to retrieve: accession, start and stop.


Hi GenoMax Sir, first of all, thank you for your time. I have tried the script above. It works, but it extracts one gene per gbk file, whereas I want a gbk file containing all of the genes of interest; I mean all 60 genes should be present in each of the 300 gbk files. Also, is it possible to make the gbk file from a local NCBI-annotated gbk file rather than from data fetched from NCBI?

Thank you!


Then you need to make files for those 60 genes with coordinates for each genome.

If you have a properly formatted GenBank file locally then you may be able to use this (or modify as needed): Slicing Genbank File by 'gene_id' range
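For a local file, one possible approach is to pick the genes by name with Biopython rather than by coordinates. The sketch below is not the code from the linked post. It assumes each gene of interest has a `gene` feature with a `/gene` qualifier and a simple (non-joined) location, and the function names are made up for illustration.

```python
from Bio import SeqIO


def extract_genes(record, wanted):
    """Return one sub-record per 'gene' feature whose name is in wanted."""
    subs = []
    for feature in record.features:
        if feature.type != "gene":
            continue
        name = feature.qualifiers.get("gene", [""])[0]
        if name not in wanted:
            continue
        # Assumes a simple location; start is 0-based and end is exclusive.
        sub = record[int(feature.location.start):int(feature.location.end)]
        # The GenBank writer needs molecule_type, so set it explicitly.
        sub.annotations["molecule_type"] = record.annotations.get("molecule_type", "DNA")
        sub.description = f"{name} from {record.id}"
        subs.append(sub)
    return subs


def extract_to_file(in_path, out_path, wanted):
    """Write all wanted genes from one genome file into a single GenBank file."""
    subs = []
    for record in SeqIO.parse(in_path, "genbank"):
        subs.extend(extract_genes(record, wanted))
    return SeqIO.write(subs, out_path, "genbank")
```

Running `extract_to_file` once per genome, with the same set of gene names each time, would give one output file per genome that holds only the genes of interest. Each gene ends up as its own record in that file, with coordinates starting from 1.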


Thank you, GenoMax sir, from the bottom of my heart; I appreciate all you have done. Your help means a lot to me. Seriously, I was stuck for the past 2 weeks, but now it's all fine. Thank you again, sir, for your time.
