Question

Parsing genbank or gff files based on specific nucleotide coordinates

0

Entering edit mode

17 months ago

genomes_and_MGEs ▴ 10

Hi there,

I have multiple genbank files, and would like to parse them based on specific nucleotide coordinates. For example, I have:

A.gbk 300 600 B.gbk 400 800 ...

My goal is to extract the gene in A.gbk that matches the coordinates from 300-600 nt, extract gene in B.gbk that matches the coordinates from 400-800nt, and so on. In case these coordinates match intergenic regions, show something like "-".

I know it's possible to do this for files in fasta format using bedtools getfasta, but I'm not sure how to do it when the input file is in genbank format. Converting to fasta format is not an option, since I need the gene annotations. I also have the input files in gff format, in case it's easier to extract the information I need.

Thanks!

Sequence • 643 views

ADD COMMENT • link updated 17 months ago by Joe 21k • written 17 months ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

Have a look for "genbank slicing" on the forum - this can be done quite easily with Biopython.

ADD REPLY • link 17 months ago by Joe 21k

0

Entering edit mode

Thanks for this. However, this doesn't help solving my issue. I tried to run on a sample genbank file, and it only extracts gene information in case the interval is larger enough to include the gene. My problem is that the start and end coordinates comprehend a quite small nucleotide region, and my goal is to extract the "full" gene that is targeted by these small regions. For example, if I run:

Genbank_slicer.py -g A.gbk -s 10 -e 20

File A.bgbk has one gene in position 2-1000. Using the command above, it will fetch the nucleotides from 10-20, but won't fetch the gene ID. Is there a workaround?

ADD REPLY • link 17 months ago by genomes_and_MGEs ▴ 10

1

Entering edit mode

AFAIK you can still use biopython for this, you just can't use that particular script 'as is'.

You can use biopython's feature location framework to 'interrogate' a span of sequence.

I would approach it something like this:

Define sequence span
if <a feature> in <feature span>
keep the whole feature

To achieve (2) you may need to do some overlap arithmetic but that's not too difficult.

Alternatively, read in the genbank, obtain all the features (including their locations) and simply look up in the list of all features, which feature your interval spans. If your intervals are more often shorter than the features, this latter approach might be simpler (whereas if your spans were large, it would probably be simpler to look for features within).

As long as you treat this in a 'greedy' manner, such that features which 'fall off the ends' of the interval are included, that should solve your problem,

This thread might be helpful: Extract genbank features inside a given range

And this code is also somewhat relevant: https://github.com/jrjhealey/bioinfo-tools/blob/master/get_spans.py#L106-L113

ADD REPLY • link 17 months ago by Joe 21k