Extract Features From Gbk Via Bash
7.7 years ago
Phil S. ▴ 690

Hi

Is there any way to extract the locus_tag, join and complement information from a GenBank file via bash (one-liner)? What gives me problems is the line break that may occur in the join or complement field of the GenBank entry!

Phil

P.S. The lines I want to extract look like this:

/locus_tag="XXX_00002"

join(5485..5700,5784..6116,6230..6377,6434..6574,

6683..6819,7047..7314,7428..7938,8031..8307)

complement(join(34726..34766,34850..34866,34975..35287,

35392..35518,35604..35744,35831..35928,36051..36156))

Maybe, but it'd be a lot easier to use something like Biopython or BioPerl, which can parse GenBank files.

Historically, this is why Perl and Python exist in the first place: so that we don't have to process text in bash.

Normally I solve stuff like this using Biopython. The thing was that I wasn't working on my own laptop, so I just needed to come up with a neat trick to solve it this one time... Thanks for your suggestions anyway!

7.7 years ago
xb ▴ 420

sed is one option.

The sample file,

cat sample.gbk

## output
## ignored, same as in the question


To join the unwanted line breaks,

sed '/^$/d' sample.gbk | sed 'N;/,\n/s/"\? *\n//;P;D'

## output
/locus_tag="XXX_00002"
join(5485..5700,5784..6116,6230..6377,6434..6574,6683..6819,7047..7314,7428..7938,8031..8307)
complement(join(34726..34766,34850..34866,34975..35287,35392..35518,35604..35744,35831..35928,36051..36156))

Then there are many ways to capture the contents. Below is one way to capture the "join" information,

sed '/^$/d' sample.gbk | sed 'N;/,\n/s/"\? *\n//;P;D' | sed -n 's/^join(\(.\+\))$/\1/p'

## output
5485..5700,5784..6116,6230..6377,6434..6574,6683..6819,7047..7314,7428..7938,8031..8307

Or all information in one line,

sed '/^$/d' sample.gbk | sed 'N;/,\n/s/"\? *\n//;P;D' | tr "\n" ";" | sed -r 's/^\/locus_tag="(.+)".*join\(([^()]+)\).*complement\(join\(([^()]+)\)\).*/\1\n\2\n\3\n/g'

## output
XXX_00002
5485..5700,5784..6116,6230..6377,6434..6574,6683..6819,7047..7314,7428..7938,8031..8307
34726..34766,34850..34866,34975..35287,35392..35518,35604..35744,35831..35928,36051..36156
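
If the sed N;P;D idiom is hard to read, the line-joining step can probably also be written in awk. The sketch below is not part of the original answer; it assumes the input looks like the sample in the question, where a wrapped location always breaks right after a comma. The function name join_wrapped is made up for illustration.

```shell
# Hypothetical helper (not from the original answer): join location lines
# that were wrapped after a comma. Reads the named files, or stdin if none.
join_wrapped() {
  awk '
    /^[ \t]*$/ { next }                  # drop blank lines
    {
      sub(/^[ \t]+/, "")                 # strip leading indentation
      sub(/[ \t\r]+$/, "")               # strip trailing whitespace and CR
      buf = buf $0                       # append to the pending record
      if (buf !~ /,$/) {                 # complete unless it ends with a comma
        print buf
        buf = ""
      }
    }
    END { if (buf != "") print buf }     # flush an unterminated record
  ' "$@"
}
```

Running join_wrapped sample.gbk should give the same three joined lines as the first sed pipeline, and its output can be piped into the capturing sed commands unchanged. Unlike the sed version it keeps going while a record still ends in a comma, so it should also handle locations that wrap over more than two lines.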

Thank you, works really nice!