Question

How To Extract Gene Names From An Unstructured Text File?

12.2 years ago

bioinfo ▴ 840

This is probably not a 100% bioinformatics problem of the kind we are supposed to discuss here, but I got stuck on this general scripting problem and hoped for a quick solution. Please guide me a bit.

The description below is a column in my large text file. I have been trying to extract the last bracketed section of each description line (e.g. gedP, aucB / ybdE). There is no specific pattern across rows that I can see, except: some text [organism name] [gene name] (these three sections appear to be space-separated). So I need to go line by line, find the gene name in the second [ ], and extract it (e.g. gedP).

I could do it easily if the sections were tab-separated, but the space separation causes problems for me (still learning!). Any suggestions?

 bacterial transcriptional activator protein GedP [Escherichia coli MS 182-1] [gedP]
 large cation system protein AucB [Escherichia coli UTI89] [aucB / ybdE]
 DacG gamma [uncultured bacterium] [dacG]
 RecName: Full=Quaternary ammonium compound-resistance protein qacF; AltName: Full=Quaternary ammonium determinant F [qacF]

awk • 5.5k views

ADD COMMENT • link updated 6.0 years ago by tomluec ▴ 60 • written 12.2 years ago by bioinfo ▴ 840

score 1 · Answer 1 · 2013-04-26

12.2 years ago

Zhen Sun ▴ 60

If you are sure about the pattern, you can split each line by "[", get the third column and remove the ending "]" like this:

cut -d '[' -f3 input_file | sed "s/]$//"

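Or this Python script can get you the content inside the last "[ ]".

#!/usr/bin/env python3

# Read lines from the files named on the command line (or from stdin).
import fileinput

for line in fileinput.input():
    # Slice between the last "[" and the last "]" on the line.
    print(line[line.rfind("[") + 1:line.rfind("]")])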
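
The slicing approach above assumes every line contains a bracketed section. As a minimal sketch only, the variant below uses a regular expression instead, returns an empty list for lines without any "[ ]", and splits aliases such as "aucB / ybdE" into separate names. The helper name extract_gene_names and the choice to split on "/" are assumptions made for illustration, not part of the original answer.

```python
import re

# Matches one "[...]" group that contains no nested brackets.
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def extract_gene_names(line):
    """Return the gene names found in the last [ ] group of a line.

    Hypothetical helper: returns an empty list when the line has no
    bracketed group, and splits aliases on "/" (an assumption about
    how alternative names are written in this file).
    """
    groups = BRACKETED.findall(line)
    if not groups:
        return []
    # Only the last group is treated as the gene name field.
    return [name.strip() for name in groups[-1].split("/") if name.strip()]


if __name__ == "__main__":
    samples = [
        "bacterial transcriptional activator protein GedP [Escherichia coli MS 182-1] [gedP]",
        "large cation system protein AucB [Escherichia coli UTI89] [aucB / ybdE]",
        "DacG gamma [uncultured bacterium] [dacG]",
        "RecName: Full=Quaternary ammonium determinant F [qacF]",
    ]
    for sample in samples:
        print(extract_gene_names(sample))
```

Because it always takes the last group, this also covers rows where the [organism name] section is missing.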

ADD COMMENT • link 12.2 years ago by Zhen Sun ▴ 60

That's a very good idea: split each line into three sections on "[" and then grab the third part (-f3). It worked for most lines in my large test file, but a few of them have no second section [organism name]. For instance:

RecName: Full=Quaternary ammonium compound-resistance protein qacF; AltName: Full=Quaternary ammonium determinant F [qacF]

where I get a blank instead of the gene name.

I was trying another approach: looking for a bracketed [a-zA-Z] string of at most 5-6 characters in each line, since most genes here have 4 characters. That way the match would skip the longer [a-zA-Z1-9] organism section and grab the gene [ ].

ADD REPLY • link 12.2 years ago by bioinfo ▴ 840

Then you can use the Python script above to grab the content between the last pair of "[ ]". Or use Irsan's solution if you prefer sed: just remove the first pair of "[ ]" from the sed command.

ADD REPLY • link 12.2 years ago by Zhen Sun ▴ 60

excellent...!!!

ADD REPLY • link 12.2 years ago by bioinfo ▴ 840

score 1 · Answer 2 · 2013-04-26

12.2 years ago

Irsan ★ 7.8k

This makes me dizzy to look at, but it does the job:

sed 's:.*\[.*\].*\[\(.*\)\]:\1:g' yourFile.txt

gives:

gedP
aucB / ybdE
dacG

ADD COMMENT • link 12.2 years ago by Irsan ★ 7.8k

Same problem with:

RecName: Full=Quaternary ammonium compound-resistance protein qacF; AltName: Full=Quaternary ammonium determinant F [qacF]

ADD REPLY • link 12.2 years ago by bioinfo ▴ 840

This will give you the last pair of [ ]:

sed 's:.*\[\(.*\)\]:\1:g' yourFile.txt
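
The reason this grabs the last pair is that the leading ".*" is greedy. A rough Python equivalent of the same pattern, offered as a sketch under that reading rather than a tested replacement for the sed command, looks like this (the function name last_bracket_content is made up for the example):

```python
import re

# Same idea as the sed expression: the leading ".*" is greedy, so it
# consumes as much of the line as it can, and the "[" that follows can
# only match the last opening bracket on the line.
LAST_PAIR = re.compile(r".*\[(.*)\]")


def last_bracket_content(line):
    """Return the text inside the last "[ ]" pair, or None if there is none."""
    match = LAST_PAIR.match(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(last_bracket_content("DacG gamma [uncultured bacterium] [dacG]"))
    print(last_bracket_content("Full=Quaternary ammonium determinant F [qacF]"))
```

Unlike sed, which prints a non-matching line unchanged, this sketch returns None for lines without brackets so they can be filtered out explicitly.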

ADD REPLY • link 12.2 years ago by Zhen Sun ▴ 60

score 0 · Answer 3 · 2019-07-17

This might be a bit of overkill, but to generalize the solution you can use a named entity recognition (NER) model. I wrote a blog post describing how to build one in ~5 minutes.

https://spacy.io/ provides an easy-to-use Python package for building an NER model.
https://sysrev.com/ hosts the gene hunter annotation data used to build a quick NER model.
http://whichgenesmatter.com/ shows this working on PubMed abstracts.

The blog post is here: https://blog.sysrev.com/simple-ner/