How To Extract Gene Names From An Unstructured Text File?
3
1
9.4 years ago
bioinfo ▴ 830

This is probably not strictly a bioinformatics problem of the kind we are supposed to discuss here, but I got stuck on this general scripting problem and hoped for a quick solution. Please guide me a bit.

The description lines below are one column of my large text file. I have been trying to extract the last bracketed section of each description line (e.g. gedP, aucB / ybdE, etc.). The rows follow no specific pattern that I can see, other than: some text [organism name] [gene name]. These three sections appear to be space-separated, so I need to search line by line for the gene name in the second [ ] and extract it (e.g. gedP).

I could do it easily if the sections were tab-separated, but the space separation is causing me problems (still learning!). Any suggestions?

 bacterial transcriptional activator protein GedP [Escherichia coli MS 182-1] [gedP]
large cation system protein AucB [Escherichia coli UTI89] [aucB / ybdE]
DacG gamma [uncultured bacterium] [dacG]
RecName: Full=Quaternary ammonium compound-resistance protein qacF; AltName: Full=Quaternary ammonium determinant F [qacF]

awk • 3.8k views
1
9.4 years ago
Zhen Sun ▴ 60

If you are sure about the pattern, you can split each line by "[", get the third column and remove the ending "]" like this:

cut -d '[' -f3 input_file | sed "s/]\$//"


Or this python script can get you the content inside the last "[ ]".

#!/usr/bin/env python3

import fileinput

# Print the text between the last "[" and the last "]" on each input line.
for line in fileinput.input():
    print(line[line.rfind("[") + 1:line.rfind("]")])
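
One caveat with the slicing approach: if a line contains no brackets at all, both `rfind` calls return -1 and the slice quietly yields the whole line minus its last character. Below is a minimal sketch of a regex-based variant that returns `None` in that case. The function name `last_bracket` is made up for illustration, and the sketch assumes the gene name is always the final bracketed group on the line.

```python
import re

# Matches the final "[...]" group on a line, allowing trailing whitespace.
LAST_BRACKET = re.compile(r"\[([^\[\]]*)\]\s*$")

def last_bracket(line):
    """Return the text inside the last [ ] on the line, or None if there is none."""
    match = LAST_BRACKET.search(line)
    return match.group(1).strip() if match else None

# Sample lines taken from the question.
samples = [
    " bacterial transcriptional activator protein GedP [Escherichia coli MS 182-1] [gedP]",
    "large cation system protein AucB [Escherichia coli UTI89] [aucB / ybdE]",
    "DacG gamma [uncultured bacterium] [dacG]",
    "RecName: Full=Quaternary ammonium compound-resistance protein qacF; "
    "AltName: Full=Quaternary ammonium determinant F [qacF]",
]

for sample in samples:
    print(last_bracket(sample))
```

On the four sample lines this should print gedP, aucB / ybdE, dacG and qacF, one per line.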

0

That's a very good idea: split each line into three sections on "[" and then grab the third part (-f3). It worked for most lines in my large test file, but a few of them have no second section [organism name]. For instance:

RecName: Full=Quaternary ammonium compound-resistance protein qacF; AltName: Full=Quaternary ammonium determinant F [qacF]


Here I get a blank instead of the gene name.

I was trying another approach: looking in each line for a bracketed [a-zA-Z] string of at most 5-6 characters, since most genes here have 4 characters. That way the match would skip the longer [a-zA-Z1-9] organism section and grab the gene [ ].
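
As a quick check of that length-based idea, here is a minimal sketch run against the sample lines from the question. The pattern and the name `short_bracket_tokens` are assumptions made for illustration, not something from the thread. It finds single short tokens, but it misses entries such as `aucB / ybdE` that contain spaces and a slash, and it would also pick up any organism tag that happens to be one short word.

```python
import re

# Hypothetical pattern for the length-based idea: a bracketed run of
# 1 to 6 letters or digits with no spaces inside.
SHORT_TOKEN = re.compile(r"\[([A-Za-z0-9]{1,6})\]")

def short_bracket_tokens(line):
    """Return every short, space-free bracketed token found on the line."""
    return SHORT_TOKEN.findall(line)

lines = [
    "bacterial transcriptional activator protein GedP [Escherichia coli MS 182-1] [gedP]",
    "large cation system protein AucB [Escherichia coli UTI89] [aucB / ybdE]",
    "DacG gamma [uncultured bacterium] [dacG]",
]

for text in lines:
    print(short_bracket_tokens(text))
```

The second line yields an empty list, which is why matching on the last bracket pair is the safer route here.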

1

Then you can use the Python script above to grab the content between the last pair of "[ ]". Or use Irsan's solution if you prefer sed; just remove the first pair of "[ ]" from the sed command.

0

excellent...!!!

1
9.4 years ago
Irsan ★ 7.6k

Looking at this makes me dizzy, but it does the job:

sed 's:.*\[.*\].*\[\(.*\)\]:\1:g' yourFile.txt


gives:

gedP
aucB / ybdE
dacG

0

same problem with:

RecName: Full=Quaternary ammonium compound-resistance protein qacF; AltName: Full=Quaternary ammonium determinant F [qacF]

1

This will give you the last pair of [ ]:

sed 's:.*\[\(.*\)\]:\1:g' yourFile.txt
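
The reason this picks the last pair is that the leading `.*` is greedy: it consumes as much of the line as it can while still leaving a final `[...]` for the rest of the pattern to match. A rough Python counterpart of that substitution, as a sketch only (the function name `sed_last_pair` is invented for illustration):

```python
import re

def sed_last_pair(line):
    """Mimic the sed substitution that keeps only the last [ ] content.

    The greedy ".*" at the start swallows everything up to the final "["
    that still allows the rest of the pattern to match.
    """
    return re.sub(r".*\[(.*)\]", r"\1", line)

examples = [
    "large cation system protein AucB [Escherichia coli UTI89] [aucB / ybdE]",
    "AltName: Full=Quaternary ammonium determinant F [qacF]",
    "a line without any brackets",
]

for example in examples:
    print(sed_last_pair(example))
```

As with the sed command, a line with no brackets passes through unchanged, so it may be worth filtering those out separately.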

0
3.2 years ago
tomluec ▴ 60

Might be a bit overkill, but to generalize the solution you can use an NER model. I wrote a blog post describing how to do this in ~5 minutes.

Named entity recognition is a great way to do this

1. https://spacy.io/ provides an easy python package for building an NER model
2. https://sysrev.com/ gene hunter annotation data to build a quick NER model.
3. http://whichgenesmatter.com/ shows this working on pubmed abstracts.

The blog post is here https://blog.sysrev.com/simple-ner/