This is probably not 100% a bioinformatics problem of the kind we are supposed to discuss here, but I got stuck on this general scripting problem and hoped to get a quick solution. Please guide me a bit.
Below is the Description column from my large txt file. I have been trying to extract the last bracketed section of each description line (e.g. gedP, aucB / ybdE, etc.). I can see no specific pattern across rows except: text text text [organism name] [gene name] (these three sections look space separated). So I need to search line by line for the gene name in the second [ ] and extract it (e.g. gedP).
I could do it easily if the sections were tab separated, but the space separation creates a problem for me (still learning!!). I am trying to solve it. Any suggestions?
bacterial transcriptional activator protein GedP [Escherichia coli MS 182-1] [gedP]
large cation system protein AucB [Escherichia coli UTI89] [aucB / ybdE]
DacG gamma [uncultured bacterium] [dacG]
RecName: Full=Quaternary ammonium compound-resistance protein qacF; AltName: Full=Quaternary ammonium determinant F [qacF]
That's a very good idea: split each line into three sections on "[" and then grab part 3 (-f3). It worked for most lines in my large test file, but a few of them have no second section [organism name]. For instance, there are lines where I get a blank space instead of the gene name.
I was trying another way: looking for [a-zA-Z] in each line with at most 5-6 characters, as most genes here have 4 characters. The pattern would therefore skip the long [a-zA-Z1-9] organism section and grab the gene [ ].
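A minimal Python sketch of this length-capped idea, in case it helps. The cap MAX_LEN and the allowed character set are assumptions chosen for illustration, not values from the thread. A cap of 5-6 would miss "aucB / ybdE" (11 characters), so the sketch uses 12 and also allows spaces and "/".

```python
import re

# Hypothetical cap on label length; "aucB / ybdE" needs 11 characters.
MAX_LEN = 12

# Match [ ... ] whose content is 1..MAX_LEN letters, digits, spaces or "/".
SHORT_BRACKET = re.compile(r"\[([A-Za-z0-9 /]{1,%d})\]" % MAX_LEN)

def short_bracket_labels(line):
    """Return the contents of every short bracketed section in the line."""
    return SHORT_BRACKET.findall(line)

if __name__ == "__main__":
    print(short_bracket_labels("DacG gamma [uncultured bacterium] [dacG]"))
```

The heuristic is fragile: a short organism name such as [Homo sapiens] fits under the cap and would be returned too, so taking the last bracketed section is probably the safer route.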
Then you can use the Python script above to grab the content between the last pair of "[ ]". Or use Irsan's solution if you prefer sed; just remove the first pair of "[ ]" from the sed command.
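For anyone reading without the earlier script at hand, here is a minimal sketch of the same idea: take whatever sits inside the last pair of "[ ]" on each line. It is not the script referred to above, and the function name last_bracket is made up for illustration. It assumes the gene name is always the final bracketed section on the line.

```python
import re

# A bracketed section with no nested brackets, sitting at the end of the line
# (trailing whitespace allowed).
LAST_BRACKET = re.compile(r"\[([^\[\]]*)\]\s*$")

def last_bracket(line):
    """Return the text inside the last [ ] on the line, or None if absent."""
    match = LAST_BRACKET.search(line)
    return match.group(1).strip() if match else None

if __name__ == "__main__":
    samples = [
        "bacterial transcriptional activator protein GedP [Escherichia coli MS 182-1] [gedP]",
        "RecName: Full=Quaternary ammonium compound-resistance protein qacF; AltName: Full=Quaternary ammonium determinant F [qacF]",
    ]
    for sample in samples:
        print(last_bracket(sample))
```

Because the pattern is anchored at the end of the line, it gives the gene name whether or not the [organism name] section is present.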
excellent...!!!