Question: bash find consecutive acc. no. & insert text
6schulte wrote (13 days ago):

I have a table that I fill with taxonomy information, which I request from NCBI using efetch. How I do that is described here: C: How to get summary for acc.no. not starting with 'WP_' ?

Now I want to use a bash one-liner to find lines that contain two consecutive accession numbers. These should not occur, but if NCBI doesn't recognize the acc. no. I am providing, or the connection to the server is lost, they show up in my file because epost jumps to the next acc. no. instead of finishing the line properly.

What I am currently trying to do is find a pattern (composed of two consecutive accession numbers) and insert a new line in the middle of it.

sed -e 's/^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t/[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t\n[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t/g' $2 > text_test #the replacement does not work properly

[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]* is the pattern describing the structure of the acc. no., as they consist of 3 characters + 5 digits or 3 characters + 7 digits. (As written, the pattern matches three such characters followed by four or more digits.)

..................................................................................................................................................................

..................................................................................................................................................................

I know this is not very reader-friendly, so I will split the code into parts and explain my thoughts:

sed -e 's/^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t   (...)

This first part tells the computer what I want to find.

(...)   /[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t\n[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t/g'

This second part tells the computer what I want to put there.

(...)   $2 > text_test #the replacement does not work properly

This last part only states where to look and where to write the result to.

The second part is causing the problems. It does not write the acc. no. as they are found in the file but instead it writes the regex [a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]* as a string in the file.
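The behaviour described above is expected: the replacement side of `s/.../.../` is plain text, not a regex, so the character classes are written out literally. A minimal sketch of one possible fix, assuming GNU sed (for `-E` and the `\t` / `\n` escapes): wrap each accession number in a capture group and refer back to it with `\1` and `\2`. The file name `sample.tsv` and the variable `acc` are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU sed (-E, and the \t / \n escapes).
# sample.tsv is a hypothetical stand-in for the real table.
printf 'WP_112675856\tMicromonospora saelicesensis\tBacteria; Actinobacteria\n' > sample.tsv
printf 'CCH19814\tRSN10899\tStreptomyces sp. WAC 05977\tBacteria; Actinobacteria\n' >> sample.tsv

# Three letters (the third may be "_") followed by four or more digits.
acc='[A-Za-z]{2}[A-Za-z_][0-9]{4,}'

# \1 and \2 re-insert the two accession numbers that were actually matched,
# with a newline placed between them.
sed -E "s/^(${acc})\t(${acc})\t/\1\t\n\2\t/" sample.tsv > text_test
cat text_test
```

Lines that do not start with two accession numbers are passed through unchanged, so the command can be run over the whole table.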

..................................................................................................................................................................

..................................................................................................................................................................

Example for file content:

WP_112675856    Micromonospora saelicesensis    Bacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora
CCH19814    RSN10899    Streptomyces sp. WAC 05977  Bacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces

With CCH19814 RSN10899 being the pattern to split.

Desired result:

WP_112675856    Micromonospora saelicesensis    Bacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora
CCH19814    
RSN10899    Streptomyces sp. WAC 05977  Bacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces

Or even:

WP_112675856    Micromonospora saelicesensis    Bacteria; Actinobacteria; Micromonosporales; Micromonosporaceae; Micromonospora
CCH19814    -    -
RSN10899    Streptomyces sp. WAC 05977  Bacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces

Not related to the problem, but you can tidy your regexes up considerably by using counts after your character classes, e.g. [A-Z][A-Z][A-Z] can simply be [A-Z]{3}.

Comment written 13 days ago by Joe
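As a small illustration of the counts mentioned in the comment above, the sketch below compares the long pattern with a shorter equivalent. It assumes `grep -E` (extended regular expressions), where `{2}` and `{4,}` are valid counts; in basic regular expressions the braces would need backslashes. The test strings and the variable names `long`, `short` and `ids` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: the long character-class pattern and a shorter one using counts.
long='^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*$'
short='^[a-zA-Z]{2}[a-zA-Z_][0-9]{4,}$'

# Hypothetical test strings: three accession-like IDs and one non-match.
ids=$(printf '%s\n' WP_112675856 CCH19814 RSN10899 Streptomyces)

# grep -E enables extended regular expressions, where {n} and {n,} are counts.
long_hits=$(printf '%s\n' "$ids" | grep -E "$long")
short_hits=$(printf '%s\n' "$ids" | grep -E "$short")

# Both patterns select the same lines.
printf '%s\n' "$short_hits"
```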

I also thought about approaching this by using awk and checking if there is content in a fourth column.

-> If everything goes right, each line of the file should have three entries, separated by tabs.

In the case of two consecutive acc. no. the line has four entries, and therefore one more tab than a correct line.

Comment written 13 days ago by 6schulte
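The fourth-column idea from the comment above could be sketched roughly as follows. This is only an illustration under two assumptions: lines do not end with a trailing tab, and a faulty line contains exactly one orphaned accession number. The file names `sample.tsv` and `fixed.tsv` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only; sample.tsv is a hypothetical stand-in for the real table.
printf 'WP_112675856\tMicromonospora saelicesensis\tBacteria; Actinobacteria\n' > sample.tsv
printf 'CCH19814\tRSN10899\tStreptomyces sp. WAC 05977\tBacteria; Actinobacteria\n' >> sample.tsv

# With tab as the field separator, a correct line has 3 fields.
# A line with 4 fields starts with an orphaned accession number:
# print it with "-" placeholders, then print the remaining
# three fields as a line of their own.
awk 'BEGIN { FS = OFS = "\t" }
NF == 4 { print $1, "-", "-"; print $2, $3, $4; next }
{ print }' sample.tsv > fixed.tsv
cat fixed.tsv
```

This produces the second variant of the desired result from the question, with `-` filling the missing columns.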
6schulte wrote (13 days ago):

I have a solution:

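# Line number of the first line that contains two consecutive acc. no.
number=$(awk '/[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*/{ print NR; exit }' "$2")

# Replace the first tab on that line with a new line
sed -e "$number s/\t/\n/" "$2" > text_test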

This can run in a loop.
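One way such a loop might look is sketched below. It assumes GNU sed (for the `\t` and `\n` escapes), anchors the pattern at the start of the line with `^` (an addition not in the commands above), and stops as soon as awk finds no further match. The file `sample.tsv`, its contents, and the variable `pattern` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU sed (\t and \n escapes).
# sample.tsv is a hypothetical table; the loop works on a copy of it.
printf 'AAA11111\tBBB22222\tCCC33333\tSome species\tBacteria\n' > sample.tsv
printf 'WP_112675856\tMicromonospora saelicesensis\tBacteria\n' >> sample.tsv
cp sample.tsv text_test

# Two accession numbers at the start of a line, each followed by a tab.
pattern='^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t'

while :; do
    # Number of the first line that still starts with two acc. no.
    number=$(awk -v pat="$pattern" '$0 ~ pat { print NR; exit }' text_test)
    # No match left: stop, otherwise sed would act on every line.
    [ -z "$number" ] && break
    # Replace the first tab on that line with a new line.
    sed -e "${number}s/\t/\n/" text_test > text_test.tmp && mv text_test.tmp text_test
done
cat text_test
```

Each pass removes one tab from the file, so the loop is guaranteed to end.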


My previous version was only halfway working:

This

awk '/[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*/{ print NR; exit }' $2 > number

tells me in which line two acc. no. occur after one another.

In my case it is in the second line. This will replace the first occurrence of a tab in the second line with a new line:

sed -e "2s/\t/\n/" $2 > text_test #(1)

I'd like to dynamically input the line number though, like:

sed -e "$number s/\t/\n/" $2 > text_test #(1)

or

sed -e "$($number)s/\t/\n/" $2 > text_test

But that does not work, so I modified it (see the solution above).
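A likely reason, sketched here under the assumption of GNU sed: `> number` writes the line number to a file called `number` and does not set a shell variable, so `$number` expands to nothing and the `s` command runs without an address, that is, on every line. `$(...)` is command substitution, so `$($number)` would try to run the value as a command rather than insert it. The file names `sample.tsv`, `all_lines_split` and `one_line_split` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU sed; sample.tsv is a hypothetical table.
printf 'WP_112675856\tMicromonospora saelicesensis\tBacteria\n' > sample.tsv
printf 'CCH19814\tRSN10899\tStreptomyces sp. WAC 05977\tBacteria\n' >> sample.tsv

# "> number" creates a file called number; it does not set a shell variable.
awk '/^[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*\t[a-zA-Z][a-zA-Z][a-zA-Z_][0-9][0-9][0-9][0-9][0-9]*/{ print NR; exit }' sample.tsv > number

unset number   # make sure no variable of that name exists
# With $number empty, the sed script is just " s/\t/\n/": it has no
# address and therefore acts on every line.
sed -e "$number s/\t/\n/" sample.tsv > all_lines_split

# Reading the file back into a variable restores the intended address.
number=$(cat number)
sed -e "${number}s/\t/\n/" sample.tsv > one_line_split
```

Assigning the awk output directly with `number=$(awk ...)` avoids the intermediate file altogether.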

...

=================================

(1) The same could be achieved using an awk-version.

(2) Will unfortunately insert a new line in both the first and the second line.

Comment written 13 days ago by 6schulte