Question

Renaming column of taxonomic annotation file

17 months ago

bionix ▴ 10

Hello,

I have hundreds of taxonomic annotation files. The 4th column of the files is the taxonomic rank. The first few rows of the files look like this:

6.46    387327  387327  U   0   unclassified
93.54   5610481 488 R   1   root
93.53   5609584 75743   R1  2     d__Bacteria
43.31   2597449 11790   R2  18      p__Actinobacteriota
23.04   1382144 149 R3  19        c__Actinobacteria
22.98   1378342 590 R4  20          o__Actinomycetales
22.55   1352503 35  R5  21            f__Bifidobacteriaceae
22.54   1352180 54264   R6  22              g__Bifidobacterium
9.43    565635  565635  R7  1797                  s__Bifidobacterium adolescentis

R1=Domain, R2=Phylum, R3=Class, R4=Order, R5=Family, R6=Genus, and R7=Species.

I want to replace R[2-7] with the uppercase first letter of the respective taxonomic rank. I also want to remove the double underscores and the prefix letters before the taxonomic names so that the files can be used with another tool. The desired output should look like this:

6.46    387327  387327  U   0   unclassified
93.54   5610481 488 R   1   root
93.53   5609584 75743   R1  2     Bacteria
43.31   2597449 11790   P   18      Actinobacteriota
23.04   1382144 149 C   19        Actinobacteria
22.98   1378342 590 O   20          Actinomycetales
22.55   1352503 35  F   21            Bifidobacteriaceae
22.54   1352180 54264   G   22              Bifidobacterium
9.43    565635  565635  S   1797                  Bifidobacterium adolescentis

Please note that Bacteria (domain) should keep the R1 value. Since I have hundreds of such files, making the changes in Excel or a text editor is impractical.

Could you please suggest a better option?

Many thanks for your time and help!

Column rank manipulation Taxonomic annotation • 670 views

score 1 · Answer 1 · 2022-11-09

If your file is called input.txt, you can use:

sed '/R[2-7]/{h;s/.*\([a-z]\)__.*/\u\1/g;x;G;s/\n/ /g};s/[a-z]__//g' < input.txt | awk '{if($4 ~ /R[2-7]/) {$4=$NF;$NF=""; NF--; print} else {print $0}}' > output.txt

This is just a quick-and-dirty approach; purists would probably find a way to do it entirely in either sed or awk.
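For what it's worth, an awk-only variant might look like the sketch below. It assumes the files are tab-delimited with the rank in column 4 and the taxon name in column 6, which is a guess based on the example rows; rename_ranks is just a name picked for illustration. The pipeline above re-joins the rewritten rows with single spaces, whereas this sketch keeps the separators and the indentation of the names. It only uses POSIX awk functions, so no GNU sed is needed.

```shell
# rename_ranks: hypothetical helper name; reads report files (or stdin), writes to stdout.
# Assumes tab-separated columns with the rank in column 4 and the name in column 6.
rename_ranks() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    # Optional indentation, one lowercase letter, then a double underscore.
    if (match($6, /^ *[a-z]__/)) {
      letter = substr($6, RLENGTH - 2, 1)            # the prefix letter, e.g. "p"
      if ($4 ~ /^R[2-7]$/) $4 = toupper(letter)      # R1 (domain) is left alone
      # Keep the indentation, drop the "x__" prefix.
      $6 = substr($6, 1, RLENGTH - 3) substr($6, RLENGTH + 1)
    }
    print
  }' "$@"
}
```

It would be called as rename_ranks input.txt > output.txt.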

What is being done in the sed part:

/R[2-7]/ matches lines that contain this pattern, so the following operations are restricted to those lines:
h copies the pattern space to the hold space, while s/.*\([a-z]\)__.*/\u\1/g replaces the whole line with the letter that precedes __, converted to uppercase.
x;G;s/\n/ /g swaps the hold and pattern spaces, appends the hold space (now the uppercase letter) to the pattern space (the original line again), and replaces the newline introduced by G with a space.
On all lines, s/[a-z]__//g removes the rank prefix (e.g. p__) from the taxon names.
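To make the effect of the sed stage concrete, it can be run on a single row by itself. This is only an illustration: it assumes GNU sed (for \u) and a tab-separated input row, and sed_stage is a made-up wrapper name. The row should come out with the p__ prefix removed and an uppercase P appended after a space.

```shell
# sed_stage: hypothetical wrapper around the sed part of the command (GNU sed assumed).
sed_stage() {
  sed '/R[2-7]/{h;s/.*\([a-z]\)__.*/\u\1/g;x;G;s/\n/ /g};s/[a-z]__//g'
}

# One phylum row; the fields are assumed to be tab-separated.
printf '43.31\t2597449\t11790\tR2\t18\t    p__Actinobacteriota\n' | sed_stage
```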

The result is a file with an extra column at the end of each line where the information from the prefix needs to be preserved. awk is used for the next step because it has good support for restricting actions to specific columns.

if($4 ~ /R[2-7]/) is a condition that checks whether the fourth column is one of R2 to R7. If not, the line is printed unchanged with print $0.
For matching lines, {$4=$NF;$NF=""; NF--; print} replaces the fourth column with the contents of the extra column created by sed. NF is a built-in variable that holds the number of fields, so $NF always denotes the last field. The temporary extra field is then dropped again with $NF=""; NF--, and the line is printed.

I hope this is what you wanted. In any case, you can of course customize the command to your needs.

PS: If you have a Mac, you need to install GNU sed (gsed), since the default sed on macOS doesn't support \u for the case conversion. Alternatively, remove the \u from the sed replacement and put $4=toupper($NF) in the awk part.
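Since the question is about hundreds of files, the command could be wrapped in a shell loop. The sketch below makes several assumptions: the function name convert_all, the .txt extension and the separate output directory are all made up for illustration. It uses the toupper() variant from the PS, so the sed part carries no \u.

```shell
# convert_all IN_DIR OUT_DIR: hypothetical helper; converts every *.txt report in IN_DIR
# and writes a file of the same name to OUT_DIR (originals are left untouched).
convert_all() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for f in "$in_dir"/*.txt; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    sed '/R[2-7]/{h;s/.*\([a-z]\)__.*/\1/;x;G;s/\n/ /g};s/[a-z]__//g' < "$f" |
      awk '{if($4 ~ /R[2-7]/) {$4=toupper($NF);$NF=""; NF--; print} else {print $0}}' \
      > "$out_dir/$(basename "$f")"
  done
}
```

A call such as convert_all reports converted would then process a whole directory in one go; writing to a second directory keeps the original files available in case the result needs checking.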