Renaming column of taxonomic annotation file
17 months ago
bionix ▴ 10

Hello,

I have hundreds of taxonomic annotation files. The 4th column of the files is the taxonomic rank. The first few rows of the files look like this:

6.46    387327  387327  U   0   unclassified
93.54   5610481 488 R   1   root
93.53   5609584 75743   R1  2     d__Bacteria
43.31   2597449 11790   R2  18      p__Actinobacteriota
23.04   1382144 149 R3  19        c__Actinobacteria
22.98   1378342 590 R4  20          o__Actinomycetales
22.55   1352503 35  R5  21            f__Bifidobacteriaceae
22.54   1352180 54264   R6  22              g__Bifidobacterium
9.43    565635  565635  R7  1797                  s__Bifidobacterium adolescentis

R1=Domain, R2=Phylum, R3=Class, R4=Order, R5=Family, R6=Genus, and R7=Species.

I want to replace R[2-7] with the uppercase first letter of the respective taxonomic rank (for example, R2 becomes P for Phylum). I also want to remove the prefixes and double underscores in front of the taxonomic names so that the files can be used with another tool. The desired output should look like this:

6.46    387327  387327  U   0   unclassified
93.54   5610481 488 R   1   root
93.53   5609584 75743   R1  2     Bacteria
43.31   2597449 11790   P   18      Actinobacteriota
23.04   1382144 149 C   19        Actinobacteria
22.98   1378342 590 O   20          Actinomycetales
22.55   1352503 35  F   21            Bifidobacteriaceae
22.54   1352180 54264   G   22              Bifidobacterium
9.43    565635  565635  S   1797                  Bifidobacterium adolescentis

Please note that Bacteria (domain) should keep the R1 value. Since I have hundreds of such files, making the changes in Excel or a text editor is impractical.

Could you please suggest a better option?

Many thanks for your time and help!

17 months ago

If your file is called input.txt, you can use:

sed '/R[2-7]/{h;s/.*\([a-z]\)__.*/\u\1/g;x;G;s/\n/ /g};s/[a-z]__//g' < input.txt | awk '{if($4 ~ /R[2-7]/) {$4=$NF;$NF=""; NF--; print} else {print $0}}' > output.txt

It's just the quick and dirty approach - the purists would probably have found a way to do it entirely in either sed or awk.
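Since the question mentions hundreds of files, the same pipeline can be wrapped in a shell loop. This is only a sketch: the reports/ and converted/ directory names and the .txt suffix are assumptions, and the first two commands merely create a small demo file so that the snippet runs as written. It needs GNU sed because of \u.

```shell
# Demo input (hypothetical file name); replace with your real files.
mkdir -p reports converted
printf '43.31\t2597449\t11790\tR2\t18\t      p__Actinobacteriota\n' > reports/sample1.txt

# Run the sed | awk pipeline on every file and write each result to
# converted/ under the same file name, so the originals stay untouched.
for f in reports/*.txt; do
    sed '/R[2-7]/{h;s/.*\([a-z]\)__.*/\u\1/g;x;G;s/\n/ /g};s/[a-z]__//g' < "$f" |
        awk '{if($4 ~ /R[2-7]/) {$4=$NF;$NF=""; NF--; print} else {print $0}}' \
        > "converted/$(basename "$f")"
done
```

Writing to a separate directory rather than overwriting in place makes it easy to compare a few converted files with their originals before deleting anything.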

What the sed part does:

  • /R[2-7]/ matches lines that contain this pattern, so the following operations are restricted to those lines:
  • h copies the pattern space (the current line) to the hold space, while s/.*\([a-z]\)__.*/\u\1/g replaces the whole line with the letter that precedes __, converted to uppercase.
  • x;G;s/\n/ /g swaps hold and pattern space, appends the hold space (now the uppercase letter) to the pattern space, and replaces the newline that G introduces with a space.
  • On all lines, s/[a-z]__//g removes the rank prefix (such as p__) from the taxon names.

The result is a file with an extra column at the end of every line where we want to preserve the information from the prefix. Now awk is used, because it has good support for restricting actions to specific columns.

  • if($4 ~ /R[2-7]/) is a condition that checks whether the fourth column is anything between R2 and R7. If not, the line is printed unchanged with print $0.
  • For matching lines, {$4=$NF;$NF=""; NF--; print} replaces the fourth column with the contents of the extra column that sed appended. NF is a built-in variable holding the number of fields, so $NF always denotes the last field. Afterwards, $NF=""; NF-- drops that temporary extra field again, and the line is printed.
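To see the awk step in isolation, one can feed it a single line as it would look after the sed step, with the extra letter appended. The input line here is made up for illustration, and NF-- relies on the awk implementation rebuilding the record when NF is changed (gawk and mawk do this).

```shell
# One line as it would look after the sed step:
# the rank letter "P" sits in an extra last field.
result=$(printf '43.31 2597449 11790 R2 18 Actinobacteriota P\n' |
    awk '{if($4 ~ /R[2-7]/) {$4=$NF;$NF=""; NF--; print} else {print $0}}')

# Show the rebuilt line: field 4 now holds the letter, the helper field is gone.
echo "$result"
```

Because awk rebuilds the line after a field is assigned, the output fields are joined with the output field separator, which is a single space by default.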

I hope this is what you wanted. In any case, you can of course customize the command to your needs.

PS: If you have a Mac, you need to install GNU sed (gsed), since the default sed on macOS doesn't support \u for the case conversion. Alternatively, drop the \u from the sed expression and put $4=toupper($NF) in the awk part.


Matthias Zepper thank you very much for the solution and explaining it to me. I tried it on the Linux system. The command line is doing its job, but after the conversion (i.e., R# to P/C/O..etc.) 3rd row onwards all columns are becoming space separated. Is there a way to use tab as a column separator (3rd row onwards)?

Many thanks!


Yes. You can specify the output separator in awk with OFS. Add a BEGIN statement to the awk command like so:

awk 'BEGIN{OFS="\t"}; {...}'

Here {...} stands for the existing action block. The rest of the command is unchanged.
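One caveat: awk still splits the input on any whitespace, so on converted lines a two-word species name such as Bifidobacterium adolescentis ends up in two tab-separated columns, and the indentation of the name column is lost. If the input files are tab-delimited (an assumption, the thread does not confirm it), a variant that does everything in awk and splits only on tabs might look like the sketch below. It also avoids \u, so it does not depend on GNU sed. sample.txt and its two rows are made up for the demonstration.

```shell
# Demo input (hypothetical): two tab-separated rows in the layout from the question.
printf '93.53\t5609584\t75743\tR1\t2\t    d__Bacteria\n' > sample.txt
printf '9.43\t565635\t565635\tR7\t1797\t                  s__Bifidobacterium adolescentis\n' >> sample.txt

awk 'BEGIN { FS = OFS = "\t" }      # split and join on tabs only
$4 ~ /^R[2-7]$/ {
    name = $6
    sub(/^ */, "", name)            # ignore the indentation before the prefix
    $4 = toupper(substr(name, 1, 1))  # p__ -> P, c__ -> C, ...
}
{
    sub(/[a-z]__/, "", $6)          # drop the rank prefix, keep the indentation
    print
}' sample.txt > sample_out.txt
```

With tab as both input and output separator, spaces inside the sixth column are left alone, so the indentation and multi-word names survive. R1 rows keep their rank code but still lose the d__ prefix, as in the desired output.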


Thanks a lot, Matthias Zepper, it solved my problem.
