Question: How to remove characters after a specific symbol from all columns, and separate all columns except the first by a comma and a space
4 weeks ago by
haneenih770
KAUST
haneenih770 wrote:

Hi all,

I am trying to modify the format of a big file.

The file is tab-delimited. Here is what it looks like:

AB11.1  CB:0078_0.53    CB:0044464_0.42   CB:0005623_0.466
AB10.1  
AB01.2  CB:0036_0.4   CB:0003824_0.4       CB:0005575_0.7    CB:0005622_0.2 CB:0005623_0.6
AB01.2  CB:0036_0.3   CB:0003824_0.43      CB:0005575_0.7    CB:0005622_0.1

Please note that the number of columns is not the same in every row. A row can have more than 400 columns or only 1, and a few rows are empty apart from the ID, such as AB10.1.

I want to modify the format, first by removing all characters that come after the symbol _, including the symbol itself. Then I want to change the separators:

1- The first column should be followed by a tab

2- From the second column to the last, the columns should be separated by a comma followed by a space

So output file should look like this:

AB11.1    CB:0078, CB:0044464, CB:0005623
AB10.1  
AB01.2    CB:0036, CB:0003824, CB:0005575, CB:0005622, CB:0005623
AB01.2    CB:0036, CB:0003824, CB:0005575, CB:0005622

How can I do that in a bash script (I have very basic knowledge)? Or maybe in Python (I have never used it)?

4 weeks ago by
RamRS27k
Houston, TX
RamRS27k wrote:

Use sed for requirement 1. You want to remove all matches of _\S+ (or, if your format only has digits and . following the underscore, all matches of _[0-9.]+).
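A minimal sketch of that sed step, assuming the scores after the underscore contain only digits and dots, and that no ID in the first column contains an underscore (both hold for the sample in the question). The function name strip_scores is hypothetical. The -E flag enables extended regular expressions; the \S shorthand is a GNU sed extension, so the sketch uses the bracket form instead.

```shell
# strip_scores (hypothetical name): reads the table on stdin, writes it to stdout.
strip_scores() {
    # Delete every underscore together with the digits and dots that follow it.
    sed -E 's/_[0-9.]+//g'
}

# Example on one line of the sample data; the tabs are left untouched.
printf 'AB11.1\tCB:0078_0.53\tCB:0044464_0.42\n' | strip_scores
```

This only covers the first requirement; the columns are still tab-separated afterwards.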

Use awk or perl for the second requirement. It will be a bit tricky (you may have to loop from 2 to NF), but it will be easier than using R or learning python.
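The loop from 2 to NF could be sketched as follows. This is one possible approach rather than a definitive one: it does both requirements in a single awk pass, assumes the input is strictly tab-delimited, and uses a hypothetical function name, reformat_columns. Rows that have only an ID are printed as the ID followed by a tab.

```shell
# reformat_columns (hypothetical name): reads the table on stdin, writes it to stdout.
reformat_columns() {
    awk -F'\t' '{
        out = $1 "\t"                # first column, followed by a tab
        sep = ""
        for (i = 2; i <= NF; i++) {  # loop from the second to the last column
            f = $i
            sub(/_.*/, "", f)        # drop the underscore and everything after it
            if (f == "") continue    # skip empty fields, e.g. after a trailing tab
            out = out sep f
            sep = ", "               # later columns are joined by comma and space
        }
        print out
    }'
}

# Example with a full row and an ID-only row from the sample data.
printf 'AB11.1\tCB:0078_0.53\tCB:0044464_0.42\tCB:0005623_0.466\nAB10.1\t\n' | reformat_columns
```

For a real file, the call would look like reformat_columns < input.tsv > output.tsv, where both file names are placeholders.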


Yes, I managed to do it with awk and sed.

To remove the last 6 characters from each column of a file: awk '{for(i=1;i<=NF;i++) sub(/......$/,X,$i)}1'

written 4 weeks ago by haneenih770

That assumes you need to remove exactly 6 characters from each field, which doesn't seem to be the case. Please be careful with such assumptions.

written 4 weeks ago by RamRS27k

Because of this, the first column will be removed as well, since it contains only 6 characters.

written 4 weeks ago by cpad011213k
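To make the two objections concrete, here is a small sketch that runs the fixed-width command from the reply above next to a delimiter-based alternative on one sample row. The function names fixed_width_trim and delimiter_trim are hypothetical, and the alternative assumes tab-delimited input and that only columns 2 onward carry an underscore suffix.

```shell
sample=$(printf 'AB11.1\tCB:0078_0.53\tCB:0044464_0.42')

# Command from the reply above: drops the last six characters of every field.
fixed_width_trim() {
    awk '{for(i=1;i<=NF;i++) sub(/......$/,X,$i)}1'
}

# Alternative: drop the underscore and whatever follows it, from column 2 onward,
# keeping tabs as the input and output separator.
delimiter_trim() {
    awk 'BEGIN{FS=OFS="\t"} {for(i=2;i<=NF;i++) sub(/_.*/,"",$i)}1'
}

printf '%s\n' "$sample" | fixed_width_trim
printf '%s\n' "$sample" | delimiter_trim
```

With the fixed-width version, the six-character ID AB11.1 comes out empty and CB:0078_0.53 is cut down to CB:007, because its score suffix is only five characters long. The delimiter-based version keeps the ID and yields CB:0078 and CB:0044464.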