Question: [SOLVED] Manipulating text in a Linux environment
theodore wrote:

Dear all.

 

I do not know whether a tool for this already exists, but I would like to perform the following steps on a tab-delimited file.

The file is as follows:

chr1 0 600 15 Repetitive/CNV 0 . 0 600 245,245,245
chr1 1000 1600 8 Insulator 0 . 1000 1600 10,190,254
chr1 100004000 100005200 2 Weak Promoter 0 . 100004000 100005200 255,105,105
chr1 100005200 100016800 13 Heterochrom/lo 0 . 100005200 100016800 245,245,245
chr1 10001600 10014800 13 Heterochrom/lo 0 . 10001600 10014800 245,245,245
chr1 100016800 100022800 12 Repressed 0 . 100016800 100022800 127,127,127
chr1 100022800 100026800 13 Heterochrom/lo 0 . 100022800 100026800 245,245,245
chr1 100026800 100028600 12 Repressed 0 . 100026800 100028600 127,127,127
chr1 100028600 100037000 13 Heterochrom/lo 0 . 100028600 100037000 245,245,245
chr1 100037000 100046600 12 Repressed 0 . 100037000 100046600 127,127,127
chr1 100046600 100046800 6 Weak Enhancer 0 . 100046600 100046800 255,252,4
chr1 100046800 100047000 2 Weak Promoter 0 . 100046800 100047000 255,105,105
chr1 100047000 100047200 4 Strong Enhancer 0 . 100047000 100047200 250,202,0
chr1 100047200 100047400 6 Weak Enhancer 0 . 100047200 100047400 255,252,4
chr1 100047400 100054200 13 Heterochrom/lo 0 . 100047400 100054200 245,245,245
chr1 100054200 100055000 12 Repressed 0 . 100054200 100055000 127,127,127
chr1 100055000 100087400 13 Heterochrom/lo 0 . 100055000 100087400 245,245,245
chr1 100087400 100087600 6 Weak Enhancer 0 . 100087400 100087600 255,252,4

First, I would like to remove the number and space before the description of the region, e.g. 6 Weak Enhancer ---> Weak Enhancer.

Second, I would like to count all Weak Enhancer entries, and likewise every other identical value in column 4, and print something like the following:

Weak Enhancer 4
Heterochrom 20
. . .

I tried:

sort 'file.bed' | awk '{print $4}' | uniq -c -D -i

or

sort 'file.bed' | uniq -c -D -i

to no avail. Any help will be highly appreciated.

I should state that I want to do this as simply as possible: I have no real programming skills, and if OpenOffice can do it, I'm fine with that!

Thank you in advance

Theodore

modified 23 months ago by Biostar • written 4.9 years ago by theodore
Devon Ryan wrote:

cut -f 4 file.bed | sed 's/[0-9]* //' | sort | uniq -c

or something along those lines.

Edit: For the sake of clarity, the cut -f 4 portion extracts column 4 and the sed command just replaces a number followed by a space with nothing (i.e., it removes that pattern).
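To see the steps in action, here is a small sketch on three made-up lines laid out like the data in the question. The file name sample.bed and its contents are assumptions for illustration only; the fields are separated by real tab characters.

```shell
# Build a tiny tab-delimited sample (hypothetical data, same layout as the question).
printf 'chr1\t0\t600\t6 Weak Enhancer\t0\n'       >  sample.bed
printf 'chr1\t600\t800\t13 Heterochrom/lo\t0\n'   >> sample.bed
printf 'chr1\t800\t1000\t6 Weak Enhancer\t0\n'    >> sample.bed

# Step 1: keep only column 4, e.g. "6 Weak Enhancer".
cut -f 4 sample.bed

# Full pipeline: strip the leading number, sort, and count identical labels.
counts=$(cut -f 4 sample.bed | sed 's/[0-9]* //' | sort | uniq -c)
echo "$counts"
```

The sort step matters because uniq only merges identical lines that are adjacent. Note that the sed pattern is not anchored, so on a label with no leading number it would remove the first space it finds; 's/^[0-9][0-9]* //' is a stricter variant if that is a concern.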

modified 4.9 years ago • written 4.9 years ago by Devon Ryan

I've run the pipeline and it works great, although I get the following:

     45 10_Txn_Elongation
    133 11_Weak_Txn
     95 12_Repressed

It seems as if sed had replaced the spaces (\s) with underscores (_)?

written 4.9 years ago by theodore

It shouldn't do that, at least not unless you changed it to be something like sed 's/ /_/'.
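One way to check where the underscores come from is to run the same sed command on a label that already contains them. In this sketch the labels are copied from the output quoted above, but that they look like this in the input file is only an assumption. The pattern finds no space to match, so the line passes through unchanged, which would suggest the underscores were already in the file rather than added by sed.

```shell
# A label with underscores and no spaces: sed has nothing to match and leaves it alone.
unchanged=$(printf '10_Txn_Elongation\n' | sed 's/[0-9]* //')
echo "$unchanged"

# A variant that strips the leading number whether a space or an underscore follows it.
stripped=$(printf '10_Txn_Elongation\n12 Repressed\n' | sed 's/^[0-9]*[ _]//')
echo "$stripped"
```

If the file really uses the number_label form, the second pattern may be the more useful one.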

written 4.9 years ago by Devon Ryan

chefer wrote:

Using your original file, you can do the unique count (col 5 in the example) like this:

cut -f 5 test.bed | sort | uniq -c

Which gives you this:

6 Heterochrom/lo
1 Insulator
1 Repetitive/CNV
4 Repressed
1 Strong Enhancer
3 Weak Enhancer
2 Weak Promoter

Edit: I misunderstood the example format; this will not work on the input dataset. You should cut on column 4 and then do the replacement as suggested by the upvoted post.

modified 4.9 years ago • written 4.9 years ago by chefer

It's understandably unclear from the question, but 2 Weak Promoter appears to be an example of a single column value, rather than being split into two columns. It appears you split things into different columns when you made the file on your local system.
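A quick way to settle how many tab-separated columns a file really has is to count the fields on each line. A minimal sketch, assuming a hypothetical file cols.bed in which the state label is a single field that contains spaces:

```shell
# Hypothetical two-line sample; "2 Weak Promoter" sits inside one tab-separated field.
printf 'chr1\t0\t600\t2 Weak Promoter\t0\n'    >  cols.bed
printf 'chr1\t600\t800\t12 Repressed\t0\n'     >> cols.bed

# Split on tabs only: every line should report the same number of fields.
by_tab=$(awk -F'\t' '{print NF}' cols.bed | sort -u)
echo "fields when splitting on tabs: $by_tab"

# Default awk splitting (any whitespace) breaks the label apart, so the counts differ.
by_space=$(awk '{print NF}' cols.bed | sort -u | tr '\n' ' ')
echo "fields when splitting on whitespace: $by_space"
```

If the tab-only count is constant but the whitespace count varies, the label is one column with embedded spaces, and cut -f (which splits on tabs by default) is the safer tool.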

written 4.9 years ago by Devon Ryan

It is a tab-delimited file; the 4th column consists of (number)(space)(description). The description itself can contain spaces.

I do not know how to preserve the tabs when copying and pasting.

written 4.9 years ago by theodore

Some ways to paste or represent a tab in the terminal are discussed here.
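For instance, printf can write real tab characters, and od -c makes them visible as \t. This is only a small sketch, and the file name tabs.txt is a placeholder:

```shell
# printf turns \t into a real tab character.
printf 'chr1\t0\t600\t6 Weak Enhancer\n' > tabs.txt

# od -c prints each character; tabs show up as \t, while spaces stay blank.
od -c tabs.txt

# Count the tabs on the line to confirm there are three separators.
ntabs=$(tr -cd '\t' < tabs.txt | wc -c | tr -d ' ')
echo "$ntabs"
```

Running od -c on a few lines of the real file (for example via head) shows at a glance which gaps are tabs and which are spaces.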

written 4.9 years ago by Neilfws

I'll try it tomorrow. Thank you

written 4.9 years ago by theodore

theodore wrote:

cut -f 9 | sed 's/[0-9]* //' | sort | uniq -c | sed 's/* //' | awk '{print $2,$3"\t"$1}'

I got it thanks to your recommendations.

The above pipeline worked miracles for me.
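The same kind of table can also be sketched in a single awk step, which avoids having to guess how many words each label has. This is an alternative under stated assumptions, not the pipeline above: it assumes the label sits in column 4 (adjust the field number to your file) and uses a made-up file labels.bed.

```shell
# Hypothetical sample with the "number label" pair in column 4.
printf 'chr1\t0\t600\t6 Weak Enhancer\t0\n'      >  labels.bed
printf 'chr1\t600\t800\t12 Repressed\t0\n'       >> labels.bed
printf 'chr1\t800\t1000\t6 Weak Enhancer\t0\n'   >> labels.bed

# Split on tabs, drop the leading number from column 4, tally each label,
# then print "label<TAB>count" sorted by label.
table=$(awk -F'\t' '{ sub(/^[0-9]+ /, "", $4); n[$4]++ }
                    END { for (k in n) print k "\t" n[k] }' labels.bed | sort)
echo "$table"
```

Because the label is kept as one tab-separated field, one-word and multi-word descriptions are handled the same way.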

written 4.9 years ago by theodore