Select the highest value for each gene
7.2 years ago
Lila M ★ 1.2k

Hi everybody,

I have the coverage for different genes as:

GENE COVERAGE
A     0.7
A     0.2
A     0.9
B     0.5
B     1.2
B     0.3
B     0.5
B     0.6
C     0.1

and I want to get ONLY the highest coverage for each gene, as follows:

GENE COVERAGE
  A     0.9
  B     1.2
  C     0.1

I need the most representative value to calculate the density for each gene. Any suggestions?

PS. I can't install MySQL on this computer, although I know it would be the best option...

Thanks!

gene ChIP-Seq • 1.2k views
7.2 years ago

Assuming that the delimiter is a tab:

sed 's/^GENE/#GENE/' input.txt | LC_ALL=C sort -t $'\t' -k1,1 -k2,2rg  | LC_ALL=C sort -t $'\t' -k1,1 --stable -u
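  • the initial sed prefixes the header with '#', which sorts before letters in the C locale, so the header stays on the first row of the final output
  • the first sort orders by the first column, then by the second column as a general numeric value (g) in reverse order (r), so the highest coverage comes first within each gene
  • the second sort keeps one line per value of the first column (-u); --stable preserves the input order, so the line kept is the first one, i.e. the one with the highest coverage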
 #GENE  COVERAGE
 A  0.9
 B  1.2
 C  0.1
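
For anyone who wants to try the pipeline above without preparing a file first, here is a minimal sketch that rebuilds the sample data from the question and runs the same three commands on it. It assumes bash (for the $'\t' quoting) and GNU sort (for -g and --stable); the function name max_per_gene and the temporary file are only illustrative and not part of the original answer.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash and GNU sort. max_per_gene is a hypothetical helper name.

# Keep, for each value of column 1, the line with the highest column 2.
max_per_gene() {
    sed 's/^GENE/#GENE/' "$1" |
        LC_ALL=C sort -t $'\t' -k1,1 -k2,2rg |
        LC_ALL=C sort -t $'\t' -k1,1 --stable -u
}

# Rebuild the tab-delimited sample table from the question in a temporary file.
input=$(mktemp)
printf 'GENE\tCOVERAGE\n' > "$input"
printf 'A\t0.7\nA\t0.2\nA\t0.9\n' >> "$input"
printf 'B\t0.5\nB\t1.2\nB\t0.3\nB\t0.5\nB\t0.6\n' >> "$input"
printf 'C\t0.1\n' >> "$input"

result=$(max_per_gene "$input")
printf '%s\n' "$result"
rm -f "$input"
```

If the real file is space-delimited rather than tab-delimited, the -t option would need to change accordingly.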

Could you explain the code a bit, please?

Thank you!


I am not very familiar with this language, but it seems quite easy to use. For example, if I want to sort on different columns, say 7 and 11, I can try:

sed 's/^GENE/#GENE/' gene_body_annot_coverage | LC_ALL=C sort -t $'\t' -k7,7 -k11,11rg  | LC_ALL=C sort -t $'\t' -k7,7 --stable -u > output

But can you explain what LC_ALL=C means?

Thanks!

ADD REPLY
1
Entering edit mode

This 'language' is bash.

The columns are specified by the '-k' option, so it would be (...) -k7,7 -k11,11rg (...)
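
To make the '-k' usage concrete, here is a small sketch that takes the key column and the value column as arguments. It assumes bash and GNU sort; max_by_key and the three-column demo data are hypothetical, and the demo has no header line. Setting LC_ALL=C forces the C locale for that one command, so sort compares raw bytes and reads '.' as the decimal point regardless of the system language settings.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash and GNU sort. max_by_key is a hypothetical helper name.

# usage: max_by_key FILE KEY_COLUMN VALUE_COLUMN
# Prints, for each distinct KEY_COLUMN value, the line with the highest VALUE_COLUMN.
max_by_key() {
    local file=$1 key=$2 val=$3
    LC_ALL=C sort -t $'\t' -k"${key},${key}" -k"${val},${val}rg" "$file" |
        LC_ALL=C sort -t $'\t' -k"${key},${key}" --stable -u
}

# Made-up demo data: the gene is in column 2, the coverage in column 3.
demo=$(mktemp)
printf 'x\tB\t0.5\ny\tA\t0.2\nz\tB\t1.2\nw\tA\t0.9\n' > "$demo"

picked=$(max_by_key "$demo" 2 3)
printf '%s\n' "$picked"
rm -f "$demo"
```

For the file in the comment above, the call would be max_by_key gene_body_annot_coverage 7 11, with any header line handled separately.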


Yes, sorry about that, I just figured it out! Nice approach to handling files :)
