Question

Filtering microRNA Data With Awk

0

12.4 years ago

Chuangye ▴ 80

chr17    10558721    10558801    ssc-mir-486-1    +    0.475    chr17    10558763    10558830    ssc-mir-486    +
chr17    33406488    33406568    ssc-mir-103-2    -    1.7375    chr17    33406491    33406569    ssc-mir-103a-2    -
chr17    33406488    33406568    ssc-mir-103-2    -    1.7375    chr17    33406499    33406561    ssc-mir-103b-2    +
chr17    40405261    40405341    ssc-mir-499    -    1.9125    chr17    40405237    40405359    ssc-mir-499a    -
chr17    40405261    40405341    ssc-mir-499    -    1.9125    chr17    40405262    40405335    ssc-mir-499b    +
chr17    61587157    61587234    ssc-mir-296    -    0.987012987012987    chr17    61587158    61587236    ssc-mir-296    -
chrX    58683649    58683729    ssc-mir-374b    +    1.725    chrX    58683647    58683717    ssc-mir-374c    -
chrX    58683649    58683729    ssc-mir-374b    +    1.725    chrX    58683647    58683719    ssc-mir-374b    +

The result should be:

chr17    33406488    33406568    ssc-mir-103-2    -    1.7375    chr17    33406499    33406561    ssc-mir-103b-2    +
chr17    40405261    40405341    ssc-mir-499    -    1.9125    chr17    40405262    40405335    ssc-mir-499b    +
chrX    58683649    58683729    ssc-mir-374b    +   1.725   chrX    58683647    58683717    ssc-mir-374c    -

My scripts:

awk 'BEGIN{OFS="\t"}(!($5==$11)&&($6>1))' intersect.txt

Or

awk 'BEGIN{OFS="\t"}($5!=$11 && $6>1)' intersect.txt

But the output I actually get is:

chr17    33406488    33406568    ssc-mir-103-2    -    1.7375    chr17    33406499    33406561    ssc-mir-103b-2    +
chr17    40405261    40405341    ssc-mir-499    -    1.9125    chr17    40405262    40405335    ssc-mir-499b    +
chrX    58683649    58683729    ssc-mir-374b    +    1.725    chrX    58683647    58683717    ssc-mir-374c    -
chrX    58683649    58683729    ssc-mir-374b    +    1.725    chrX    58683647    58683719    ssc-mir-374b    +

Why don't these scripts produce the expected result? And how can I do this with Unix tools or a Python script?

unix awk • 2.4k views

updated 12.4 years ago by W Langdon ▴ 90 • written 12.4 years ago by Chuangye ▴ 80

0

What are you trying to produce? Right now the script compares the two strand columns and checks whether the 6th column is greater than 1. Why are you expecting only chr17 to be produced?

12.4 years ago by Damian Kao 16k

0

Thank you! I have corrected the error.

12.4 years ago by Chuangye ▴ 80

score 3 · Answer 1 · 2011-11-27

3

12.4 years ago

Pierre Lindenbaum 161k

If you want to get the lines where the strands are not the same and where the value in the 6th column is greater than 1.0:

 $ awk -F '\t' '($5!=$11 && $6>1.0)' input.txt
chr17   33406488    33406568    ssc-mir-103-2   -   1.7375  chr17   33406499    33406561    ssc-mir-103b-2  +
chr17   40405261    40405341    ssc-mir-499 -   1.9125  chr17   40405262    40405335    ssc-mir-499b    +
chrX    58683649    58683729    ssc-mir-374b    +   1.725   chrX    58683647    58683717    ssc-mir-374c    -

I don't know why your script doesn't work.
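Since the question also asks about Python, here is a minimal sketch of the same filter. It assumes the file is strictly tab-separated with no header line, and that the strands sit in columns 5 and 11 and the score in column 6, as in the sample rows of the question. The function name `filter_rows` and its parameters are illustrative only, not something from this thread.

```python
import sys


def filter_rows(lines, strand_a=4, strand_b=10, score=5, threshold=1.0):
    """Yield lines whose two strand columns differ and whose score column
    exceeds `threshold`. Column indices are 0-based (awk's $5 is index 4)."""
    needed = max(strand_a, strand_b, score) + 1
    for line in lines:
        line = line.rstrip("\r\n")
        # Split on tabs only, so an empty column keeps its position.
        fields = line.split("\t")
        if len(fields) < needed:
            continue  # blank or truncated line
        if fields[strand_a] != fields[strand_b] and float(fields[score]) > threshold:
            yield line


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: python filter_strands.py intersect.txt
    with open(sys.argv[1]) as handle:
        for row in filter_rows(handle):
            print(row)
```

If your real file has the strands and score in other columns, pass different indices, for example `filter_rows(handle, strand_a=5, strand_b=13, score=7)`.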


0

Hi Pierre, you are right. Thank you very much!

12.4 years ago by Chuangye ▴ 80

0

Neither of these scripts

awk 'BEGIN{OFS="\t"}($6!=$14 && $8>1)'

or

awk -F ' ' '($6!=$14 && $8>1.0)'

correctly returns the lines whose strands are not the same and in which the value in the 6th column is greater than 1.0, for example on the data file "intersect", which is temporarily deposited at http://www.rayfile.com/zh-cn/files/0a846566-1954-11e1-95c1-0015c55db73d/cbbffc68/.

I don't know where the problem is.

12.4 years ago by Chuangye ▴ 80

Ram · Answer 2 · 2011-11-28

2

12.4 years ago

W Langdon ▴ 90

It appears there may be a problem with assigning OFS inside BEGIN. http://lists.gnu.org/archive/html/bug-gnu-utils/2011-03/msg00006.html

However, why do you want to set OFS to tab? By default gawk will split input lines on white space (which includes tabs).

I have had problems with tabs before. It's often safer either to work with the defaults or to find some other way to parse the input (e.g. comma-separated data).

Bill

updated 4.6 years ago by Ram 43k • written 12.4 years ago by W Langdon ▴ 90

1

Nice catch on the tabs, I am very surprised by it as well. Letting awk split on any whitespace can actually lead to very surprising outcomes. In general, the default split in most programming languages (Perl, Python) collapses consecutive whitespace characters and treats them as a single separator. As a result, an empty column in tab-separated data shifts every subsequent column one position to the left. Once you have been bitten by one of these devious column-shifting defaults, you never rely on them again.
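To make the column shift concrete, here is a small Python illustration. The row is made up for the example (its 4th column, the name, is deliberately empty) and is not taken from the "intersect" file.

```python
# Hypothetical tab-separated row whose 4th column (the name) is empty.
row = "chr17\t100\t200\t\t-\t1.7375"

by_whitespace = row.split()    # default split: runs of whitespace collapse
by_tab = row.split("\t")       # explicit tab: empty fields are preserved

print(by_whitespace)  # ['chr17', '100', '200', '-', '1.7375']
print(by_tab)         # ['chr17', '100', '200', '', '-', '1.7375']

# The strand is the 5th column (index 4) only when splitting on tabs.
print(by_tab[4])         # -
print(by_whitespace[4])  # 1.7375
```

With the default split, a test on "column 5" silently compares the score instead of the strand, which is the kind of surprise described above.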

ADD REPLY • link 12.4 years ago by Istvan Albert 100k