Question: script for task or AWK
1
rob234king • 550 (UK/Harpenden/Rothamsted Research) wrote, 13 days ago:

I have tried to break the problem down into two files. For each sequence (column 1) I want to take the most common column-2 value, then get the range of column 3 for those rows. I also want to count the number of markers (e.g. the number of Fc06 rows for seq000001). Any idea how to do this? I was hoping for something easy on the Linux command line using awk, but I just can't see it. The data is a cM-and-chromosome file which I wish to summarise.

file1

    seq000001
    seq000002

file2

    seq000001   Fc06    35.948
    seq000001   Fc06    36.793
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc08    37.009
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.58
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc08    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    38.1
    seq000001   Fc06    38.664
    seq000001   Fc06    38.923
    seq000001   Fc06    38.978
    seq000001   Fc06    39.324
    seq000001   Fc06    40.341
    seq000001   Fc06    40.341
    seq000001   Fc06    40.341
    seq000001   Fc06    40.543
    seq000001   Fc06    40.543
    seq000001   Fc06    40.543
    seq000001   Fc06    40.992
    seq000001   Fc06    41.189
    seq000001   Fc06    41.189
    seq000001   Fc06    41.347
    seq000001   Fc06    41.39
    seq000001   Fc06    41.39
    seq000001   Fc06    41.39
    seq000001   Fc06    41.399
    seq000001   Fc06    41.399
    seq000001   Fc06    41.399
    seq000001   Fc06    41.657
    seq000001   Fc06    41.71
    seq000001   Fc06    41.785
    seq000001   Fc06    41.923
    seq000001   Fc06    42.237
    seq000001   Fc06    42.634
    seq000001   Fc06    42.963
    seq000001   Fc06    43.285
    seq000001   Fc06    43.478
    seq000001   Fc06    43.478
    seq000001   Fc06    43.478
    seq000001   Fc06    43.478
    seq000001   Fc06    43.744
    seq000001   Fc06    43.744
    seq000001   Fc06    45.234
    seq000001   Fc06    45.234
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    51.086
    seq000002   Fc06    63.754
    seq000002   Fc06    63.754
    seq000002   Fc09    13.078
    seq000002   Fc09    13.078
    seq000002   Fc09    13.078
    seq000002   Fc09    16.342
    seq000002   Fc09    16.342
    seq000002   Fc09    16.342
    seq000002   Fc09    16.342
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.806
    seq000002   Fc09    18.544
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc11    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    39.857
    seq000002   Fc09    20.558
    seq000002   Fc09    20.892
    seq000002   Fc09    21.323

Output

seq000001   Fc06    35.948-63.754   72
seq000002   Fc09    13.078-39.857  34
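For reference, since the question asks for awk: the min/max/count grouping itself can be done in plain awk with associative arrays and no extra tools. This is only a sketch over a tiny inlined sample (a real run would point awk at file2), and picking the most common chromosome per sequence would still need a sort on the counts afterwards:

```shell
# tiny sample in the same three-column layout as file2 (inlined so the sketch is self-contained)
cat > /tmp/file2_demo.txt <<'EOF'
seq000001 Fc06 35.948
seq000001 Fc06 51.086
seq000001 Fc08 37.009
EOF

# group on (sequence, chromosome); track per-group min, max and row count
awk '{
    k = $1 "\t" $2; v = $3 + 0
    cnt[k]++
    if (cnt[k] == 1 || v < lo[k]) lo[k] = v
    if (cnt[k] == 1 || v > hi[k]) hi[k] = v
} END {
    for (k in cnt) printf "%s\t%s-%s\t%d\n", k, lo[k], hi[k], cnt[k]
}' /tmp/file2_demo.txt | sort
```

The trailing sort only fixes the output order, since awk does not guarantee any particular traversal order over array keys.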
awk script • 163 views
modified 13 days ago by dariober • 9.2k • written 13 days ago by rob234king • 550

Hello rob234king,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

written 13 days ago by genomax • 50k
3
cpad0112 • 6.4k wrote, 13 days ago:
 $ datamash -s -g 1,2 min 3 max 3 <test.txt | awk -v OFS="\t" '{print $1,$2,$3"-"$4}'
seq000001   Fc06    35.948-51.086
seq000001   Fc08    37.009-37.916
seq000002   Fc06    63.754-63.754
seq000002   Fc09    13.078-39.857
seq000002   Fc11    19.59-19.59

Install it with sudo apt install datamash if you are on Ubuntu; Homebrew and conda also carry datamash in their repos.

modified 13 days ago • written 13 days ago by cpad0112 • 6.4k

Damn, that easy? I'll download datamash. Will it also report how many markers there were? I made a change to what I was looking for above, but this is already something to work with. Thanks.

written 13 days ago by rob234king • 550

Here is the count for markers, with the range:

$ datamash -s -g 1,2 min 3 max 3 count 3 <test.txt |  awk -v OFS="\t" '{print $1,$2,$3"-"$4,$5}'
seq000001   Fc06    35.948-51.086   72
seq000001   Fc08    37.009-37.916   2
seq000002   Fc06    63.754-63.754   2
seq000002   Fc09    13.078-39.857   34
seq000002   Fc11    19.59-19.59  1

If you want range (max-min):

$ datamash -s -g 1,2 range 3 count 3 <test.txt 
seq000001   Fc06    15.138  72
seq000001   Fc08    0.907   2
seq000002   Fc06    0   2
seq000002   Fc09    26.779  34
seq000002   Fc11    0   1

range = max-min

modified 13 days ago • written 13 days ago by cpad0112 • 6.4k

I would only need the lead chromosome for each, so the output below. I guess then for each sequence I could somehow take the row with the highest value in the end column.

    seq000001 Fc06 35.948-51.086 72
    seq000002 Fc09 13.078-39.857 34

written 13 days ago by rob234king • 550

I can do this last bit in Excel. Thanks, this saved me so much time over trying to get a Perl script to do the same!

written 13 days ago by rob234king • 550
$ datamash -s -g 1,2 min 3 max 3 count 3 <test.txt |  awk -v OFS="\t" '{print $1,$2,$3"-"$4,$5}' | sort -rn -k4 
seq000001   Fc06    35.948-51.086   72
seq000002   Fc09    13.078-39.857   34
seq000002   Fc06    63.754-63.754   2
seq000001   Fc08    37.009-37.916   2
seq000002   Fc11    19.59-19.59 1
written 13 days ago by cpad0112 • 6.4k
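Following up on the lead-chromosome request: once the summary is sorted by count (highest first), an awk one-liner that keeps only the first line seen per sequence ID gives exactly the two desired rows. A sketch, with the sorted summary inlined as sample data:

```shell
# keep the first (i.e. highest-count) line for each value of column 1
printf '%s\n' \
  'seq000001 Fc06 35.948-51.086 72' \
  'seq000002 Fc09 13.078-39.857 34' \
  'seq000002 Fc06 63.754-63.754 2' \
  'seq000001 Fc08 37.009-37.916 2' \
  'seq000002 Fc11 19.59-19.59 1' |
awk '!seen[$1]++'
```

In practice the input would come straight from the datamash | sort pipeline shown above; awk prints a line only when its column-1 value has not been seen before.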
3
microfuge • 900 wrote, 13 days ago:

While the datamash solution is clearly the best, just for fun, an attempt based on Perl:

    perl -MList::Util=min,max -lane '$x = $F[0] . "\t" . $F[1]; push @{$y{$x}}, $F[-1]; END { foreach (keys %y) { my @d = @{$y{$_}}; print "$_\t", scalar(@d), "\t", min(@d), "--", max(@d) } }' inputfile.txt

    seq000002 Fc11 1 19.59--19.59
    seq000002 Fc09 34 13.078--39.857
    seq000001 Fc08 2 37.009--37.916
    seq000001 Fc06 72 35.948--51.086
    seq000002 Fc06 2 63.754--63.754
modified 13 days ago • written 13 days ago by microfuge • 900

Nice to see a Perl programmer.

written 13 days ago by Kevin Blighe • 21k
3
dariober • 9.2k (Glasgow - UK) wrote, 13 days ago:

Another option is to use bedtools groupby:

sort -k1,1 -k2,2n data.txt | groupBy -g 1,2 -c 3,3,3 -o min,max,count

seq000001   Fc06    35.948  51.086  72
seq000001   Fc08    37.009  37.916  2
seq000002   Fc06    63.754  63.754  2
seq000002   Fc09    13.078  39.857  34
seq000002   Fc11    19.59   19.59   1
written 13 days ago by dariober • 9.2k
2
Lisa Ha • 40 (Germany) wrote, 13 days ago:

That was a nice Monday morning challenge for working on my Python skills. Note that the max value for the first sequence differs from your answer template, but it is consistent with the actual values in file2.

import csv
from collections import Counter

# read the sequence IDs from file1 into a list
with open("file1.txt") as stream:
    seqlist = [line.strip() for line in stream]

# read file2 once and keep all rows in memory
with open("file2.txt") as stream:
    rows = list(csv.reader(stream, delimiter="\t"))

for seq in seqlist:
    # keep columns 2 and 3 for this sequence
    array = [line[1:] for line in rows if seq == line[0]]
    # find and count the most common element in column 2
    mostCommon, commonCount = Counter(element[0] for element in array).most_common(1)[0]
    # numeric range of column 3 for the most common element
    # (convert to float: comparing the raw strings would be lexicographic)
    values = [float(element[1]) for element in array if element[0] == mostCommon]
    # print out the results
    print(seq, "\t", mostCommon, "\t", min(values), "-", max(values), "\t", commonCount)
written 13 days ago by Lisa Ha • 40
Powered by Biostar version 2.3.0