Question: script for task or AWK
1
rob234king • 550 (UK/Harpenden/Rothamsted Research) wrote, 13 days ago:

I have tried to break the problem down into two files. For each sequence (column 1) I want to take the most common column-2 value, then get the range of column 3 for those rows. I also want to count the number of markers (e.g. the number of Fc06 rows for seq000001). Any idea how to do this? I was hoping for something easy on the Linux command line using awk, but I just can't see it. The data is a cM-and-chromosome file which I wish to summarise.

file1

    seq000001
    seq000002

file2

    seq000001   Fc06    35.948
    seq000001   Fc06    36.793
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc08    37.009
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.58
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc08    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    38.1
    seq000001   Fc06    38.664
    seq000001   Fc06    38.923
    seq000001   Fc06    38.978
    seq000001   Fc06    39.324
    seq000001   Fc06    40.341
    seq000001   Fc06    40.341
    seq000001   Fc06    40.341
    seq000001   Fc06    40.543
    seq000001   Fc06    40.543
    seq000001   Fc06    40.543
    seq000001   Fc06    40.992
    seq000001   Fc06    41.189
    seq000001   Fc06    41.189
    seq000001   Fc06    41.347
    seq000001   Fc06    41.39
    seq000001   Fc06    41.39
    seq000001   Fc06    41.39
    seq000001   Fc06    41.399
    seq000001   Fc06    41.399
    seq000001   Fc06    41.399
    seq000001   Fc06    41.657
    seq000001   Fc06    41.71
    seq000001   Fc06    41.785
    seq000001   Fc06    41.923
    seq000001   Fc06    42.237
    seq000001   Fc06    42.634
    seq000001   Fc06    42.963
    seq000001   Fc06    43.285
    seq000001   Fc06    43.478
    seq000001   Fc06    43.478
    seq000001   Fc06    43.478
    seq000001   Fc06    43.478
    seq000001   Fc06    43.744
    seq000001   Fc06    43.744
    seq000001   Fc06    45.234
    seq000001   Fc06    45.234
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    51.086
    seq000002   Fc06    63.754
    seq000002   Fc06    63.754
    seq000002   Fc09    13.078
    seq000002   Fc09    13.078
    seq000002   Fc09    13.078
    seq000002   Fc09    16.342
    seq000002   Fc09    16.342
    seq000002   Fc09    16.342
    seq000002   Fc09    16.342
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.806
    seq000002   Fc09    18.544
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc11    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    39.857
    seq000002   Fc09    20.558
    seq000002   Fc09    20.892
    seq000002   Fc09    21.323

Output

seq000001   Fc06    35.948-63.754   72
seq000002   Fc09    13.078-39.857  34
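For reference, since the question asks for awk: the min/max/count grouping itself can be done in plain awk with associative arrays and no extra tools. This is only a sketch over a tiny inlined sample (a real run would point awk at file2), and picking the most common chromosome per sequence would still need a sort on the counts afterwards:

```shell
# tiny sample in the same three-column layout as file2 (inlined so the sketch is self-contained)
cat > /tmp/file2_demo.txt <<'EOF'
seq000001 Fc06 35.948
seq000001 Fc06 51.086
seq000001 Fc08 37.009
EOF

# group on (sequence, chromosome); track per-group min, max and row count
awk '{
    k = $1 "\t" $2; v = $3 + 0
    cnt[k]++
    if (cnt[k] == 1 || v < lo[k]) lo[k] = v
    if (cnt[k] == 1 || v > hi[k]) hi[k] = v
} END {
    for (k in cnt) printf "%s\t%s-%s\t%d\n", k, lo[k], hi[k], cnt[k]
}' /tmp/file2_demo.txt | sort
```

The trailing sort only fixes the output order, since awk does not guarantee any particular traversal order over array keys.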
awk script • 163 views
modified 13 days ago by dariober • 9.2k • written 13 days ago by rob234king • 550

Hello rob234king,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

written 13 days ago by genomax • 50k
3
cpad0112 • 6.4k wrote, 13 days ago:
 $ datamash -s -g 1,2 min 3 max 3 <test.txt | awk -v OFS="\t" '{print $1,$2,$3"-"$4}'
seq000001   Fc06    35.948-51.086
seq000001   Fc08    37.009-37.916
seq000002   Fc06    63.754-63.754
seq000002   Fc09    13.078-39.857
seq000002   Fc11    19.59-19.59

Install it with sudo apt install datamash if you are on Ubuntu; Homebrew and conda also carry datamash in their repos.

modified 13 days ago • written 13 days ago by cpad0112 • 6.4k

Damn, that easy? I'll download datamash. Will it also report how many markers there were? I made a change to what I was looking for above, but this is already something to work with. Thanks.

written 13 days ago by rob234king • 550

Here is the count for markers, with the range:

$ datamash -s -g 1,2 min 3 max 3 count 3 <test.txt |  awk -v OFS="\t" '{print $1,$2,$3"-"$4,$5}'
seq000001   Fc06    35.948-51.086   72
seq000001   Fc08    37.009-37.916   2
seq000002   Fc06    63.754-63.754   2
seq000002   Fc09    13.078-39.857   34
seq000002   Fc11    19.59-19.59  1

If you want range (max-min):

$ datamash -s -g 1,2 range 3 count 3 <test.txt 
seq000001   Fc06    15.138  72
seq000001   Fc08    0.907   2
seq000002   Fc06    0   2
seq000002   Fc09    26.779  34
seq000002   Fc11    0   1

range = max-min

modified 13 days ago • written 13 days ago by cpad0112 • 6.4k

I would only need the lead chromosome for each, so the output below. I guess then for each sequence I could somehow take the row with the highest value in the end column.

    seq000001 Fc06 35.948-51.086 72
    seq000002 Fc09 13.078-39.857 34

written 13 days ago by rob234king • 550

I can do this last bit in Excel. Thanks, this saved me so much time over trying to get a Perl script to do the same!

written 13 days ago by rob234king • 550
$ datamash -s -g 1,2 min 3 max 3 count 3 <test.txt |  awk -v OFS="\t" '{print $1,$2,$3"-"$4,$5}' | sort -rn -k4 
seq000001   Fc06    35.948-51.086   72
seq000002   Fc09    13.078-39.857   34
seq000002   Fc06    63.754-63.754   2
seq000001   Fc08    37.009-37.916   2
seq000002   Fc11    19.59-19.59 1
written 13 days ago by cpad0112 • 6.4k
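Following up on the lead-chromosome request: once the summary is sorted by count (highest first), an awk one-liner that keeps only the first line seen per sequence ID gives exactly the two desired rows. A sketch, with the sorted summary inlined as sample data:

```shell
# keep the first (i.e. highest-count) line for each value of column 1
printf '%s\n' \
  'seq000001 Fc06 35.948-51.086 72' \
  'seq000002 Fc09 13.078-39.857 34' \
  'seq000002 Fc06 63.754-63.754 2' \
  'seq000001 Fc08 37.009-37.916 2' \
  'seq000002 Fc11 19.59-19.59 1' |
awk '!seen[$1]++'
```

In practice the input would come straight from the datamash | sort pipeline shown above; awk prints a line only when its column-1 value has not been seen before.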
3
microfuge • 900 wrote, 13 days ago:

While the datamash solution is clearly the best, just for fun, an attempt based on Perl:

    perl -MList::Util=min,max -lane '$x = $F[0] . "\t" . $F[1]; push @{$y{$x}}, $F[-1]; END { foreach (keys %y) { my @d = @{$y{$_}}; print "$_\t", scalar(@d), "\t", min(@d), "--", max(@d) } }' inputfile.txt

    seq000002 Fc11 1 19.59--19.59
    seq000002 Fc09 34 13.078--39.857
    seq000001 Fc08 2 37.009--37.916
    seq000001 Fc06 72 35.948--51.086
    seq000002 Fc06 2 63.754--63.754
modified 13 days ago • written 13 days ago by microfuge • 900

Nice to see a Perl programmer.

written 13 days ago by Kevin Blighe • 21k
3
dariober • 9.2k (Glasgow - UK) wrote, 13 days ago:

Another option is to use bedtools groupby:

sort -k1,1 -k2,2n data.txt | groupBy -g 1,2 -c 3,3,3 -o min,max,count

seq000001   Fc06    35.948  51.086  72
seq000001   Fc08    37.009  37.916  2
seq000002   Fc06    63.754  63.754  2
seq000002   Fc09    13.078  39.857  34
seq000002   Fc11    19.59   19.59   1
written 13 days ago by dariober • 9.2k
2
Lisa Ha • 40 (Germany) wrote, 13 days ago:

That was a nice Monday morning challenge for working on my Python skills. Note that the max value for the first sequence differs from your answer template, but it is consistent with the actual values in file2.

import csv
from collections import Counter

# read the sequence IDs from file1 into a list
with open("file1.txt") as stream:
    seqlist = [line.strip() for line in stream]

# read file2 once and keep all rows in memory
with open("file2.txt") as stream:
    rows = list(csv.reader(stream, delimiter="\t"))

for seq in seqlist:
    # keep columns 2 and 3 for this sequence
    array = [line[1:] for line in rows if seq == line[0]]
    # find and count the most common element in column 2
    mostCommon, commonCount = Counter(element[0] for element in array).most_common(1)[0]
    # numeric range of column 3 for the most common element
    # (convert to float: comparing the raw strings would be lexicographic)
    values = [float(element[1]) for element in array if element[0] == mostCommon]
    # print out the results
    print(seq, "\t", mostCommon, "\t", min(values), "-", max(values), "\t", commonCount)
written 13 days ago by Lisa Ha • 40
Powered by Biostar version 2.3.0