script for task or AWK
5.9 years ago
rob234king ▴ 610

I have tried to break the problem down into two files. For each sequence (column 1) I want to take the most common column 2 value, get the range of column 3 for those rows, and count the number of markers, e.g. the number of Fc06 rows for seq000001. Any idea how to do this? I was hoping for something easy on the Linux command line using awk, but I just can't see it. The data is a chromosome and cM file which I want to summarise.

file1

    seq000001
    seq000002

file2

    seq000001   Fc06    35.948
    seq000001   Fc06    36.793
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc06    37.009
    seq000001   Fc08    37.009
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.368
    seq000001   Fc06    37.58
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc08    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    37.916
    seq000001   Fc06    38.1
    seq000001   Fc06    38.664
    seq000001   Fc06    38.923
    seq000001   Fc06    38.978
    seq000001   Fc06    39.324
    seq000001   Fc06    40.341
    seq000001   Fc06    40.341
    seq000001   Fc06    40.341
    seq000001   Fc06    40.543
    seq000001   Fc06    40.543
    seq000001   Fc06    40.543
    seq000001   Fc06    40.992
    seq000001   Fc06    41.189
    seq000001   Fc06    41.189
    seq000001   Fc06    41.347
    seq000001   Fc06    41.39
    seq000001   Fc06    41.39
    seq000001   Fc06    41.39
    seq000001   Fc06    41.399
    seq000001   Fc06    41.399
    seq000001   Fc06    41.399
    seq000001   Fc06    41.657
    seq000001   Fc06    41.71
    seq000001   Fc06    41.785
    seq000001   Fc06    41.923
    seq000001   Fc06    42.237
    seq000001   Fc06    42.634
    seq000001   Fc06    42.963
    seq000001   Fc06    43.285
    seq000001   Fc06    43.478
    seq000001   Fc06    43.478
    seq000001   Fc06    43.478
    seq000001   Fc06    43.478
    seq000001   Fc06    43.744
    seq000001   Fc06    43.744
    seq000001   Fc06    45.234
    seq000001   Fc06    45.234
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    46.272
    seq000001   Fc06    51.086
    seq000002   Fc06    63.754
    seq000002   Fc06    63.754
    seq000002   Fc09    13.078
    seq000002   Fc09    13.078
    seq000002   Fc09    13.078
    seq000002   Fc09    16.342
    seq000002   Fc09    16.342
    seq000002   Fc09    16.342
    seq000002   Fc09    16.342
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.004
    seq000002   Fc09    17.806
    seq000002   Fc09    18.544
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    19.59
    seq000002   Fc11    19.59
    seq000002   Fc09    19.59
    seq000002   Fc09    39.857
    seq000002   Fc09    20.558
    seq000002   Fc09    20.892
    seq000002   Fc09    21.323

Desired output

seq000001   Fc06    35.948-63.754   72
seq000002   Fc09    13.078-39.857  34

Hello rob234king,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

5.9 years ago
 $ datamash -s -g 1,2 min 3 max 3 <test.txt | awk -v OFS="\t" '{print $1,$2,$3"-"$4}'
seq000001   Fc06    35.948-51.086
seq000001   Fc08    37.009-37.916
seq000002   Fc06    63.754-63.754
seq000002   Fc09    13.078-39.857
seq000002   Fc11    19.59-19.59

Run sudo apt install datamash if you are on Ubuntu; Homebrew and conda also have datamash in their repos.


Damn, that easy? I'll download datamash. Will it also report how many markers there were? I made a change to what I was looking for above, but this is already something to work with. Thanks.


Here is the marker count along with the range:

$ datamash -s -g 1,2 min 3 max 3 count 3 <test.txt |  awk -v OFS="\t" '{print $1,$2,$3"-"$4,$5}'
seq000001   Fc06    35.948-51.086   72
seq000001   Fc08    37.009-37.916   2
seq000002   Fc06    63.754-63.754   2
seq000002   Fc09    13.078-39.857   34
seq000002   Fc11    19.59-19.59  1

If you want range (max-min):

$ datamash -s -g 1,2 range 3 count 3 <test.txt 
seq000001   Fc06    15.138  72
seq000001   Fc08    0.907   2
seq000002   Fc06    0   2
seq000002   Fc09    26.779  34
seq000002   Fc11    0   1

range = max-min
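
For completeness, since the question asked for awk: below is a rough pure-awk sketch of the same group/min/max/count aggregation, with no datamash needed. It assumes whitespace-separated columns in test.txt; the trailing sort is only there to match the ordering of the datamash output above.

$ awk -v OFS="\t" '
    {
      key = $1 OFS $2                        # group on sequence + marker
      cnt[key]++                             # marker count per group
      v = $3 + 0                             # force numeric comparison of the cM value
      if (!(key in min) || v < min[key]) min[key] = v
      if (!(key in max) || v > max[key]) max[key] = v
    }
    END { for (key in cnt) print key, min[key] "-" max[key], cnt[key] }
  ' test.txt | sort
seq000001   Fc06    35.948-51.086   72
seq000001   Fc08    37.009-37.916   2
seq000002   Fc06    63.754-63.754   2
seq000002   Fc09    13.078-39.857   34
seq000002   Fc11    19.59-19.59  1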


I would only need the lead chromosome for each sequence, so something like the output below. I guess for each sequence I could then somehow take the row with the highest value in the last column.

seq000001   Fc06    35.948-51.086   72
seq000002   Fc09    13.078-39.857   34


I can do this last bit in Excel. Thanks, this saved me so much time compared with trying to get a Perl script to do the same!

$ datamash -s -g 1,2 min 3 max 3 count 3 <test.txt |  awk -v OFS="\t" '{print $1,$2,$3"-"$4,$5}' | sort -rn -k4 
seq000001   Fc06    35.948-51.086   72
seq000002   Fc09    13.078-39.857   34
seq000002   Fc06    63.754-63.754   2
seq000001   Fc08    37.009-37.916   2
seq000002   Fc11    19.59-19.59 1
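
If only the lead chromosome per sequence is wanted (as asked above), one extra filter after a per-sequence sort would do it. A minimal sketch, assuming the same four-column output: sort by sequence and then by count in descending order, and let awk '!seen[$1]++' keep only the first, i.e. highest-count, row for each sequence.

$ datamash -s -g 1,2 min 3 max 3 count 3 <test.txt | awk -v OFS="\t" '{print $1,$2,$3"-"$4,$5}' | sort -k1,1 -k4,4nr | awk '!seen[$1]++'
seq000001   Fc06    35.948-51.086   72
seq000002   Fc09    13.078-39.857   34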
5.9 years ago
microfuge ★ 1.9k

While the datamash solution is clearly the best, just for fun, here is an attempt based on Perl:

perl -MList::Util=min,max -lane '$x = $F[0]."\t".$F[1]; push @{$y{$x}}, $F[-1]; END{ foreach (keys %y) { my @d = @{$y{$_}}; print "$_\t", scalar(@d), "\t", min(@d), "--", max(@d) } }' inputfile.txt

seq000002   Fc11    1   19.59--19.59
seq000002   Fc09    34  13.078--39.857
seq000001   Fc08    2   37.009--37.916
seq000001   Fc06    72  35.948--51.086
seq000002   Fc06    2   63.754--63.754

Nice to see a Perl programmer.

5.9 years ago

Another option is to use bedtools groupby:

sort -k1,1 -k2,2n data.txt | groupBy -g 1,2 -c 3,3,3 -o min,max,count

seq000001   Fc06    35.948  51.086  72
seq000001   Fc08    37.009  37.916  2
seq000002   Fc06    63.754  63.754  2
seq000002   Fc09    13.078  39.857  34
seq000002   Fc11    19.59   19.59   1
5.9 years ago
Lisa Ha ▴ 120

That was a nice Monday morning challenge for working on my Python skills. The max value for the first sequence is different from your answer template, but it is consistent with the actual values in file2.

import csv

# open file1 and store the sequences as a list, one per line
with open("file1.txt") as stream:
    seqlist = [line.strip() for line in stream]

# file2 is tab-delimited: sequence, marker, cM position
stream = open("file2.txt")
reader = csv.reader(stream, delimiter="\t")

for seq in seqlist:
    # keep columns 2 and 3 for the current sequence
    array = [line[1:] for line in reader if seq == line[0]]
    # find and count the most common element of column 2
    common = [element[0] for element in array]
    mostCommon = max(common, key=common.count)
    commonCount = common.count(mostCommon)
    # range of column 3 for the most common element (compare as numbers, not strings)
    values = [float(element[1]) for element in array if element[0] == mostCommon]
    # print out the results
    print(seq, "\t", mostCommon, "\t", min(values), "-", max(values), "\t", commonCount)
    # rewind file2 so the reader starts from the top for the next sequence
    stream.seek(0)