Question: Count the sum of molecular markers and the respective centimorgan range from a list a sequences
0
gravatar for gabri
9 months ago by
gabri50
gabri50 wrote:

Hi All,

I have an excel table with 3 tab-delimited columns and n rows like this:

**header marker cm**

101 4   0-7.195

103 8   38.582-49.653

103 5   43.096-46.534

103 1   49.653-49.653

103 1   51.676-51.676

104 2   22.454-37.061

104 4   23.351-37.061

105 2   83.619-84.178

106 1   36.307-36.307

106 1   40.62-40.62

I need an output like this:

**header    marker  cm**

101 4   0-7.195

103 15  38.582-51.676

104 6   22.454-37.061

105 2   83.619-84.178

106 2   36.307-40.62

So, I need to collapse the duplicates in the header column, keeping the sum of the markers obtained from every row of that duplicate, and extract from the cm column the lowest and the highest value among all the rows of that duplicate.

For example, for the duplicate 103 the sum is 8 + 5 + 1 + 1, the lowest cm value is 38.582 in the 2nd row and the highest cm value is 51.676 in the 5th row. In the table there are also not duplicate headers (like the 101) that have to be keep directly as they are.

Any idea for an automated solution?

Thank you

ADD COMMENTlink modified 8 months ago • written 9 months ago by gabri50
1

Clippy says

ADD REPLYlink written 9 months ago by ATpoint14k

Hello,

This post looks off-topic for the site. Please substantiate how this question is related to bioinformatics.

ADD REPLYlink written 9 months ago by RamRS20k

Yes I'm going to explain it better. The "header" column represents my sequences, the "marker" column represents the number of molecular markers that I found on the sequences of the "header" column, the "cm" column represents the range position in centimorgan of these markers on my sequences.

ADD REPLYlink written 9 months ago by gabri50

Please edit your question and add this information in there.

ADD REPLYlink written 9 months ago by RamRS20k
2
gravatar for cpad0112
9 months ago by
cpad011211k
India
cpad011211k wrote:

output (please cross check):

$ tail -n+2 test.txt | sed -e 's/-/\t/g' | datamash -s -g 1 sum 2 min 3 max 4 | awk -v OFS="\t" 'BEGIN {print "header","market","cm"} {print $1,$2,$3"-"$4}'

header  market  cm
101 4   0-7.195
103 15  38.582-51.676
104 6   22.454-37.061
105 2   83.619-84.178
106 2   36.307-40.62

input:

$ cat test.txt 
header  marker  cm
101 4   0-7.195
103 8   38.582-49.653
103 5   43.096-46.534
103 1   49.653-49.653
103 1   51.676-51.676
104 2   22.454-37.061
104 4   23.351-37.061
105 2   83.619-84.178
106 1   36.307-36.307
106 1   40.62-40.62

datamash is in most of the gnu-linux repos (Use either apt or dnf/yum)

ADD COMMENTlink modified 9 months ago • written 9 months ago by cpad011211k

What's this datamash?

EDIT: Found it, never mind. IMO we should be using builtins or programming languages instead of utilities.

ADD REPLYlink modified 9 months ago • written 9 months ago by RamRS20k

Thank you for your help! I didn't know about datamash, very useful. Thank you again.

ADD REPLYlink written 9 months ago by gabri50
2
gravatar for b.nota
9 months ago by
b.nota6.4k
Netherlands
b.nota6.4k wrote:

Or use R plyr package, but first make a forth column (split the last column by the "-" sign).

library(plyr)

header <- c(101,103,103,103,103,104,104,105,106,106)
marker <- c(4,8,5,1,1,2,4,2,1,1)
cM_min <- c(0,38.582,43.096,49.653,51.676,22.454,23.351,83.619,36.307,40.62)
cM_max <- c(7.195,49.653,46.534,49.653,51.676,37.061,37.061,84.178,36.307,40.62)

df <- as.data.frame(cbind(header, marker, cM_min, cM_max))
df
   header marker   cM_min   cM_max
1   101    4  0.000  7.195
2   103    8 38.582 49.653
3   103    5 43.096 46.534
4   103    1 49.653 49.653
5   103    1 51.676 51.676
6   104    2 22.454 37.061
7   104    4 23.351 37.061
8   105    2 83.619 84.178
9   106    1 36.307 36.307
10  106    1 40.620 40.620

ddply(df, .(header), summarize,
 marker = sum(marker),
 cM_min = min(cM_min),
 cM_max = max(cM_max))

  header marker    cM_min    cM_max
1  101   4  0.000  7.195
2  103  15 38.582 51.676
3  104   6 22.454 37.061
4  105   2 83.619 84.178
5  106   2 36.307 40.620
ADD COMMENTlink modified 9 months ago • written 9 months ago by b.nota6.4k
2
gravatar for Gungor Budak
9 months ago by
Gungor Budak230
Ankara/Turkey
Gungor Budak230 wrote:

Don't use Excel for this task, use Python to read the input line by line, keep every line in a dictionary (key: header, value: list of marker and cm) and every time you read a line with header that is already in the dictionary, update the values accordingly and just output the content of the dictionary after you process all the lines. Try this and let us know how it goes! Good luck :)

ADD COMMENTlink written 9 months ago by Gungor Budak230
1
gravatar for RamRS
9 months ago by
RamRS20k
Houston, TX
RamRS20k wrote:
ADD COMMENTlink written 9 months ago by RamRS20k
1
gravatar for cpad0112
9 months ago by
cpad011211k
India
cpad011211k wrote:

in line with above post using stringr and dplyr:

df1 = read.csv("file.txt",stringsAsFactors = F,strip.white = T,sep = "\t")
library(stringr)
df1[, c("min", "max")] = str_split_fixed(df1$cm, "-", 2)
library(dplyr)
data.frame(df1 %>%
    group_by(header) %>%
    summarise(sum = sum(marker), range = paste(min(min), max(max), sep = "-")))

  header sum         range
1    101   4       0-7.195
2    103  15 38.582-51.676
3    104   6 22.454-37.061
4    105   2 83.619-84.178
5    106   2  36.307-40.62
ADD COMMENTlink modified 9 months ago • written 9 months ago by cpad011211k
1

Even this solution works really well, thank you very much!

ADD REPLYlink written 9 months ago by gabri50

Nice trick with the string split!

ADD REPLYlink written 9 months ago by b.nota6.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1336 users visited in the last hour