Grouping columns in a BED-like file
7.1 years ago
ruchiksy ▴ 50

I have a BED-like file with 19 columns, which I have reduced to the following format:

gencode.v19 A_Heart_AG A_Heart_BC A_Kidney_AG A_Kidney_BC A_Liver_AG A_Liver_BC A_Lung_AG A_Lung_BC A_Stomach_BC A_Stomach_OG_0288 A_Stomach_OG_0393 A_Stomach_OG_1840
1 0 0 1 1 1 1 1 1 1 1 1 0
1 1 0 1 1 1 0 0 1 1 1 1 0
0 0 0 0 1 0 0 0 0 0 0 0 0
1 0 1 1 1 1 1 1 1 1 1 1 0
1 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 0 1 1 1 0 1 0
1 0 0 1 1 1 0 0 1 0 0 1 0
1 1 1 1 0 1 0 1 1 0 1 1 0
0 0 0 1 0 0 0 0 0 0 0 0 0

I want to group the columns by tissue, so that Heart_AG and Heart_BC become just "Heart", and so on for the other tissues. Then I want to take the resulting file and count how many times each library has an intron present. The goal is to create a six-way Venn diagram.

I thought of using an awk command, but I would like to automate the process rather than massage the file unnecessarily.

How should I go about doing this?

**** Further Details ****

The 0s and 1s represent the absence or presence of introns in the various libraries. By grouping I mean the following: "Heart", for example, has two vendors, Agilent and Biochain. Look for the intron in either library, and if it is present in at least one of them, count it as "1", like so:

A_Heart_AG  A_Heart_BC  Count
1           0           1
0           0           0
1           1           1

I would have to do this for all libraries and then make a six-way Venn diagram (six including the Gencode annotations). The Venn diagram would be made by hand, or by passing the counts to an R library that can draw it for me.

python introns • 1.6k views

It's not quite clear to me exactly what you wish to do.

Do 0 and 1 represent intron yes/no at several different locations (rows)?

When you say "group", does that mean "sum values from all samples, by tissue"? Per row position? What does the final result look like (if done by hand)?

You say six-way Venn diagram, but there are only five different tissues in that text file.

On a side note, I would just describe the file as whitespace-separated; it doesn't have much to do with a BED file.


Amended the question.


There is nothing BED-like about that file. It looks like a matrix of features (columns) for libraries? (rows?)

If you want to do a disjunction operation, you could read each row of 0/1 values into an array. Then apply the Boolean OR (|) operation to subsets of columns: for example, applying it to the values in columns 2 and 3 of each row gives you a "heart" value for that row. Repeat this for the pairs, triplets, etc. of columns that belong to the other tissue types.

When you finish processing a row of values into a smaller set of condensed values, print out a new row to standard output.
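The row-by-row OR described above could be sketched in Python (the question is tagged python) roughly as follows. This is only a minimal sketch: the function names `tissue_name` and `collapse_by_tissue` are made up for illustration, the sample is the header plus the first three rows from the question, and it assumes every column name follows the `<one-letter prefix>_<Tissue>_<suffix>` pattern seen there.

```python
import re

# Header and first three data rows copied from the question.
SAMPLE = """
gencode.v19 A_Heart_AG A_Heart_BC A_Kidney_AG A_Kidney_BC A_Liver_AG A_Liver_BC A_Lung_AG A_Lung_BC A_Stomach_BC A_Stomach_OG_0288 A_Stomach_OG_0393 A_Stomach_OG_1840
1 0 0 1 1 1 1 1 1 1 1 1 0
1 1 0 1 1 1 0 0 1 1 1 1 0
0 0 0 0 1 0 0 0 0 0 0 0 0
"""


def tissue_name(column):
    """Reduce a column name to its tissue, e.g. 'A_Heart_AG' -> 'Heart'."""
    name = re.sub(r"^\w_", "", column)  # drop a one-letter prefix such as 'A_'
    return re.sub(r"_\w+$", "", name)   # drop the suffix, e.g. '_AG' or '_OG_0288'


def collapse_by_tissue(lines):
    """OR together the columns of each tissue; return (group names, rows)."""
    header = lines[0].split()
    groups = {}  # tissue name -> list of column indices, in first-seen order
    for index, column in enumerate(header):
        groups.setdefault(tissue_name(column), []).append(index)

    rows = []
    for line in lines[1:]:
        if not line.strip():
            continue  # skip blank lines
        values = [int(v) for v in line.split()]
        # A group is 1 if any of its columns is 1 (Boolean OR).
        rows.append([int(any(values[i] for i in idx)) for idx in groups.values()])
    return list(groups), rows


names, rows = collapse_by_tissue(SAMPLE.strip().splitlines())
print(" ".join(names))
for row in rows:
    print(" ".join(str(v) for v in row))
```

To run it on a real file, the lines could be read with `open(path).read().splitlines()` instead of using the embedded sample.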

7.1 years ago
David Fredman ★ 1.1k

Here's a solution using R.

Step 1: Reduce the column names to just gencode or the tissue name (done here in two steps of removing letters with regular expression matches; this could be compacted into a single step).

Step 2: Sum counts across columns grouped by tissue name. This is done here using the rowsum function from base R. Because rowsum sums rows by group and you wish to sum across columns, the matrix is transposed for the calculation and then transposed back for presentation via a nested function call.

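dat = read.table("your_table.txt", header = TRUE) # the sample is whitespace-separated, so read.table rather than read.csv
labels = gsub("^\\w_", "", names(dat)) # remove prefix_, e.g. "A_"
names(dat) = gsub("_\\w+$", "", labels) # remove _suffix, e.g. "_AG" or "_OG_0288"

# rowsum() is in base R, so no library() call is needed
grouped = t(rowsum(t(dat), names(dat))) # per-tissue sums for each row
grouped
(grouped > 0) * 1 # presence/absence: 1 if any library of that tissue has the intron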
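
Once the columns are collapsed to one 0/1 value per tissue, the numbers for the Venn diagram could be tallied along these lines, sketched in Python since the question is tagged python. This is an illustration under assumptions, not part of the R solution above: `venn_counts` is a made-up helper name, and the three input rows are the per-tissue presence values worked out by hand from the first three data rows in the question.

```python
from collections import Counter

# Group names plus per-tissue presence (1) / absence (0), derived by hand
# from the first three data rows of the question.
names = ["gencode.v19", "Heart", "Kidney", "Liver", "Lung", "Stomach"]
rows = [
    [1, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]


def venn_counts(names, rows):
    """Return (introns per group, introns per exact combination of groups)."""
    totals = {name: 0 for name in names}
    regions = Counter()  # tuple of groups an intron is present in -> count
    for row in rows:
        present = tuple(n for n, v in zip(names, row) if v)
        for name in present:
            totals[name] += 1
        if present:  # rows with no intron anywhere fall outside the diagram
            regions[present] += 1
    return totals, regions


totals, regions = venn_counts(names, rows)
for name in names:
    print(name, totals[name])
for combo, count in regions.items():
    print("&".join(combo), count)
```

Each key in `regions` corresponds to one exclusive region of the six-way Venn diagram, while `totals` gives the full size of each set.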