Question

How to use R's loop to get the part of the aggregate matrix other than the many subset matrices?

4

Entering edit mode

3.4 years ago

Chilly ★ 1.3k

I have two matrices of pure numbers, and their format is the same: the first column is the group name, the second is the starting number, and the third is the ending number. Each group is the name of a chromosome. All data were transformed from fasta data. I am trying to generate a blacklist(table3) with the whitelist(table2) by taking the whole genome(table1) as the base.

E.g (*spaces in the following tables represent column change): A row of table1:

scahrs1_999 1 12
#Means the total set of numbers for the 'scahrs1_999' group is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14.

Some rows of table2:

scahrs1_999 2 3
scahrs1_999 6 8
scahrs1_999 11 12
#Means that the subsets of numbers in the 'scahrs1_999' group are '2, 3', '5, 6, 7', and '10, 11'.

The result I want to get is a set of numbers that do not contain any subsets in the total set, and behaves the same as a subset of table2's consecutive numbers. which is:

table3 (results):

scahrs1_999 1
scahrs1_999 4 5
scahrs1_999 9 10
scahrs1_999 13 14

Also exclude subsets with only 1 number. which is:

table3 (final result):

scahrs1_999 4 5
scahrs1_999 9 10
scahrs1_999 13 14

As shown below, I have thousands of groups similar to the above 'scahrs1_999'. Obviously, I can't calculate them one by one. I know that R can loop through each group with a 'loop program' and get the corresponding result. But my programming ability is not up to this complex job.

Could someone give me some advice?

> Table1

1  scahrs1_1001  1     81142
2  scahrs1_1002  1     62661
3  scahrs1_1003  1    104891
4  scahrs1_1004  1     99296
5  scahrs1_1005  1     30919
6  scahrs1_1006  1     97599
7  scahrs1_1008  1     97078
8  scahrs1_1009  1     96958
9  scahrs1_1010  1     45677

> Table2

1  scahrs1_1001      1    753
2  scahrs1_1001  14931  15932
3  scahrs1_1001  17007  18008
4  scahrs1_1001  21211  22212
5  scahrs1_1001  40908  41909
6  scahrs1_1001  63233  64234
7  scahrs1_1001  76009  77010
8  scahrs1_1002   1068   2069
9  scahrs1_1002  12992  13993
10 scahrs1_1002  40448  41449
11 scahrs1_1003   2227   3228
12 scahrs1_1003  18453  19454
13 scahrs1_1003  28679  29680
14 scahrs1_1003  41161  42162
15 scahrs1_1003  41735  42736
16 scahrs1_1003  43040  44041
17 scahrs1_1003  64416  65417
18 scahrs1_1003  71219  72220
19 scahrs1_1003  96090  97091
20 scahrs1_1003  97306  98307
21 scahrs1_1004   1554   2555
22 scahrs1_1004  29086  30087
23 scahrs1_1004  44100  45101
24 scahrs1_1004  47799  48800
25 scahrs1_1004  59550  60551
26 scahrs1_1004  69356  70357
27 scahrs1_1004  71809  72810
28 scahrs1_1004  84272  85273
29 scahrs1_1004  89034  90035
30 scahrs1_1004  98627  99628
31 scahrs1_1005   6695   7696
32 scahrs1_1005  30160  31161
33 scahrs1_1006    298   1299
34 scahrs1_1006  70134  71135
35 scahrs1_1006  93750  94751
36 scahrs1_1008   3859   4860
37 scahrs1_1008   5575   6576
38 scahrs1_1008   7072   8073
39 scahrs1_1008   9342  10343
40 scahrs1_1008  11814  12815
41 scahrs1_1008  15290  16291
42 scahrs1_1008  40167  41168
43 scahrs1_1008  42890  43891
44 scahrs1_1008  44806  45807
45 scahrs1_1008  74442  75443
46 scahrs1_1008  82112  83113
47 scahrs1_1008  93766  94767
48 scahrs1_1008  95233  96234
49 scahrs1_1009   8000   9001
50 scahrs1_1009  37369  38370
51 scahrs1_1009  53086  54087
52 scahrs1_1009  83722  84723
53 scahrs1_1009  90044  91045
54 scahrs1_1010  11341  12342
55 scahrs1_1010  33500  34501
56 scahrs1_1010  34931  35932
57 scahrs1_1010  37937  38938

aggregate r matrix loop subset • 2.5k views

ADD COMMENT • link 3.4 years ago by Chilly ★ 1.3k

0

Entering edit mode

How is this related to bioinformatics?

ADD REPLY • link 3.4 years ago by Ram 45k

4

Entering edit mode

Yes. In fact, each group is the name of a chromosome. All data were transformed from fasta data. I am trying to generate a blacklist(table3) with the whitelist(table2) by taking the whole genome(table1) as the base.

ADD REPLY • link 3.4 years ago by Chilly ★ 1.3k

1

Entering edit mode

Please edit your question and add this information at the beginning.

ADD REPLY • link 3.4 years ago by Ram 45k

4

Entering edit mode

Thx, I have updated it.

ADD REPLY • link 3.4 years ago by Chilly ★ 1.3k

0

Entering edit mode

I am not entirely sure what you are trying to achieve, but I think the Genomic Ranges package, and it's interval operations (e.g. setdiff) respectively looping capabilities will help you a lot to implement it.

ADD REPLY • link 3.4 years ago by Matthias Zepper 5.1k

score 4 · Accepted Answer · 2022-06-01

Usually you want to operate on genomic ranges in R with GRanges objects.

Your example data.

table1 <- structure(list(V1 = "scahrs1_999", V2 = 1L, V3 = 12L), class = "data.frame", row.names = c(NA, 
-1L))

table2 <- structure(list(V1 = c("scahrs1_999", "scahrs1_999", "scahrs1_999"
), V2 = c(2L, 6L, 11L), V3 = c(3L, 8L, 12L)), class = "data.frame", row.names = c(NA, 
-3L))

GRanges answer.

library("GenomicRanges")

gr1 <- makeGRangesFromDataFrame(table1, seqnames.field="V1", start.field="V2", end.field="V3")
gr2 <- makeGRangesFromDataFrame(table2, seqnames.field="V1", start.field="V2", end.field="V3")

result <- setdiff(gr1, gr2)
result <- result[width(result) > 1]

And the result of the above code.

> result
GRanges object with 2 ranges and 0 metadata columns:
         seqnames    ranges strand
            <Rle> <IRanges>  <Rle>
  [1] scahrs1_999       4-5      *
  [2] scahrs1_999      9-10      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

There is a slight difference from your results because it doesn't follow from your data why there should be a range from 13-14.